This is clever, and the incremental approach is a good engineering direction. That said, readers should be aware that the closures approach is a dead end for a large & complex service: unless you can figure out how to serialize the full closure state to data that can be exchanged between machines, you're always going to be RAM-constrained and subject to machine failure.
The loss of state isn't a big deal for a news site, but if you lost the user's state in the middle of a significant process (e.g. dropping the cart mid-checkout), it would be a big problem, even if it was infrequent.
It would be great if a language provided a convenient way to reduce closures to data blobs that could be exchanged between machines or sent via the browser (an untrusted channel that would have to be secured with cryptography). That said, it's not obvious how to capture the semantics of the full closure of the system state without significant work (look at how fragile object serialization / pickling is). Definitely a good area for language research; I'm sure there are advances I'm not familiar with.
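That fragility is easy to demonstrate. In Python, for instance, the standard pickle module refuses to serialize even a trivial closure, because a closure is code plus a captured environment and pickle handles neither (a small sketch, not tied to any particular language proposal):

```python
import pickle

def make_counter(start):
    # the returned function closes over `start`
    def next_value():
        nonlocal start
        start += 1
        return start
    return next_value

counter = make_counter(10)
first = counter()   # 11 -- the closure works fine in memory

try:
    pickle.dumps(counter)
    serializable = True
except Exception:
    # pickle serializes functions by qualified name, so a local
    # function (and its captured environment) can't be pickled
    serializable = False
```

The closure keeps working in RAM, but there is no built-in way to turn it into bytes you could hand to another machine.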
One thing that was exciting about Datomic to me (and still is, I just don't have any projects big enough to use it for) is that you get an immutable database state. You can emulate this in any normal database, but it gets harder to enforce the discipline among other members of your team. The basic idea is that you can serialize those same closures by just pointing to "the state of the database at revision #5968", and then, though the database moves on, you can always use that ID to compute the view of that database at that point. It does the same "heavy lifting" that storing these closures is doing, but you can easily share that ID across a distributed service with no "expired links" problems.
It's worth mentioning that you can't send a closure in a non-functional context. That is, if Alice sends a closure to Bob, Alice's other operations must no longer be able to mutate Bob's state. So you must be serializing an "orphaned" environment tree with a bunch of closures which point at different nodes of that tree.
You could definitely do this even better by stealing some ideas from Smalltalk: encapsulate all of the state in some computational node (the original notion of "object" in OOP) which all of the other parts of the system communicate with by message-passing, and nothing else. To change the code on-the-fly, you just swap out the "code part" of some node for a new code part, and perhaps transform the state, queuing up the messages while you do so; then you can start replaying those messages to the new code. The benefit is that now at any time the nodes can move around servers arbitrarily, as long as you've got a good name-resolution service to tell you where the object is now.
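A minimal, single-threaded Python sketch of that node idea (all names here are invented; in a real concurrent system the swap would overlap with incoming sends, which is when the queuing actually matters):

```python
from collections import deque

class Node:
    """Toy computational node: private state plus a swappable "code part"."""
    def __init__(self, handler, state):
        self.handler = handler          # the code part
        self.state = state              # the encapsulated state
        self.inbox = deque()
        self.swapping = False

    def send(self, msg):
        if self.swapping:
            self.inbox.append(msg)      # queue messages arriving mid-swap
        else:
            self.state = self.handler(self.state, msg)

    def swap_code(self, new_handler, transform=lambda s: s):
        self.swapping = True            # concurrent sends would queue here
        self.handler = new_handler
        self.state = transform(self.state)
        self.swapping = False
        while self.inbox:               # replay queued messages on the new code
            self.send(self.inbox.popleft())

counter = Node(lambda state, msg: state + msg, 0)
counter.send(5)                                        # state == 5
counter.swap_code(lambda state, msg: state + 2 * msg)  # hot-swap the code part
counter.send(5)                                        # state == 15
```

Because the node interacts only via `send`, nothing outside it cares which server it lives on, which is what makes the name-resolution trick work.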
In other words: (1) interpret all the things so that code and data are the same; (2) shared-state is your enemy; serialize orphaned states only; (3) you have to explicitly handle the case where someone makes a request while you are sending their closure to another server.
One difficulty with storing closures is storing and transferring functions, in particular between machines. There is progress in this direction, but it is not easy.
I'm curious if you don't mind sharing, how many people are working full time on the HN codebase right now? I don't know if that number is even less than or greater than 1.
I'm also curious if there's a straightforward way to serialize these closures. I'm guessing no or at least they would not be persistent across restarts.
FWIW, if I were working on HN, I would be concerned that at some point, HN will require more hardware resources to scale, and it may require some fundamental architectural changes, like more separation between app logic and DB. I wonder to what degree you have scaling plans, like if we hit capacity X we would need to replace subsystem Y.
There are two people dedicated to HN development. Daniel and Scott. Four other people from the YC software team also contribute, but not full time because they have to also work on other projects: Nick, Brett, Trevor and myself. Other people on the software team: Garry and Dalton.
We actually just upgraded the hardware for HN very recently. A lot goes on behind the scenes and it's a testament to the team that it's all fairly transparent to the community, to the point where most of you think nothing changes at all. The point is to keep the community focused on contributing, voting on, and commenting on the very best content for hackers.
It's one of the reasons I was delighted to see Dan write about some of the work they do like reducing expired links. It's a rare look at how much thought goes into what feels relatively simple.
Making the complicated look simple is the hallmark of any great success. You guys are doing absolutely stellar work, every now and then I get a glimpse of what is being done and I feel that HN has definitely changed for the better over the last year or so in a technical sense and the moderation transparency has also greatly contributed to the change in atmosphere.
If you think about what magicians do and how they do it, "making the complicated look simple" is also a nice definition of magic.
In other words, if you're ever thinking, "how could she have known that I'd choose the 7 of diamonds?!" you're probably victim of some implicit thought like, "she couldn't possibly have bought 52 decks, pulled out the 7 of diamonds from each of them, and built a deck consisting only of the 7 of diamonds, so that no matter how I shuffled and cut that deck the top card would be the same. No one in their right mind would spend the time and effort to do that." But, that's exactly what she did. She did a lot of abstracted preparation so that the hard work behind the scenes just vanished at the higher level of performance. That's what a good cook does, it's what a magician does, and it's the entire role of "administration" and "middle-men," theoretically-speaking.
> how many people are working full time on the HN codebase right now?
Less than 1. If you include everything HN-related, that number goes up to 2 or a bit more, depending on how much time kogir has.
> I'm also curious if there's a straightforward way to serialize these closures.
Do you mean to disk, or to the client and back? My sense is that it's technically possible but the I/O overhead would make it not practical. However, that's not based on much knowledge. Perhaps we could do it now that we don't have so many.
> I wonder to what degree you have scaling plans
We have a lot of scaling plans, but I wouldn't call them definite.
Wouldn't exactly describe Hacker News as failing. It allowed one person to run a site for a growing community in his spare time and the sacrifice was the occasional expired link for some of its users. Not only do I think it was a fairly elegant solution, I feel like it was certainly a reasonable trade off. I'm sure pg would have been the first to change it had it actually affected things that truly mattered for a community site like Hacker News: story and comment quality.
I've occasionally mused about the possible value of introspectable and serializable closures. Rather than being memory-only, it would be nice if the weight of keeping them around could be palmed off to the browser using cookies or hidden fields.
To be practical, it would require that the activation record chain kept alive by the closure is reasonably short, that the number of live variables in the chain is fairly small, and there be a reliable way of mapping code references in and out. But I think it can be done.
One of these days, I'm going to implement a toy language with this feature combined with my other favourite, automatically recalculated data flow variables (think: "variables" that work like spreadsheet cells). These guys are highly applicable to data binding, and making them a first-order language concept makes them much more elegant to use.
Storing closures on the browser creates dependencies on JavaScript, user browser settings, and implementation details of various browsers. At the time HN was written, IE was common and Google Chrome did not exist. Doing arbitrary work on an arbitrary client increases application complexity significantly. Sometimes there's a big payoff. Sometimes there isn't.
Take a look at Termite for Gambit Scheme; its focus is different from yours, but it seems to be very good at serializing stuff (including continuations) and automatically proxying the rest (e.g. file descriptors).
(I don't think this is actually a good idea - intra-datacenter traffic is much faster, upgrading becomes hard in this scheme, and your security model needs to be quite complicated - but I'd be interested in learning what you find.)
What does serialization, encryption, transmission, requesting, retransmission, decryption, and deserialization gain?
+ A few bytes of memory
- Network latency
- Server side IO latency
- CPU load for encryption/decryption
- Maintaining a client-side code base for diverse client capabilities undergoing rapid change
- Server-side code complexity
- More complex debugging due to more points of failure.
The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information. -- Alan J. Perlis, Epigram 34.
As soon as you have user names with an associated history, you're dealing with state. Functional programming is a style of managing state transitions, not avoiding them.
I'm not suggesting I can get a free lunch. I wasn't talking about avoiding state transitions. I wasn't even talking about pure functional programming. By talking about variables captured by closures, I'm implicitly not doing so.
In the context of server programming, by using the word 'statelessness', I meant that the request (from clicking the link) doesn't need to go back to the same server that created the link. Having an in-memory hash-table containing closures or continuations keyed by an ID in the request implies that request needs to ultimately reach back to the same machine, one way or another.
I'm talking about programming the Lisp-style multi-request process using continuations (as described by pg in his essays), except making them work in a stateless way - without the requirement for hash tables full of continuations that need periodic collecting, with the consequence that the links go stale.
A request handler generating such continuations would, of course, need to be simple and lightweight, not deep or complex. Code artifacts and stateful resources would need to be serialized using keys that contain enough information to find or recreate them. Imagine having server-side URLs for every function, for every resource, where doing a request for a function pointer or a resource could potentially open a database connection or load a library. The keys would be analogous to such URLs.
Consider an activation record. It is the storage for local variables, parameters and the function return location (aka continuation, when you twist your mind around it). The activation record for any given function has a particular signature. Let's say one of these local variables is a Customer object, mapped by an ORM from the database. That is something that can be reduced to an ID; if you know that this activation record only ever has a Customer in that slot, you can potentially serialize that object as a single number.
Keep the chain of activation records short, and they act mostly as a path through a tree to the current state of a process or interaction. Function a calls function b calls function c, which returns a result to the client, along with a continuation (or more interestingly, multiple continuations). A serialized continuation would be little more than a record of the current step in the overall process, everything needed to pick up where one left off. I think, with the right level of language abstraction, this can be made very slim.
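A hedged sketch of what such a slim continuation could look like, assuming a hypothetical step name and an ORM-style lookup table (`CUSTOMERS` stands in for the database; all names are invented):

```python
import json

# Hypothetical "database": entities are re-fetched by ID on resume,
# never serialized into the continuation itself.
CUSTOMERS = {42: {"id": 42, "name": "Ada"}}

def serialize_continuation(step, customer):
    # The activation record's Customer slot is known by signature,
    # so only its ID needs to be stored.
    return json.dumps({"step": step, "customer_id": customer["id"]})

def resume(blob):
    data = json.loads(blob)
    customer = CUSTOMERS[data["customer_id"]]  # re-fetch, don't deserialize
    return data["step"], customer

token = serialize_continuation("confirm_order", CUSTOMERS[42])
step, customer = resume(token)   # picks up exactly where it left off
```

The token is just a record of the current step plus a few IDs, which is the "very slim" representation the argument calls for.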
(PS: I detect a certain amount of exasperation in your tone. I'm not sure what I did to trigger it.)
Thank you for elaborating. It appears that you're thinking about an application in the abstract (e.g. Customer objects) rather than HN in particular. Perhaps that is the source of my dissonance...I don't think it is critical for HN to serve up state in real time, but that's just me.
HN is serving up state, BTW, HATEOAS-style, because your front page is customized to you. The only problem is (was) the expired links: the page is stateful server-side, not just client-side. This article is about making HN more HATEOAS, which IMO means it's technically serving up more state (parameters in the URLs it embeds in the page, for one). My comment was about a way of getting more HATEOAS semantics with respect to statefulness at less programming cost server-side.
That's what ASP.net ViewState does more or less. There's a lot of overhead and little gain IMO. I think it's best to transfer as little state as possible across the wire for simplicity.
It is not. ASP.net viewstate includes a serialization of a set of controls that, server-side, represent the state of the page.
That's not what I'm talking about.
This is why I think I need to write a toy language example - most people, given a brief outline of what I'm talking about, think I'm talking about something else they're more familiar with, and misunderstand.
> automatically recalculated data flow variables (think: "variables" that work like spreadsheet cells)
That sounds kinda like C# Properties, which can be calculated with arbitrarily complex code at get and set time. Is that what you had in mind? It's commonly used with WPF data binding.
Bog-standard getter/setter methods can also run arbitrary code. I guess where data-flow comes in is if GetterA involves calling GetterB then does the system know to automatically call GetterA after SetterB?
HN runs on a single core. Arc is implemented in Racket, which has green threads, so requests don't block on I/O. Whether that counts as concurrent is sort of definitional, but the CPU isn't processing two requests at once.
Interesting to keep in mind whenever a link that makes front page of HN drives 20-30k clicks to some "scalable" web app spread across a couple racks of machines which is then annihilated under the "load".
There are different kinds of overhead, different kinds of inefficiency.
The HN codebase doesn't have the overhead of parsing hidden POST fields (for example), but it does incur a massive RAM overhead to store all that information as closures.
A run-of-the-mill web app, on the other hand, would incur the overhead of passing state back and forth, and perhaps even store sessions on disk. But it might consume less RAM.
What matters is what kind of inefficiency you're willing to tolerate in exchange for what kind of benefits. RAM overhead is a smart choice if you want your app to be very fast and you can afford to use a lot of RAM. A different organization, however, might choose to incur a bit more code-complexity and slower execution in exchange for other benefits. It all depends on what your priorities are.
Yes there are tradeoffs, but I'm not sure this is all that great of an example: a slower application and harder to maintain code are simply not worth it. Computers are cheap - devs and customers are expensive.
Would it be possible to extend this system to allow you to serialize and deserialize the closures? Then you could store them externally in a "real cache" and let that take care of the expiration, etc. You could then also open up a nice security hole :)
Oh! Taking that a step further, what if you mmap'd an empty file first, say 1MB of zeroes. Then, just start writing closures to it, one after another. When you hit 1MB, use mremap to add another 1MB. And just keep going! The pagefile would become a fossil record of the webserver's access history. The earlier in the file, the older the request, and the less likely it would ever be mapped into physical RAM ever again. On a 64-bit machine with a modern hard drive, you could probably go forever :)
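For what it's worth, here's a rough Python sketch of that append-only "fossil record" idea (file name and chunk size are arbitrary; `mmap.resize` is backed by mremap on Linux):

```python
import mmap
import os
import tempfile

CHUNK = 1 << 20  # grow the backing file 1 MB at a time

# Start with 1 MB of zeroes and map it.
path = os.path.join(tempfile.mkdtemp(), "closures.log")
with open(path, "wb") as f:
    f.truncate(CHUNK)

backing = open(path, "r+b")
mem = mmap.mmap(backing.fileno(), CHUNK)
offset = 0

def append(blob):
    """Append one serialized closure; old pages can simply fall out of RAM."""
    global offset
    if offset + len(blob) > len(mem):
        backing.truncate(len(mem) + CHUNK)  # extend the file first...
        mem.resize(len(mem) + CHUNK)        # ...then remap the region
    mem[offset:offset + len(blob)] = blob
    offset += len(blob)

append(b"closure-1")
append(b"closure-2")
fossil = bytes(mem[:offset])
```

Earlier offsets really would be the older requests, and the kernel's page cache naturally stops keeping cold pages resident.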
Yes, but that's a very different kind of system. AWS has to deal with a very large number of servers handling authentication for clients that aren't trusted by the credential owner.
HN has clients with credentials accessing a single server. It would be much easier to just store the serialized data in Memcached or Redis since the system is already centralized and use a token to look it up. Doing so requires less bandwidth and less messing with cryptography.
Personally, I would be ok with a system I am responsible for giving the client a signed or encrypted token. I would not trust many (any?) crypto systems to have signed/encrypted arbitrary code passed to me from an untrusted client.
Note that the client is passing you a token with your signature on it, not the client's signature. This isn't PKI, or even public-key encryption like RSA; this is the client receiving an opaque blob and then passing it directly back to the server, and the server verifying the URL's HMAC to prove that A. the non-HMAC part of the URL is byte-for-byte identical to the one the HMAC claims it is; and B. the HMAC was generated with the server's secret.
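A small Python sketch of that scheme using the standard hmac module (the secret and the URL shape are invented for illustration):

```python
import hashlib
import hmac

SECRET = b"server-side key; never leaves the server"  # illustrative value

def sign_url(url):
    mac = hmac.new(SECRET, url.encode(), hashlib.sha256).hexdigest()
    return url + "&sig=" + mac

def verify(signed_url):
    url, _, mac = signed_url.rpartition("&sig=")
    expected = hmac.new(SECRET, url.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, expected)  # constant-time comparison

link = sign_url("/x?fnid=123&page=2")
ok = verify(link)                                    # valid round trip
tampered = verify(link.replace("page=2", "page=3"))  # any edit breaks the MAC
```

The client never needs to understand the blob; it only needs to echo it back byte-for-byte.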
I really like this approach, myself (it works really well in Erlang, where closures can be serialized like any other term), so I'll argue in favor of it for a bit:
1. OS package managers (especially those that provide automatic security updates) are, effectively, arbitrary code execution limited solely by signature verification. If you don't trust signature verification, you basically can't trust OS update infrastructure.
And these are actually less secure than URL signing, when you think about it: with a signed URL, you are the signatory, and it's very easy to know if you are you. With OS updates, you have to trust the OS manufacturer has itself granted trusts only to the right entities. (Microsoft could put an update signing key from law enforcement into Windows, letting them push automatic wiretap/rootkit "updates" to selected individuals, etc.)
In other words, the security of a system is derived from its weakest link—and there are links far weaker than URL signing.
2. People already do this a ton—deserializing an opaque blob of data signed by the server and then treating it as if it was something just sitting in the server's memory to begin with. Where? In "signed-cookie session storage", the default session mechanism of both Rails and Django. The only difference is that you're putting the information in the URL (where it belongs, in this case) instead of the session—although you could just as well store a continuation table in the session, and then reference it from the URL, if you liked.
Okay, there's also the fact that you're storing a serialized closure instead of a public route—but in business terms, that's no more dangerous to e.g. the valuable information in your database than storing the user's effective UID in signed-cookie session storage, presuming you have administrator-role users in your system with the ability to delete that data.
The one difference might be if the limited set of all your public API endpoints acts as a slapdash "sandbox" for your server, with you trusting that sandbox to protect your system. Which is to say, if your server can do more harm by executing arbitrary code than an administrator user can do by sending messages to it, you should really look into Docker/BSD jails/etc.
I certainly understand all that. What I am arguing is that while such a system is possible, and can be done securely (as you point out: package managers), and this system is very convenient, I would not trust myself to design a system like this for a random web application I was writing.
Regarding your point #2: the difference, at least in my mind, is that I can serialize/deserialize data such as user IDs, tokens, etc. with much greater security than deserializing and eval()ing arbitrary code. If you managed to fake another user's session ID within a signed cookie, you'd do some damage to my application. If you managed to remotely run arbitrary code on my servers, you'd do a lot more damage.
Would you trust such a system if it was built into the web framework you were using, such that the closure signing/serialization/etc. code was thoroughly tested in many environments (e.g. Seaside)?
Or, actually, is it just the signature-checking code you're worried about writing? nginx (among many other reverse proxies) has a battle-tested signed-URL parsing module[1] available as part of its authentication-time processing. In such a setup, you tell nginx which routes need signature protection, and then nginx will only proxy_pass requests on those routes to your app when the signature is valid, stripping the signature off in the process (so they become regular unsigned requests as far as your app is concerned—with the fact that they made it to your app at all telling you they were signed.)
With that architecture, the only code you'll write is the code to generate signed links that comply with nginx's expectations (which there might already be a library to do in your language.) Either way, if you screw that part up, you'll just have a bunch of invalid links, not a security hole.
Yes, if my involvement was limited to just creating a closure and passing it to a library that is secure, vetted and tested, then sure. With a project like HN, I think all of this would be written from scratch, but if e.g. Django had this feature built in, I would be much more comfortable with it.
I thought the same thing. There certainly have been Scheme systems developed that can serialize closures and continuations (e.g. I found [1], and the systems they cite in the related work section). But it seems tricky to get the garbage collection to work right... somehow you need to tell the program which data should be serialized, and which should be stored in RAM and somehow marked as still being live.
For the security aspect, maybe the server could have a secret key and store a MAC along with the serialized string?
Shouldn't it be possible to create a very compact way of serializing the closures, and then use that serialization as a URL parameter?
If you can uniquely identify the function, you can know the variables that are being closed over, so you would "just" have to serialize the function ID + list of values of the variables, and URL-encode it.
Though I haven't done enough Lisp (or even Arc) to evaluate if that's doable without too much effort and fragile magics.
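For illustration, here's a hedged Python sketch of that "function ID + closed-over values" idea (the registry, function names, and values are all invented):

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical registry of serializable "closures": each is a known
# function ID plus the values it closes over.
FUNCTIONS = {
    "next_page": lambda ranking_ts, offset: (ranking_ts, offset + 30),
}

def closure_to_query(fn_id, *closed_over):
    # function ID + closed-over values, URL-encoded
    return urlencode({"fnid": fn_id, "env": ",".join(map(str, closed_over))})

def run_from_query(query):
    params = parse_qs(query)
    fn = FUNCTIONS[params["fnid"][0]]
    env = [int(v) for v in params["env"][0].split(",")]
    return fn(*env)

query = closure_to_query("next_page", 1697040000, 30)
result = run_from_query(query)
```

In production you'd also sign the query string, since the client could otherwise pick the function and environment at will.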
This seems like a pretty good example of perfect being the enemy of good. Would there be that much of a usability conflict if /?page=2 30 minutes ago was slightly different than /?page=2 now?
Not to mention that if a scale of time that large has passed, there's a high chance that the closure-style links expire anyways
Agreed. I hated the "expired link" issue. It constantly sent me back to the front page. I would much rather see a newer version of page 2 than to be forced back to the beginning (after a backspace and a refresh).
Yes, exactly. I'm fine with an imperfect solution to this problem, but I think that seeing a page N which was not the page N I would have seen had I clicked the link 30 minutes ago is far preferable to getting a silly error and being forced to navigate back to the front page.
It's not just about the story stream; it was also about different users needing to see different things based on settings, permissions and functions outside HN...until recently, the software that powered YC and the software that powered HN were the same.
Like Dan wrote, "Getting rid of those closures was a pain, and it made the code more complicated." The tradeoff pg made was for a bit of inconvenience in user experience for a much simpler architecture that allowed him to run HN as a side project while also running YC and writing the essays we've come to admire.
You might have made a different decision, sure, but I can't say it wasn't the right call for his circumstances.
Disclaimer: I haven't looked at the code, I'm making an educated guess here.
I think it's implemented like this:
function renderPage($stories) {
  renderHeaders();
  renderStories($stories);
  renderMoreLink(storeLink(function() {
    renderPage(next30($stories));
  }));
}

function handleLinkClicked($id) {
  $code = fetchLink($id);
  $code();
}
Using storeLink() you save a function that can be executed later when the user clicks a particular link (remember, HN runs as a single process and does not restart itself every time like, say, PHP does). This function remembers the context in which it was created - in this case, the $stories variable - so all the data required to fulfil the request at a later time gets stored with the function.
At a guess (after 1 min grep), this is the original code (from an old copy of the HN src). afaics, that's the code to render the entire table for all the stories (plus the 'more' link), including the html (note the inline 'tr' tags etc).
The point is that all the variables you want for rendering "page 2" are right there in scope.
A closure can capture those, so you don't have to do any extra work serialising them, sending them to the browser, validating them on return, etc.
Closures can seem odd the first time you come across them, but they're pretty mainstream these days (perl, python, ruby, C++, C#, golang all have them (some of them for a very long time :-), as well as the more traditionally-functional languages).
On the HN front page most of the links are static and the important dynamic link is "More".
At one end of the spectrum of asynchronous design, the behavior of "More" is determined after the user has clicked on it. At the other end of the spectrum, the "More" link's behavior is determined at the time the page is constructed. HN uses the latter approach.
Assuming a click on "More", with early binding, the application only reads its state once (at the time of page construction). With late binding it reads its state twice, once for page construction and again to construct the next page.
With early binding, only the name of the function has to be passed back immediately, construction of the function itself can be queued depending on server load (e.g. the function can be built 100ms later) and the user is likely unaffected. With late binding a 100ms delay in page building due to server load happens while the user is waiting.
The cost of the continuation approach is that a lot of continuations may be constructed that never get run. That's often less of a worry for people with garbage collectors.
function adder(x) {
  return function(y) {
    return x + y;
  };
}

var plusthree = adder(3);
var plusnine = adder(9);

plusthree(2); // 5
plusnine(3);  // 12
plusthree(4); // 7
adder is a function that returns a closure that closes over x. If the function that's returned were a normal function rather than a closure, the x that you pass to adder would not be within the scope of the inner function, and you wouldn't be able to reference it. Calling plusthree or plusnine would simply fail. However, because the inner function is a closure, it remembers x.
Every time you call adder, a new closure is created, with its own version of x, so you don't have to do anything extra like creating an object to store it in. The language runtime takes care of all that boring stuff for you.
Of course, I may be overlooking something, but this sure seems like a case of overeducated engineers overengineering a solution that then underserves users relative to a naive solution.
Individualized closures to keep track of what each user has already seen? Why not just have a single, current ranking of stories for everybody and let me pick how many of them I want on each page (up to a point). So, I have a preference that says I want 100 stories per page. Well then I get 3-1/3 pages worth of non-repeating stories without all the computer science. If I later click the "more" button, I get the second hundred, whatever they happen to be at the time I click "next".
That's easy enough for me to deal with. If it has only been a few minutes since I loaded the front page, most of the second page will be new to me. I don't care if a few of them aren't. I'll just skip past them. I'll have plenty of titles to scan over. But if it has been, say, a day since I loaded the front page, instead of clicking "more", I'll probably just reload the front page. I can take care of that myself. No big deal.
With the new API, of course, we can now just build it ourselves, but all this engineering sophistication that resulted in a fixed, 30-article front page and the maddening inability to ever get far past the first 30, reminded me of Ted Nelson's Xanadu Project, which was so cleverly designed to prevent dead links that it never went anywhere, while the naive Web (dead link?, oh well) changed the world.
My impression is that Hacker News is largely a case of a single over-educated engineer hacking up a side project. Closures were duct tape...not in the "Alabama chrome" sense, but in that they are a stable, well understood technology. Files 'on disk' rather than an RDBMS are a similar engineering simplification.
Sometimes HN chars a bagel. It's a toaster not a microwave oven.
Hacker News was not meant to be something actually used; it was originally designed to be a way to test Arc[0]. Additionally, the code was meant to be as short as possible[1].
> Of course, I may be overlooking something, but this sure seems like a case of overeducated engineers overengineering a solution that then underserves users relative to a naive solution.
Actually, I'd say that saving and restoring closure _is_ the naive solution. You write code as if you were serving only a single user.
Ted Nelson's Xanadu never went anywhere because they simply didn't release. HN is here to serve you and apparently it serves you well enough for you to be here to make snide comments at those that work to make your life a bit better.
If you are not on the inside of a problem it's super easy to tell the people that are how you'd do a much better job of it if you were the one in their place. To all those people I would say: prove it and build a website that serves the HN audience better than HN does right now. I'm sure you'll go places with that.
If you dropped all the value judgements on the people that do the work in your first paragraph then your comment would be a much better one.
"All the value judgments" you accuse me of apparently means a single word, "overeducated", which I used as a tongue-in-cheek synonym for "perhaps a little too sophisticated for their own good in this case". I think you're being a little oversensitive about it, but just in case, let me formally (and literally, not sarcastically) express my gratitude to the makers of HN for creating a website of great value to me.
Now, having said that, this is a website where the discussion of the pitfalls of one dev approach vs another is why we're here, and I was seriously suggesting that this might have been a case where the naive approach of a "less-educated" developer might have worked even better. I wasn't claiming to be a less-educated developer myself (ahem). ;-)
As I said, I could easily be overlooking something, and I won't be building my own Hacker News in outrage over the issue, but it still seems to me that just letting me have more items at a time on each page would solve the problem well enough (for me) with no need for anything fancier.
What may be overlooked is that continuations are a useful way of storing state when using a stateless protocol such as HTTP or when dealing with asynchronous communications in general.
They are useful because they are lightweight. There's no serialization/deserialization, no read/write from persistent storage, no additional tiers, no additional layers. It's a simple mechanism that's provided directly by the programming language in which the application is written.
Out of curiosity, what problem do you experience due to the 30 item constant?
It's not the continuations which are the problem, it's relying on state at all. As the submission describes, storing the state used too much RAM and required pruning. The solution was to reformulate the problem so as not to require state, if possible.
I don't claim expertise, but my understanding is that it is considered difficult to compute without states because the critical question is, "What is the next state?"
Continuations are a way of answering that question asynchronously in regard to the timing of the invocation of the transition function from the current state to the next state. The cost/benefit is the absence of shared global state. My continuation is different from your continuation.
The alternative is to determine the next state synchronously with the invocation of the transition function for the next state by reading a shared state. If I invoke the transition function at time t1 I will get the same result as you would get if you invoked the transition function at t1...assuming of course that we did not both invoke it at the same time [then it's just a matter of convention as to who gets what].
Except that stories can move up and down the list over time, and users do care. Slow browsers like me were hitting dead links precisely because we were slow.
Yes, you can still store this much more centrally and cheaply than closures. You can keep all stories with a timeline and do a sort using next-thirty-at-this-timestamp, and despite visiting every story and then sorting, I'd expect it to be snappy enough for a heavily hit site. Furthermore, you could speed it up with linked lists instead of timestamps too, etc.
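A rough sketch of that timestamp-based paging, with invented story IDs and scores (a real ranking would be far richer):

```python
# Rank stories "as of" a timestamp, so a page-2 link carries only a
# timestamp and a page number and can never go stale.
# STORIES maps story id -> time-ordered (timestamp, score) samples.
STORIES = {
    "a": [(100, 50), (200, 10)],   # fell down the ranking over time
    "b": [(100, 20), (200, 40)],
    "c": [(100, 30), (200, 30)],
}

def score_at(history, ts):
    # last sample taken at or before ts
    score = 0
    for t, s in history:
        if t <= ts:
            score = s
    return score

def page_at(ts, page, per_page=2):
    ranked = sorted(STORIES, key=lambda sid: -score_at(STORIES[sid], ts))
    start = page * per_page
    return ranked[start:start + per_page]

front_then = page_at(100, 0)   # ranking as the slow reader saw it
front_now = page_at(200, 0)    # ranking for a fresh visitor
```

The server holds one shared timeline instead of one closure per reader, and any machine can answer the request.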
If I take 20 minutes to browse the first 30 links, and then click "next", do I really want "next 30 as of 20 minutes ago"? If I'm on the second page and hit reload, what's the expected behavior? "Second 30 as of 20 minutes ago" or "second 30 right now"?
I can kinda-sorta see how this implementation might be better in some corner case, but overall it leads me to a confused state about just where I am in the stream.
Disagree. Reddit & (the old) Digg both use a naive implementation that repeats stories and they have/had millions more users than HN (and nobody complained).
I seem to recall that people complained constantly. Reddit at least, if you click the "more" link, at least half the time you would get "there doesn't seem to be anything here." And you bet people complained about that one.
I am the last person to advocate a rewrite but such a thing seems appropriate at this point. Has anyone attempted that in rails, Python, node? Considering the source code is available, should be feasible?
Is there a good reason to use pastebin for these type of posts instead of just putting all of the info in the HN post? The only thing I can think of is that it avoids being grouped into the ask section which I think has some disadvantages.
I originally made it a pastebin link because I didn't want to post a huge off-topic thing into the API thread. Kevin simply linked to that.
There's another reason, though: what I wrote was too long for an HN text post (I forget what the limit is, but this was definitely over it) and I didn't have time to make it shorter.