
> I wonder why we went down that whole async/await craze with so many languages?

We addressed this, albeit very briefly, in the Alternatives section of the JEP: https://openjdk.java.net/jeps/425

There are multiple reasons:

1. Languages that don't already have threads have an implicit assumption built into all existing code that program state cannot change concurrently. This is why scheduling points need to be marked explicitly, and why adding threads — whether user-mode or OS — might break existing code in some very tricky ways. That's the case of JavaScript.

2. Some languages target an IR without control over its backend, making an efficient implementation of user-mode threads difficult, if not impossible. async/await requires only changes to the frontend compiler. That's the case of Kotlin, and, perhaps to a lesser extent, Rust.

3. Some languages have technical features that make implementing user-mode threads efficiently more difficult than in others. Pointers into the stack and careful control over memory allocation make this more challenging in languages like C++ and Rust than in Java.



Ron, thank you for your work on Loom, it’s very exciting and I’m looking forward to using it in production code!

As an aside, Ron gave a JUG talk six months ago that I found really helpful, where he went into more detail about why they chose this approach: https://youtu.be/KmMU5Y_r0Uk (the 27m20s mark; from 2m50s there's a more general introduction to Loom).

I’m sure there are other videos/papers as well, but this was a pretty good overview of Java vs other languages’ approach to async.


I think you are missing a key benefit of async/await: it can be implemented incredibly efficiently, because it is "stackless". In other words, you know the exact amount of "stack" space required and can allocate exactly that, instead of allocating a real stack, which can be very much larger.

For example, if I want to implement a "sleep" with async/await, I probably only need to store the wake time as state; if I want a virtual thread to do the same, I likely need to allocate a large stack just in case I use it.
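
A rough Java sketch of the difference (the async side hand-rolled with a scheduler, the thread side using the Loom preview API from JEP 425; the class and method names are just illustrative):

    import java.util.concurrent.*;

    class SleepDemo {
        static final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();

        // "Stackless" style: the only per-sleep state is the scheduled wake-up;
        // no stack is retained while waiting.
        static CompletableFuture<Void> asyncSleep(long millis) {
            var done = new CompletableFuture<Void>();
            timer.schedule(() -> { done.complete(null); }, millis, TimeUnit.MILLISECONDS);
            return done;
        }

        // "Stackful" style: the virtual thread parks inside Thread.sleep and keeps
        // whatever stack it has grown by that point.
        static Thread sleepingThread(long millis) {
            return Thread.ofVirtual().start(() -> {
                try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
            });
        }

        public static void main(String[] args) throws Exception {
            asyncSleep(100).join();      // waits; only the timer entry exists meanwhile
            sleepingThread(100).join();  // waits; the parked virtual thread keeps a stack
            timer.shutdown();
        }
    }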

Of course this can be mitigated with stack caching, small stacks, segmented stacks or other tricks. But doing this is still more expensive than knowing how much "stack" you need up-front and allocating only that.


That's not a benefit of async/await as the same could be done with user-mode threads. In fact, that's what we do with virtual threads. But it might be a benefit of async/await in some particular languages.


> But it might be a benefit of async/await in some particular languages.

Rather than saying it's a benefit for particular languages, I'd say it's a benefit in particular contexts, e.g. in contexts where you don't have a heap. Of course it's true that some (most) languages don't support such contexts at all (for a host of good reasons), but the languages that do are shaped by that decision.


The use case of interest here is having many concurrent operations (hundreds of thousands or millions). If you don't have a heap, where do you store the (unbounded number of) async/await frames? There are other use-cases where stackless coroutines are useful without being plentiful — e.g. generators — but that's not the use-case we're targeting here (and is probably a use-case of lower importance in general).
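
For concreteness, the shape of that use case is roughly the example from the JEP, scaled up (using the preview API; the exact numbers are just illustrative):

    import java.time.Duration;
    import java.util.concurrent.Executors;

    public class ManyTasks {
        public static void main(String[] args) {
            // A very large number of mostly-blocked concurrent operations; each
            // parked virtual thread keeps its (small) stack on the heap.
            try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 1_000_000; i++) {
                    executor.submit(() -> {
                        Thread.sleep(Duration.ofSeconds(1));
                        return null;
                    });
                }
            } // close() waits for the submitted tasks to finish
        }
    }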

Many languages/runtimes want just a single coroutine/continuation construct to cover both concurrency and generators — which is a good idea in principle — but then they, especially low-level languages, optimise for the less useful of the two. I've seen some very cool demos of C++ coroutines that are useful for very narrow domains, and yet they offer a single construct that sacrifices the more common, more useful, usage for the less common one.

There was one particular presentation about context-switching coroutines in the shadow of cache misses. It was extremely impressive, yet amounted to little more than a party trick. For one, it was extremely sensitive to the precise sizing of the coroutine frames, which goes against the point of having a simple, transparent language construct; for another, it only simplifies small pieces of code that have to be very carefully written and optimised at the instruction level even after the simplification.


Yes, I am (perhaps a bit sloppily) using "particular contexts" to refer to particular use cases. And while your use case is the C5M problem, since we're bringing up other languages (which optimize for different contexts) I think it's worth emphasizing that these features also lend themselves to other use cases. Here's an example of using Rust's async/await on embedded devices, for reasons other than serving millions of concurrent connections: https://ferrous-systems.com/blog/async-on-embedded/

> Many languages/runtimes want just a single coroutine/continuation construct to cover both concurrency and generators — which is a good idea in principle — but then they, especially low-level languages, optimise for the less useful of the two.

Notably Rust appears to be the opposite here, as it is first focusing on providing higher-level async/await support rather than providing general coroutine support, but its async/await is implemented atop a coroutine abstraction which it does hope to expose directly someday.

I'm sure you don't need to be told most of this, but I bring all this up to help answer the more general question of why not every language builds in a green thread runtime, and why one approach is not necessarily strictly superior to another.


If generators or embedded devices that don't have threads are indeed the reason for picking one design over the other, the question then becomes why did some languages prioritise those domains over more common ones, even for them?


Indeed, to which the answer is: it's a dirty job, but somebody's got to do it. :) As long as C exists, it's worth trying to improve on what C does without giving up on C's use cases. Of course, that doesn't mean that all use cases are equivalently common, nor does it mean that a language like Rust will ever be as widely used as Java, nor does it mean that Java was wrong for integrating virtual threads (I think they're probably the right solution for a language in Java's domain).


A common theme in Rust development is the notion that no one could produce more optimal code by hand. This is a great feature, but in the case of async/await we are sacrificing a lot to get it, to the extent that a user trying to make their first HTTP request with reqwest will now get conflicting documentation and guidance on whether they need tokio and other packages to pull in async.


Can you explain how this is done? Is the current stack copied onto the heap (to the size it currently is)? How are new frames allocated once a thread is suspended?


A portion of the stack is copied to the heap when the virtual thread is suspended, and upon successive yields those "stack chunks" are either reused or new ones are allocated, forming a linked list. When resuming a virtual thread, however, we don't copy its entire stack back from the heap to the stack; we do it lazily, by installing a "return barrier" that patches the return address, so that as you return from a method, its caller (or several callers) is lazily "thawed" back from the heap. This copying of small chunks of memory into a region that's likely in the cache is very efficient.

The entire mechanism is rather efficient because in Java we don't have pointers into the stack, so we don't need to pin anything to a specific address, and stacks can be freely moved around.
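
None of this machinery is visible to user code, by the way. A blocking call inside a virtual thread just looks like an ordinary blocking call; a sketch using the preview API (the URL is a placeholder):

    import java.net.URI;
    import java.net.http.*;

    public class BlockingInVirtualThread {
        public static void main(String[] args) throws Exception {
            // send() blocks the virtual thread; while it waits, its stack chunks are
            // frozen to the heap and later thawed lazily via the return barrier.
            Thread t = Thread.ofVirtual().start(() -> {
                try {
                    var client = HttpClient.newHttpClient();
                    var response = client.send(
                            HttpRequest.newBuilder(URI.create("https://example.com")).build(),
                            HttpResponse.BodyHandlers.ofString());
                    System.out.println(response.statusCode());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            t.join();
        }
    }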


I wonder about the implications and opportunities for https://github.com/microsoft/openjdk-proposals/blob/main/sta...


Apparently it's already being taken into consideration:

> The optimization should work with Project Loom when it becomes available.


> It can be implemented incredibly efficiently.

At the cost of breaking: the conceptual model of concurrency; debugging; performance analysis; tracing; logging.

but yeah...great stuff.


This is a reason that is brought up once in a while, but even after working 10 years in the domain of high-concurrency services I've never seen compelling data that clearly shows whether stackless coroutines are more efficient than stackful ones. Unfortunately, people rarely write applications in both styles to find out.

While the "stack is optimally sized" argument exists, it might not always hold: implementations can end up reserving far more memory for the "virtual" stack than is actually required, due to implementation challenges. That applies in various situations in Rust, for example. Then there is the more classical stackless approach of allocating state for each callback on the heap (as when you manually write Boost.Asio code), which has quite a bit of allocation and memcpy churn. And besides that, a "virtual stack" might be more fragmented and less cache-friendly than a contiguous stack, which also impacts efficiency.


OS virtual memory ensures that the overhead will not be that big. The OS allocates memory page by page as the software touches the corresponding virtual addresses, so a thread stack uses only as much memory as its maximum stack usage requires (rounded by page size). Async/await is of course more efficient, but in the real world native stacks might be good enough, especially when RAM is not very expensive.


"Rounded by page size" is a pretty huge caveat here, though, no? With a 4 kB minimum page size on most platforms, 5M threads is 20 GB of stack virtual mappings, minimum. And cycling through those threads even once will make every page of that 20 GB resident.


Maybe. Realistically it won't matter, because any real-world server would either need a lot more memory anyway to actually handle application-specific concerns, or would support a much lower number of clients. Keep in mind that even with a tiny send and receive buffer of 16 kB, plus maybe some TLS state of more than 30 kB per connection, the baseline memory usage of doing anything useful is already much higher than 4 kB, unless the only thing you want to do is build a large-scale TCP ping service.


That still means I can effectively use 5 million threads on a small server, which is about three orders of magnitude more threads than I can currently run with Java.


macOS/iOS is a popular platform where the page size is 16KB and RAM is moderately-to-very expensive.

It might be interesting to try something like Mesh (https://github.com/plasma-umass/Mesh) to share pages.


macOS/iOS aren't a realistic server platform for high loads. They don't even have syncookies, so anything TCP is out.


How does async/await mitigate #1? Interleaved execution is enough to give you data races; you don't need actual parallelism.


Yes, but it requires a special call site (transitively, all the way up the stack) that permits the interleaving, and so cannot sneak into existing code that might implicitly assume no interleaving.


But good old callback-based code still allows for interleaving, and AFAIK JS doesn't require any call-site annotation for that.


It does not allow for interleaving. Interleaving means that state can change in the same subroutine.


What I mean is that a subroutine can observe its own state being changed even after a call to a function not marked async, if that function directly or indirectly calls into a closure closing over that subroutine's state.

I.e., IMHO async offers very weak reentrancy guarantees that are better enforced via other means (Rust-like lifetimes, immutability annotations, atomic constructs, etc.).


A major issue with Loom is that it consumes much more CPU: https://github.com/ebarlas/project-loom-comparison/blob/main... Edit: no, it is actually more efficient, although its CPU consumption is surprisingly high at higher throughput than the others.


Wouldn't that be expected when it also delivers more throughput and better latencies? It's handling more requests concurrently, so I'd expect the CPU usage to be higher; how else could it serve more requests faster?


Yes indeed, I just find the consumption increase a bit abrupt after 10K.


Looking at the graphs, it uses less CPU for a given throughput, so it's actually more efficient for CPU. It also provides lower latency and higher max throughput. It does seem to require more memory, though.


We expect to improve the memory consumption significantly in future releases. Some things had to be cut to make this release.


The existing data already looks excellent. I wonder if you could leverage SIMD / the Vector API to speed some things up, or whether value types will have an impact.


All these are implementation details.

Programmers should be seen as "users" of the language.

What you give here is a list of excuses for why system X doesn't do what is best for its users.


Implementation details are often also de facto features, because the behavior may be relied upon by users. Unless you designed the language up front to consider these things, it's often very much a challenge to tell your users that their code is broken, especially if you did not have a language specification clarifying it.

For languages like Python, for example, this is a big issue, and a reason why alternative concurrency patterns to async/await haven't made much progress.


Totally agree, but that was not the point of the GP.


As I grow older, implementation details are all that matters to me in the end. I hate the Go language; IMO it's ugly and terrible to work with. But its compiler and toolset are golden, and I'll use it just because of its implementation details. I don't have time to wait until language developers implement the implementation details I need, if ever. I need to ship software tomorrow.


The compiler and toolset are the user-facing aspects of a language like Go; how the parser works or internal functions in the standard library would be the implementation details.


Hmmm. Implementation details are the things that are not visible to you as a "user" of the language and the toolset.

I don't know why people confuse this so much.


I would say that implementation details could be visible to you as a user, but should not be relied upon because they are not part of the documented API.

E.g. it might be visible to you that a certain operation runs quickly on certain inputs, or that a particular output is chosen for a particular input, even though the documentation does not specify the exact output.


No, explicit pointers are not an implementation detail. Nor is knowing when you are on an OS thread in a language where you are likely to integrate with OS functions that depend on what thread you are on. Different languages exist for different purposes requiring them to solve problems in different ways.


Two out of three arguments of the GP are implementation details. You can tell that they are implementation details by the way they are phrased:

“It’s difficult to do A given B”

As a user I don’t care about B. I just want A.

For your example, the concept of explicit pointers are orthogonal to threads. That’s why you can have OS threads with explicit pointers.

Just because it’s difficult to make them work together doesn’t mean they are incompatible as concepts.

It’s still an implementation detail.

For example, in the Windows 95 days one could argue it was impossible for a crashing program not to take down the whole system.

As a user I don't care what's going on under the hood; I just don't like it that my Windows 95 app can crash the system. It is an implementation detail.


Damn folks, downvoting is not meant for showing your disagreement.

Targeting a specific IR is an implementation detail.

Having explicit pointers and threads are not mutually exclusive concepts. It's the implementation details that make combining them difficult.

HN can be so annoying sometimes


It's not about disagreement with your opinion, it's about the rudeness with which you presented it.


Choices come with trade-offs, and those in turn make languages suitable for different use cases and users. There's no uniform "best for its users" choice that applies to all languages.

Well, ironically, the one exception that applies to all languages is "does my code still work?"... which is what pron was addressing.


I can buy the argument that async/await is a design choice, in which case it is what is "best for the user".

But the GP replied to a comment that was claiming threads of execution are better than async, so in the context of the reply, threads are "best for the user".



