When I first started using AWS a few years ago, having known generally what it was for far longer, I was flabbergasted at how slow it was to get an instance booted. I expected it to take much less time, thinking about things from first principles, even if you're literally talking about cold booting a physical machine via IPMI. But it seemed like everyone accepted that as the way it was, and now I do too. So I'm glad people are still interested in making things fast.
Right now I'm doing Postgres stuff (RDS) and dealing with it taking 10+ minutes to boot a fresh instance. I'm tempted to try out fly.io and their Postgres clusters but I'm afraid I'd be spoiled and hate my life after (my job has me stuck in AWS for the interminable future).
I would be interested to know where all that time is being spent on the AWS side. To be a fly on the wall seeing their full, unfiltered logging and metrics.
EC2 has historically not focused much on instance boot time. We did for GCE and drove it down pretty heavily. The post here from fly has a good set of sequence diagrams for "what are the various phases of creating an instance from scratch" that are generally applicable.
I'll note though that different users have different targets. Some people care about "time from request to first instruction ticks over" while others only care about "time from request to ssh'able from the public internet". There's an interesting middle ground of "time from request to being able to talk to other services like GCS or S3".
It's not clear to me what the networking / discovery story is for a Fly Machine that is stopped and then starts. That is, how long does fly-proxy take to update (globally? within a metro?) to add and remove the new Fly Machine? I vaguely recall that only external endpoints support IPv4, so I assume Fly is reserving and registering the internal IPv6 endpoints in the more expensive "create" step and then "start" is just about propagating liveness.
My main gig runs workloads primarily on AWS, but I work with a small company as well that is completely on GCP and I gotta say the difference is night and day in instance allocation and startup time. GCE is so much faster it's infuriating when I've gotta go back to work and sometimes have to wait more than 10 minutes in the worst case for an EC2 instance to finish booting in an EKS cluster that's already running!
> I'll note though that different users have different targets. Some people care about "time from request to first instruction ticks over" while others only care about "time from request to ssh'able from the public internet".
This is the same target: a machine (that usually has only a single app on it) shouldn't take more time to boot than a general-purpose consumer PC/laptop.
The reason it takes so darn long to start in so many cases is just how horrendously overcomplicated the whole cloud setup is, internally and externally (sometimes for good reasons, sometimes because we don't know better, and sometimes because it really is overcomplicated and overengineered).
> shouldn't take more time to boot than a general-purpose consumer PC/laptop.
That's an incredibly easy target. VMs can and should boot much faster than that - just look at the Firecracker hypervisor.
Even with KVM, if you replace systemd with something small and simple [0] (which you totally should, for single-app VMs), boot times of a couple of seconds are within reach.
I'm still sad nanokernels, or nanokernel-like systems, never took off.
I also remember clicking around Ling (Erlang on Xen, sadly no longer active [1]) where the whole VM could boot up, service the request, and shut down in less time than it takes a cloud to start spinning up an instance :)
> We did for GCE and drove it down pretty heavily.
As a heavy GCE user, something weird about GCE to me is that instance boot times can be extremely variable — predictable within a given instance group, but unpredictable with even very small changes to e.g. instance sizes within the same instance family. (And this isn’t the instance blocking on getting scheduled onto a hypervisor; I can recognize that point, because it’s when any quota limits hit to potentially kill the instance provision. That phase of the delay is very stable.)
The variable delays also seem to apply to “reset” of the instance (which won’t involve an even-temporary deschedule, as reset keeps NVMe state) — but not to kexec-reboot, if one opts for that.
Is GCE built on a hybrid of two different hypervisor systems with wildly differing boot-time performance characteristics, where subtle tweaks of your instance config determine which hypervisor you get? Maybe one that relies on a hardware offload for something (PXE kernel signature verification?) that the other does purely in software?
One thing that’s clear to me is that instance types introduced after a certain point (e.g. all n2d instances) are always of the fast-booting type. So I’m guessing this is just old hypervisors being stuck on some legacy config because of long-term customers partially pinning those host machines with workloads that somehow prevent live migration (workloads with NVMe or GPU), and so make those hypervisors unable to be drained for a hard-upgrade.
There's not much about how our system operates that we're unwilling to talk about. But as you're probably aware, the broad questions of how networking and discovery work are big, so I wouldn't know where to start. Feel free to shoot questions at us, though!
I'm not intimately familiar with all of the AWS infrastructure around a running EC2 instance, but I imagine a lot of time is spent on creating the associated Elastic Block Store volume, allocating an Elastic IP address, creating an Elastic Network Interface, creating a default security group, etc., then attaching all of those things to the instance, then attaching the required resources to an Internet Gateway, and so on.
I think that's an artifact of RDS specifically. It's dreadfully slow. An EC2 instance will launch and have SSH connectivity in 17-22 seconds in my testing. (I was testing this a fair bit a while back for a silly idea I had)
There's something about the tone and content of fly.io blog posts that makes it impossible for me not to root for them. (It also helps that the DX is so great.) I've only had a chance to deploy toy apps to Fly.io, nothing at scale, yet, but it checks all my boxes.
Fuck no. If you asked me whether I valued Fly.io more than my HN account, I'd have to think about it. I have, uh, an HN problem. We actually hoped not to see this post on the front page! We have a mode of writing for HN ("has to be interesting for people who will never use Fly.io") and there's a Fly Machines post in the works that fits that model. At any rate: I had very little to do with this post; if you liked it, you like Chris Nicoll, who writes for us professionally. And Kurt, of course, who wrote the original guts of this post, and also has been beating the Fly Machines drum inside of Fly.io for most of the last year.
The DX is great until you try to use their REST API (not for machines): the link Google gives you to their docs is a very, very incomplete page with even the base URL obsolete, and you're left browsing the forum to understand why your requests don't work ;)
Now they've got my attention. This is incredibly difficult to execute on. Kudos to the team there who figured it out. If fly is or can become profitable then they've got a chance at being around for a long time. I can see them as the new cloudflare.
> Fly Machines will help us ship apps that scale to zero sometime this year.
I think this is what will make Fly really exciting. Right now (if I understand right) you need to be paying for a VM 24/7 in every region you want your app available in, because it only scales down to 1. So it runs apps in regions close to users that you're willing to pay for 24/7. If they make scale-to-zero work in every region, then maybe you can just make every app global and if you have some occasional users in Australia then it can just spin up over there while you're getting requests. I think it's what will make many-regions feasible for every app.
I honestly don't understand what's going on here. I thought we turned to Docker/containers because VMs were too heavy? Now we've got VMs that run Docker? (Not trying to be dense - what is the advantage?)
IMHO with fly.io their use of containers is more for the dev experience. It's incredibly easy and popular to whip up a Dockerfile, test it locally, ship it to a registry, etc. Anyone can learn the Dockerfile syntax and be productive with it in an afternoon.
The tooling for proper VM creation, on the other hand, is comparatively stuck in the stone age--there are just a few tools like Packer, or a frankenstein of Ansible scripts, and neither is as nice or easy as Dockerfile creation.
It's a frankenstein system of provisioning though. You have to write a config file in Ruby and embed bash scripts, which may or may not do things like invoke Ansible scripts (written in YAML). The layers of complexity are immense, and that's before you have to start janitoring VirtualBox's always half-broken state. The inner loop of change Vagrant config -> see result in VM is painfully slow too, waiting for the VM to tear down and come back up again.
Docker is much, much simpler. Write a Dockerfile that's mostly just bash or shell code. Build and run in seconds to immediately see the results. Once it's working push the container image to a public registry and you can distribute it to anything.
Firecracker is a thin layer on top of KVM. It essentially implements just a handful of devices and it boots in milliseconds. Fly bakes a Docker image into the format that Firecracker expects and then boots it, alongside a bunch of anycast networking magic. You get the security guarantees of KVM with the developer experience of docker.
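To make "a handful of devices" concrete, here's a rough sketch of the small sequence of API calls a Firecracker boot involves (it's configured over a Unix socket with a few PUT requests). The kernel/rootfs paths and boot args below are illustrative, and this just builds the request payloads rather than talking to a real Firecracker process:

```python
import json

def firecracker_boot_requests(kernel, rootfs, vcpus=1, mem_mib=256):
    """Build the (method, path, body) tuples for a minimal microVM boot."""
    return [
        # Size the VM first...
        ("PUT", "/machine-config",
         {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        # ...point it at a kernel (no BIOS/UEFI phase, hence the fast boot)...
        ("PUT", "/boot-source",
         {"kernel_image_path": kernel,
          "boot_args": "console=ttyS0 reboot=k panic=1"}),
        # ...attach the root filesystem (the baked Docker image, in Fly's case)...
        ("PUT", "/drives/rootfs",
         {"drive_id": "rootfs", "path_on_host": rootfs,
          "is_root_device": True, "is_read_only": False}),
        # ...then start the instance.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]

for method, path, body in firecracker_boot_requests("vmlinux", "image.ext4"):
    print(method, path, json.dumps(body))
```

Four requests and the guest is running, which is a big part of why millisecond-scale boots are even possible.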
Not quite mentioned in other answers, but historically, the VMs we used to run were heavy. A typical qemu-kvm machine needs to actually boot up, initialise, start a number of services, etc. Firecracker is not that - it essentially gives you a kernel that already knows the environment and can do the bare minimum before executing the provided image. It's like a halfway point between unikernels and independent VMs. The VM technology itself is not necessarily heavy - it just depends on how you use it.
Everybody hypnotized themselves into believing that containers are not secure and can never be made secure so they run one container per VM. Instead of investing in making containers secure, the industry decided to invest in making VMs ligher, so VMs are now efficient enough that you can run one container per VM.
Why run Docker in VMs instead of using VM images? Because Docker's build tools are more popular than Packer-style tooling.
I don't know if hypnosis works or not, but quitting smoking is good whether or not you hypnotize yourself to do it, and so is avoiding multi-tenant Docker. The broad kernel attack surface is much too scary to expose directly to multi-tenant workloads, and there have been fairly recent kernel LPEs that would have avoided any sane system call filter you could come up with.
It's a moot point, because this is a solved problem. Use containers for single-tenant workloads; use micro-VMs, whichever flavor you like best, for multi-tenant.
What is the status of hardware acceleration for micro-VMs? I know that Firecracker doesn't have GPU support yet, are there any other options that handle this?
It's a good question. GPUs are in some ways a difficult case for the Firecracker model, which prizes a minimal, mostly memory-safe attack surface that (critically) is easy to reason about. We'd very much like to get an instance or machine type that supports GPUs, but we perceive it as a Big Project. We might not even use Firecracker to do it when we finally get it rolling.
If you're reading this and have big thoughts on how we might do GPUs at Fly.io without keeping us up at night about security, you should reach out; we're hiring.
> hypnotized themselves into believing that containers are not secure
They provide an extra layer of indirection, which helps against the usual exploit attempts, but they also introduce new attack surface. We've had exploits specifically targeting the namespaces API already.
> We've had exploits specifically targeting the namespaces API already
Well, isn't that what happens when you put a shield into place? Someone tries to break it. Why have people concluded that it can never be made properly secure?
Because the broad kernel attack surface is huge, and the shield has to reliably protect all of it, or all you've done is create a jungle gym for vulnerability researchers. The win with virtualization is that it drastically scopes down the amount of kernel code exposed to untrusted code.
Everybody hypnotized themselves into believing that daemons are not secure and can never be made secure so they run one daemon per container. Instead of investing in making daemons secure, the industry decided to invest in making daemons heavier, so each daemon can now be provided with its own hardware/OS abstraction layer.
Why run daemons in containers instead of using proper process isolation? Because containers absolve the system administrator from understanding their systems.
I don't necessarily disagree with "the linux kernel boundary is porous and probably not possible to secure in depth", but IIUC the hypervisor is part of the kernel too, so wouldn't it have the same problem?
The win would be in the attack surface area. For hypervisors there's a good layer of abstraction to pivot over, whereas with containers it's a much thinner wall.
KVM presents a very narrow, well understood surface that is much much less of a moving target and changes at a much slower rate. Qemu has traditionally been a problem which is why Amazon and google have moved away from it.
The container kernel surface is just insane by comparison.
Which is an alternative approach to providing a dedicated kernel to the VM/container (because that's basically what a hypervisor does). gVisor effectively implements a Linux kernel in user space, written in a memory-safe language: a kernel that insulates the host kernel from guest system calls by literally implementing wrappers around host system calls.
Containers are still more lightweight (tho not by a lot these days), but they are hella insecure for untrusted workloads. Plus, people like and depend on Docker workflows, hence taking a Docker container (basically just a tarball + JSON manifest) and making a VM out of it.
Isolation would be the biggest advantage so they can host multiple clients on the same machine. Fly uses a lightweight vm (someone chime in if they have better details).
I am so excited about the future. We are seeing a bunch of announcements from multiple companies that make it possible for a single developer or small team to fairly cheaply run a global service without spending a whole lot of time on ops.
I am excited to see what people will come up with.
> We're not done. You need something to run, right? Firecracker needs a root filesystem. For this, we download Docker images from a repository backed by S3. This can be done in a few seconds if you're near S3 and the image is smol.
I feel like I am missing something. If an S3 bucket is a requirement and I was interested in the isolation provided by Firecracker why wouldn't I just use AWS Fargate or Lambda which are both powered by Firecracker? If low latency was the concern, I can't imagine there being any lower latency than having my workload and storage being colocated in the same AWS Availability Zone.
That is talking about our S3 buckets, when you use these you don't know you're using S3.
Fargate and Lambda are not as consistently fast at booting VMs. Fargate, in particular, can take minutes to get a container launched.
This is not because they're bad services, it's because they make different tradeoffs than we do. When you ask for a Fargate container (or a new Lambda "instance"), AWS actually moves other containers/lambdas out of the way to get you running. Most of the wait time is their infrastructure doing orchestration magic to match your thing to their available compute.
Fly Machines don't do any of this. If you try and start a machine and there's no capacity for it, you get a very fast error response instead. This works well for our early customers. Most of them want to start a process quickly enough for a good UX. Fast errors give them a chance to do that.
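From the caller's side, the fast-error tradeoff looks something like the sketch below: because a capacity miss fails immediately instead of queueing, you can fall back to another region within the same request. `NoCapacity` and `start_machine` are hypothetical stand-ins, not Fly's actual client API:

```python
class NoCapacity(Exception):
    """Raised when a region has no room for the machine (hypothetical)."""
    pass

def start_with_fallback(start_machine, regions):
    """Try regions in order; return the first machine that starts."""
    tried = []
    for region in regions:
        try:
            return start_machine(region)
        except NoCapacity:
            tried.append(region)  # error came back fast, so just move on
    raise NoCapacity(f"no capacity in any of {tried}")
```

The same pattern works for degrading UX gracefully (queue the job, show a "busy" page) instead of making the user wait out an orchestrator.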
I was really excited when reading this, but realized the lack of a faster "warm" start makes this less ideal for my highly latency-sensitive use case on Lambda. Lambdas start much faster than 300ms when warm IME, and I'm hoping with enough sustained traffic (be it real or artificial), most requests will be warm.
I'd love to be able to supply some kind of memory snapshot in addition to the docker image to cut down on cold starts. Probably blocked on snapshot support in Firecracker according to another thread? Eagerly awaiting this since it could make Fly Machine the best of both worlds!
Not a fan of how Lambda makes me scale memory and compute in tandem, when my use case benefits so much more from compute than memory. I basically have to pay for 2+ gigs I'm never going to use to get the compute performance I want. Makes 0 sense.
> Lambdas start much faster than 300ms when warm IME.
My understanding is that Lambdas aren't ever really truly warm unless you have a completely steady traffic level. The first request in a while will hit the latency spike. But so will every increase in concurrency. So if you were serving 2 req/sec, and a 3rd concurrent visitor comes along then they will also get a cold start.
If you have a low-latency use case then fly.io's regular VMs are much better than either this or lambda. You get a permanently running VM (the smallest of which is $2/month for 256MB of RAM), which can serve more than one request all by itself and will auto-scale with traffic.
I do actually already use Fly for pretty much everything else.
For this use case though, I forgot to mention that it needs _much_ faster autoscaling than what Fly's regular VMs offer, with unbounded concurrency, and not ideal to run concurrently in a single VM due to each request being compute heavy and needing full isolation from each other since they run arbitrary customer code.
It's true that with Lambda, some amount of cold starts are probably inevitable with extreme spikes in traffic. But I'm hoping to mitigate most of that by sending artificial concurrent traffic on a schedule to keep a decent buffer of warmed up Lambdas above the current real traffic level. Still to be seen if that plan works out in practice.
Hmm... If you're willing to run the scheduler and/or something sending artificial traffic yourself (and are willing to pay for a few warm instances), then it seems like you might be willing to get something very low-latency with these Fly Machines. You could maintain a pool of a few booted and ready to go (but idle, waiting for a request to come in) at all times. When a request comes in you pass that off to an already-warm machine and boot a new one ready for a later request.
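The pool idea above can be sketched in a few lines. `boot` stands in for a "create and start a Machine" call (an assumption, not Fly's real API), and in practice you'd refill asynchronously rather than inline:

```python
import collections

class WarmPool:
    """Keep a few pre-booted machines idle so requests never pay a cold start."""

    def __init__(self, boot, size=3):
        self.boot = boot
        self.idle = collections.deque(boot() for _ in range(size))

    def acquire(self):
        """Hand out a pre-booted machine, then top the pool back up."""
        machine = self.idle.popleft() if self.idle else self.boot()
        self.idle.append(self.boot())  # refill for the next request
        return machine

pool = WarmPool(boot=lambda: object(), size=2)
machine = pool.acquire()  # served warm; no boot on the request path
```

The cost is exactly the tradeoff being discussed: you pay for `size` idle machines continuously in exchange for zero cold starts up to that burst size.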
Though I think ultimately it's going to be impossible to have all 3 of:
For my use case, the main benefit of using manually pre-warmed Lambda instances over Fly Machines is the fact that "warm" Lambdas cost practically nothing (only have to pay for the warming events that are billed for milliseconds every few minutes). :)
This is why I want to see a similar mode of operation for Fly Machines, possibly through memory snapshots, so I can manually provide a "warm" suspend state for it to unsuspend into. In fact this would be even better than the lambda model since there would be _no_ cold starts.
as far as i understand this will let me run VMs with specified Docker images?
i'm thinking of using something Fly.io to offer a dedicated hosting for my upcoming product, so when the customers sign up they get a new machine with an individual endpoint
the workload that needs to be running on those machines is quite intensive (like crawling web pages) and not very scalable when sharing resources
also can you give more details about your Nomad stack?
i was actually thinking of using Kubernetes or Docker swarm as API to deploy these workloads
This is separate from the Nomad stack. When you run `fly launch` you get a Fly App that's orchestrated by Nomad and manages something-like-fly-machines for you. Fly Machines are nearly orchestration free.
Machines are designed to work well for your customer hosting! You can install a machine for them, and then turn it on when they push a button, or have it turn on automatically when they visit a URL.
I'm happy to talk about it more. Feel free to send me/us an email!
sounds great! the question about Nomad was not related to Fly Machine
btw., i've sent a mail already (mish at ushakov), but only got an automated message
i'm already using free fly.io for wikinewsfeed.org and very happy so far
would be awesome if you could send me a tip how i could use fly.io to deploy instances for my customers?
in my use-case i want to run many instances of Chrome at the same time
running like 100x instances of Chrome on a single machine is too resource-intensive and having a bigger host machine won't do the job, so your only option is to have a dedicated vm for each Chrome instance
That looks promising, but I don't want to handle any of this myself :/
I want a service where I can start a new VM fast by posting to an API, have it run some long running JS server code, and the VM should close itself when it's done.
My use case is often CPU bound, so a small VM with a single CPU is just fine.
Do you have recommendations for stateful workloads? Would the answer always be 'connect to an external DB/API for all state'?
E.g. if I need to run a bunch of processing, would it be
A) spin up the micro-VM and pull from a queue service
B) embed SQLite
C) use some kind of in-memory store
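Option (B) is worth a sketch, since it's the least familiar of the three: the machine keeps its working state in an embedded SQLite file, so it can crash or restart without losing progress, and only final results need to leave the VM. The table layout is illustrative, and on Fly you'd point the connection at a volume-backed path instead of `:memory:`:

```python
import sqlite3

# Open the embedded database (a file path on a persistent volume in practice).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE progress (item TEXT PRIMARY KEY, done INTEGER)")

# Record work as it completes; the transaction makes each batch durable.
with db:
    for item in ("a", "b", "c"):
        db.execute("INSERT INTO progress VALUES (?, 1)", (item,))

done = db.execute("SELECT COUNT(*) FROM progress WHERE done = 1").fetchone()[0]
print(done)  # → 3
```

(A) and (C) trade that local durability for simpler machines: the queue or in-memory store holds the state instead, and the VM stays disposable.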
TBH I've been waiting for years for someone to do 'firecracker as a service'. I must have searched that exact term about once per month.
Fly can do this without the new Machines functionality, it's a big selling point for them.
You set a min and max number of VMs to run. Set the watermark for users per machine. And they automatically scale up and down depending on the number of connections. It will even figure out where in the world all the connections are coming from, and spin up the new VM in a region near to them.
So if your video encoding was requested via HTTP(S) connections then this would be trivial.
Right, I was wondering about this too. There are mentions of instances vs machines, so I'm hoping it's possible to spin up multiple instances of the same machine to run tasks concurrently, but I haven't found this explicitly confirmed anywhere yet.
What's the DB / compute break-even for this use case? I assume if your app uses 90% of CPU cycles on DB access, this is not the way to go. And if your app is 90% compute this is a nice solution.
If you mean moving (or copying) VMs to another region/host, sorta. Stateless compute is easy; our apps platform (orchestrated by Nomad) already does it. Volumes complicate things because they live on a specific host, and moving volumes between hosts is a slow and tedious process. Solving this is high on our priority list. We need super fast volume forking and host migrations like yesterday.
Ah! I've had volumes on the top of my mind all week... Firecracker supports snapshots in a dev preview but we're not using it yet. We can't really do anything like that on Nomad, which is one of the many reasons we're keen to get off it.
We do not. I'd like to, but we have a lot more infrastructure work to do before it's even possible. At the moment, we don't even migrate persistent disks or IP addresses between hosts.
Do Fly Machines support POSIX/System V shared memory? It is a giant pain in the ass for us because lambda does not implement these shared memory mechanisms which many multiprocessing libraries use to communicate. Makes it hard to utilize multiple lambda cores when running python code. You need to use multiprocessing due to the python GIL, but most of the python multiprocessing IPC uses shared memory.
We managed to hack a solution which uses pipes for IPC but it would be nice not to have to do this.
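For reference, this is the mechanism that's missing on Lambda: POSIX shared memory via the stdlib (`multiprocessing.shared_memory`, Python 3.8+). It backs onto `/dev/shm`, which a full VM - and presumably a Fly Machine, as a normal Linux guest - provides; the names and sizes here are illustrative:

```python
from multiprocessing import shared_memory

# Create a shared segment (the "writer" side of the IPC).
shm = shared_memory.SharedMemory(create=True, size=16)
try:
    shm.buf[:5] = b"hello"
    # A second process would attach to the same segment by name
    # (done in-process here to keep the sketch self-contained).
    peer = shared_memory.SharedMemory(name=shm.name)
    data = bytes(peer.buf[:5])
    peer.close()
finally:
    shm.close()
    shm.unlink()

print(data)  # → b'hello'
```

If this runs, `multiprocessing.Pool` and friends that rely on shm should work too, without the pipe workaround.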
I haven't tried using shm on Fly, but I don't see why they'd not work.
You could already run multiple processes in a guest vm [0] or run multiple guests vms on the same host [1]. If the guest kernel Fly boots your app into doesn't have required modules, I guess you could consider requesting (nerd-sniping 'em) specifically for those, like so: https://twitter.com/dave_universetf/status/14262218974072422...
interesting, i was more pointing to the cold startup time on lambda with docker images vs this.
> Deploy App Servers Close to Your Users
is there a timeout limit to functions? This piques my interest but I can't tell if fly is a serverless function provider or some way to deploy my docker closest to my user (which is what I am looking for right now)
What guarantee is there in terms of average latency for my users? Is there a looking glass of sort where I can ping/see all the locations where my docker images will be running?
a dedicated 4-core 8gb ram is $124.00/month which is 4~6x more expensive than running on KVM vps so I want to know what I am signing up for
edit: I see the list of locations and it makes me think, aren't I already doing what fly.io is doing? I spin up a VPS instance at one of the locations that is closest to my user. It takes about 30~120 seconds. It's far far cheaper
Oh I see! Fly Machines are lower level than what most people use us for. They're designed for people building platforms.
Most people run "apps" on Fly. You control which regions they run in, we load balance to the nearest. We have guides for launching some frameworks here: https://fly.io/docs/getting-started/
The difference between us and a VPS provider is: your app runs in as many of those regions as you want. So does your database, if you're using our Postgres. And we route writes to the appropriate place: https://fly.io/blog/globally-distributed-postgres/
The other difference is probably CPU. The 4 cpu, 8GB RAM instances are 2 dedicated AMD EPYC cores + hyper threads. They're relatively expensive. You may not need them! VPS providers typically run cheaper CPUs and over provision them.
We're shipping shared CPU options with more memory soon. They should be closer to what you see from VPS providers, though still more expensive.
I see, so is it routing requests to a single instance or is it routing it to the nearest instance to the user? How will it scale if there are lot of users in a particular city?
Lambda is fundamentally message-oriented: you send an invoke request, Lambda either routes the request to an existing warm instance or boots a new one, the request is processed, and then the instance suspends itself.
Fly VMs are just VMs that can start quickly.
They don't seem to currently support a request/response-based VM lifecycle.
If you wanted to use a Fly VM in a lambda-like way, it seems like you would need some kind of proxy to coordinate the work, i.e. start the VM via the API, have your VM process start a web server, once it's booted send it an HTTP request, and once the request is finished shut down the VM via the API.
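That proxy-coordination lifecycle is simple enough to sketch. Everything on `api` below is a hypothetical stand-in for Machines API calls; the point is just the ordering (start -> wait -> forward -> stop):

```python
def handle_request(api, payload):
    """Serve one request lambda-style: boot, forward, tear down."""
    machine = api.start()                 # boot a fresh VM for this request
    try:
        api.wait_until_ready(machine)     # poll until its web server is up
        return api.forward(machine, payload)
    finally:
        api.stop(machine)                 # always tear the VM back down
```

Every request pays the boot in this scheme, which is why the suspend/resume behavior discussed below matters so much.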
Also seems like fly can't suspend a running process, your process needs to start up every time you start a fly VM.
Lambda will suspend a VM between requests, keeping the process in memory for a few minutes.
Sending a subsequent request to a warm lambda is much faster than booting a lambda from scratch, particularly for JIT-based language runtimes.
> We're not done. You need something to run, right? Firecracker needs a root filesystem. For this, we download Docker images from a repository backed by S3. This can be done in a few seconds if you're near S3 and the image is smol.
Lmao props to the team for getting this copy out unsanitized by (potentially) unchill bosses.
He and Chris Nicoll snuck it past me; I was too tied up in family business to moderate the tone and add the business-friendly grace we're so known for. But this is just a feature announcement; I don't think we really hoped this would be a front page discussion. We have some meaty stuff to say about how Fly Machines work coming up.
I know some prominent HN users work for fly.io, and they seem to be doing some interesting work, but the absolutely glowing response that every blog post gets here on HN seems a bit nepotistic.
I would hope people like the blog posts that we tend to "chart" with on HN, because we write them deliberately for HN and not to check marketing checkboxes. This post is not that; it's a straight-up feature announcement. You'll see me elsewhere on the thread, and many of us on Twitter, remarking that we weren't in a rush to see this on the front page.
We're painfully aware that we get a limited number of bites at the HN apple, and we try to spend those on things like Litestream.io, which is an open source project that benefits people who won't ever use Fly.io. Several of our last few blog posts were about stuff we've done "wrong"; so, we've also got no qualms about charting on HN with a post about how much trouble we've had with Raft, or user-mode WireGuard.
Dan Gackle has said a bunch of times that he wishes more companies got the lovey-dovey reaction we seem to get from HN. I've got the cheat codes, if you want them: write posts for the HN audience, and throw your marketing goals out the window. I'm not going to bullshit you and say that we don't benefit from those kinds of posts too, but I hope it's at least clearer why they're received more warmly than a lot of tech company product announcements: we don't write them to be product announcements. (Unlike this post!)
If there was a "No HN" meta tag we could set on our posts, this post would have had it.
Ordinarily I'd be squeamish about dragging us into metacommentary like this on one of our stories, because I'd rather argue about whether you can scale a modern full stack app entirely on SQLite than about our marketing. But, like I said, we've got no skin in how this post ranks here.
This is really really exciting! I hope it enables more products built on top of full VMs with fast UX/DX.
I just wish I knew about this earlier because from what I read, I think we at Devbook [1] built a pretty similar service for our product. We are using Docker to "describe" the VM's environment, our boot times are in similar numbers, we are using Nomad for orchestration, and we are also using Firecracker :). We basically had to build our own serverless platform for VMs. I need to compare our current pricing to Fly's.