Given the ability to deploy pods to dedicated nodes based on label selectors, what is the actual performance impact of running a database in a container on a bare metal host with a mounted volume, versus running that same process under, say, systemd on the same node? Basically, shouldn’t the overhead of running a container be minimal?
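For context, the node dedication being described is usually a label selector plus a taint, so nothing else schedules onto the database node. A minimal sketch (all names, images, and paths illustrative):

```yaml
# One-time node prep (illustrative names):
#   kubectl label nodes db-node-1 workload=database
#   kubectl taint nodes db-node-1 dedicated=database:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  nodeSelector:
    workload: database            # schedule only onto the labeled node
  tolerations:
  - key: dedicated
    operator: Equal
    value: database
    effect: NoSchedule            # tolerate the taint that keeps other pods off
  containers:
  - name: postgres
    image: postgres:16
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    hostPath:
      path: /mnt/db-data          # local disk on the bare-metal host
```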
The problem is kubelet likes to spike in memory / CPU / network usage. It's not a well-behaved program to put alongside a database. It's not written with an eye for resource utilization.
Also, it brings nothing of value to the table, but requires a lot of dancing around to keep it going. I.e. if you are a decent DBA, you don't have a problem setting up a node to run your database of choice, and you're probably opposed to using pre-packaged Docker images anyway.
Also, Kubernetes sucks at managing storage... basically, it doesn't offer anything that'd be useful to a DBA. Things that might be useful come as CSI plugins... and, obviously, it's better / easier not to use a CSI driver at all, but to interface directly with the storage you want instead.
That's not to say that storage products don't offer these CSI drivers... so, a legitimate question would be: why would anyone do that? -- and the answer is -- not because it's useful, but because a lot of people think they need / want it. Instead of fighting stupidity, why not make an extra buck?
I run DBs on K8s, not because I don’t know what I’m doing, but because most of the trade-offs are worth it.
If I run a db workload in K8s, it’s a tiny fraction of the operational overhead, with no massively noticeable performance loss.
I would absolutely love a way to deploy and manage DBs as easily as K8s, with fewer of the quite significant issues that have been mentioned. So if you know of something that is better behaved around singular workloads, but keeps the simple deploys, the resiliency, the ease of networking and config deployments, the ease of monitoring, etc., I am all ears.
If you think that deploying anything with Kubernetes is simple... well, I have bad news for you.
It's simple, until you hit a problem. And then it becomes a lot worse than if you had never touched it. You are now in the stage of a person who'd never made backups and never had a failure that required them to restore from backups, and you are wondering why anyone would do it. Adverse events are rare, and you may go on like this for years, or perhaps the rest of your life... unfortunately, your experience will not translate into general advice.
But, again, you just might be in the camp where performance doesn't matter. Nor does uptime matter, nor does your data have very high value... and in that case it's OK to use tools that don't offer any of that, and save you some time. But, you cannot advise others based on that perspective. Or, at least, not w/o mentioning the downsides.
Everyone running databases in production knows how to take backups and restore from them. K8s or not, even using your cloud provider's built-in database backups is hardly safe. One click of the "delete instance" button (or nowadays, an exciting fuck up in IaC code), and your backups are gone! Not to mention the usual cloud provider problems of "oops your credit card bounced" or "the algorithm decided we don't like your line of business". You have to have backups, they have to be "off site", and you have to try restoring them every few months. There is pretty much no platform that gives you that for free.
I am not sure what complexity Kubernetes adds in this situation. Anything Kubernetes can do to you, your cloud provider (or a poorly aimed fire extinguisher) can do to you. You have to be ready for a disaster no matter the platform.
> If you care about perf you would pin the kubelet
Wrong. I wouldn't use kubelet at all. Kubernetes and good performance are not compatible. The goal of Kubernetes is to make it easier to deploy Web sites. The Web is a very popular technology, so Kubernetes was adopted in many places where it's irrelevant / harmful, because Web developers are plentiful and will help to power through the nonsense of this program. It's there because it makes trivial things even easier for less qualified personnel. It's not meant as a way to make things go faster, or to use less memory, or less persistent storage, or less network etc... it's the wheelchair of ops, not highly-optimized professional-grade equipment.
How would containers even hurt performance? How does the database no longer having the ability to see other processes on the machine somehow make it slower?
1. fsync. You cannot "divide" it between containers. Whoever does it, stalls I/O for everyone else.
2. Context switches. Unless you do a lot of configuration outside of the container runtime, you cannot ensure exclusive access to the number of CPU cores you need.
3. Networking has the same problem. You would either have to dedicate a whole NIC or an SR-IOV-style virtual NIC to your database server. Otherwise, just the amount of chatter that goes on through the control plane of something like Kubernetes will be a noticeable disadvantage. Again, containers don't help here; they only get in the way, since to get that kind of exclusive network access you need more configuration on the host, and possibly a CNI plugin to deal with it.
4. kubelet is not optimized to get out of your way. It needs a lot of resources and may spike, hindering or outright stalling the database process.
5. Kubernetes sucks at managing memory-intensive processes. It doesn't work (well or at all) with swap (which, again, cannot be properly divided between containers). It doesn't integrate well with the OOM killer (it cannot replace it, so any configurations you make inside Kubernetes are kind of irrelevant, because the system's OOM killer will do as it pleases, ignoring Kubernetes).
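On point 5, for what it's worth, the one knob Kubernetes does expose here is the QoS class: when requests equal limits, the pod is classed Guaranteed and the kubelet gives its containers a very low oom_score_adj, which biases (though, as noted, does not override) the kernel OOM killer. A sketch, with illustrative sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
  - name: db
    image: postgres:16       # illustrative image
    resources:
      requests:
        cpu: "8"             # requests == limits on every resource
        memory: 32Gi         #   => Guaranteed QoS class
      limits:
        cpu: "8"
        memory: 32Gi
```

This only makes the database a less attractive OOM victim relative to other pods; it does not stop the kernel from killing it if the node itself runs out of memory.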
---
Bottom line... Kubernetes is lame from an infrastructure perspective. It's written for Web developers, to make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in the event of a failure, will come back to bite you. You don't want that kind of program near your database.
Whole core masking is not quite as easy as it should be, predominantly because the API is designed to hand wave away actual cores. The way you typically solve this is to go the other way and claim exclusive cores for the orchestrator and other overhead.
As these are obviously very real issues, and Kubernetes also isn’t going away imminently, how many of these can be fixed/improved with different design on the application front?
Would using direct-I/O APIs fix most of the fsync issues? If workloads pin their stuff to specific cores, can we sidestep some of the overhead here? (Assuming we’re only running a single dedicated workload + kubelet on the node.)
> You would either have to dedicate a whole NIC or SRI-OV-style virtual NIC to your database server
Tbh I had no idea we could do this with commodity cloud servers, nor do I know how, but I’m terribly interested in knowing how. Do you know if there’s like a “dummy’s guide to better networking”? Haha
> kubelet is not optimized to get out of your way...Kubernetes sucks at managing memory-intensive processes
Definitely agree on both these issues, I’ve blown up the kubelet by overallocating memory before, which basically borked the node until some watchdog process kicked in. Sounds like the better solution here is a kubelet rebuilt to operate more efficiently and more predictably? Is the solution a db-optimised kubelet/K8s?
This is extremely misinformed. No matter how you choose to manage workloads, ultimately you are responsible for tuning and optimization.
If you're not in control of the system, and thus kubelet, obviously your hands are tied. I'm not sure anyone is suggesting that for a serious workload.
Now to dispel your myths:
1. You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.
2. You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones. There are a number of advanced techniques that are not at all necessary if you want to be a control freak, such as creating your own cgroups. This isn't "outside" of the runtime. Kubernetes is designed to conform to your managed cgroups. That's the whole point. RTFM.
3. The general theme of your complaint has nothing to do with kubernetes. There's no beating a dedicated NIC and even network fabric. Some cloud providers even allow you to multi-NIC out of the box so this is pretty solvable. Also, like, the dumbest QoS rules can drastically minimize this problem generally. Who cares.
4. Nah. RTFM. This is total FUD.
5.a. I don't understand. Are you sharing resources on the node or not? If you're not, then swap works fine. If you are, then this smells like cognitive dissonance and maybe listen to your own advice, but also swap is still very doable. It's just disk. swapon to your heart's content. But also swap is almost entirely dumb these days. Are you suggesting swapping to your primary IO device? Come on. More FUD.
5.b. OOM killer does what it wants. What's a better alternative that integrates "well" with the OOM killer? Do you even understand how resource limits work? The OOM killer is only ever a problem if you either do not configure your workload properly (true regardless of execution environment) or you run out of actual memory.
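As a concrete version of point 1: a disk can be handed to the database without any CSI driver at all, via the built-in `local` volume type. A sketch with illustrative names, sizes, and paths:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: db-disk
spec:
  capacity:
    storage: 900Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-db
  local:
    path: /mnt/nvme0           # dedicated NVMe device, pre-formatted and mounted
  nodeAffinity:                # required for local volumes
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["db-node-1"]
```

A pod claims this through a matching PersistentVolumeClaim with `storageClassName: local-db`; nothing else on the node touches that device.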
Bottom line: come down off your high horse and acknowledge that dedicated resources and kernel tuning is the secret to extreme high performance. I don't care how you're orchestrating your workloads, the best practices are essentially universal.
And to be clear, I'm not recommending using Kubernetes to run a high performance database, but it's not really any worse (today) than the alternatives.
> It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in the event of a failure, will come back to bite you.
What planet are you currently on? This makes no sense. It's a set of abstractions and patterns, the intent isn't to hide the complexity but to make it manageable at scale. I'd argue it succeeds at that.
Seriously, what is the alternative runtime you'd prefer here? systemd? hand rolled bash scripts? puppet and ansible? All of the above??
> You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.
This is word salad. Do you even know what fsync is for? I'm not even asking if you know how it works... What is "alien" fsync activity? Mount is perhaps the one system call that has nothing to do with fsync... so, I wouldn't expect any fsync activity when calling mount...
Finally, I didn't say that you cannot allocate a dedicated storage device -- what I said is that Kubernetes or Docker or Singularity or containerd or... well, none of the container (management) runtimes that I've ever used know how to do it. You need external tools to do it. The point isn't that you cannot; the point is that a container runtime will only stand in your way when you try to do it.
> You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones.
No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.
And... I don't have the time or the patience necessary to respond to the rest of the nonsense. Bottom line: you don't understand what you are replying to, and are arguing with something I either didn't say, or you're just stringing meaningless words together.
I do, though perhaps an ignorant life would be simpler. "Alien" is a word with a definition. Perhaps "foreign" is a better word. Forgive me for attempting to wield the English language.
No one else will use your fucking disk if you mount it exclusively in a pod. Does that make sense? You must be a joy to work with.
> The point isn't that you cannot, the point is that a container runtime will only stand in your way when you try to do it.
I have no idea what this means. How does kubernetes stand in your way?
> No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.
This is incorrect. You can absolutely configure the kubelet to reserve cores and offer exclusive cores to pods by setting a CPU management policy. I know because I was waiting for this for a very long time for all of the reason in the discussion here. It works fine.
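For reference, the setting being described is the kubelet's CPU manager policy; roughly, in the kubelet config (core numbers illustrative):

```yaml
# /var/lib/kubelet/config.yaml (excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static        # hand out exclusive cores to eligible pods
reservedSystemCPUs: "0,1"       # keep kubelet and system daemons on cores 0-1
```

A pod then receives exclusive cores only if it is in the Guaranteed QoS class and requests an integer number of CPUs; fractional requests keep it in the shared pool.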
You clearly have an axe to grind and it seems pretty obvious you're not willing to do the work to understand what you're complaining about. It might help to start by googling what a container runtime even is, but I'm not optimistic.
Two assumptions to start with:
- containers are each isolated in a VM (aka virtualized)
- workloads are not homogenous and change often (your neighbor today may not be your neighbor tomorrow)
I believe these are fair assumptions if you’re running on generic infrastructure with kubernetes.
In this setup, my concerns are pretty much noisy neighbors + throttling. You may get latency spikes out of nowhere and the cause could be any of:
- your neighbor is hogging IO (disk or network)
- your database spawned too many threads and got throttled by CFS
- CFS scheduled your DBs threads on a different CPU and you lost your cache lines
In short, the DB does not have stable, predictable performance, which is exactly what you want it to have. If you run the DB on a dedicated host, you avoid this whole suite of issues.
You can alleviate most of this if you make sure the DB’s container gets the entire host’s resources and doesn’t have neighbors.
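On the CFS-throttling point: whether it is actually happening is at least observable from the cgroup's `cpu.stat` (`nr_periods` / `nr_throttled` on cgroup v2). A small helper to turn those counters into a percentage, assuming you've located the pod's cgroup directory yourself (the path below is illustrative):

```shell
# Percentage of CFS scheduling periods in which the cgroup was throttled.
# Usage: throttled_pct <nr_periods> <nr_throttled>  (values from cpu.stat)
throttled_pct() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (p ? 100 * t / p : 0) }'
}

# Read the real counters from the pod's cgroup (path is illustrative):
#   grep -E 'nr_periods|nr_throttled' /sys/fs/cgroup/kubepods.slice/<pod>/cpu.stat
throttled_pct 10000 1200   # → 12.0
```

Anything persistently above a few percent during normal load is a sign the CPU limit is too tight for the thread count the DB is running.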
> - containers are each isolated in a VM (aka virtualized)
Why are you assuming containers are virtualized? Is there some container runtime that does that as an added security measure? I thought they all use namespaces on Linux.
Not so; neither Kata Containers nor Firecracker are in widespread public use today. (Source: I work for AWS and consult regularly with container services customers, who both use AWS and run on premises.)