Microservices. They seemed really cool until I worked on a few large projects that used them. A disaster so epic that I watched most of engineering management walk the plank. TL;DR: the tooling available is not good enough.
The biggest cause lies in inter-service communication. You push transaction boundaries out of the database and in between services. At the same time, you lose whatever promise your language offers that compilation will "usually" fail if interdependent endpoints get changed.
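To make that concrete, here's a hedged Python sketch (all names invented): in-process, a renamed field fails loudly at the call site, while across a serialized service boundary the same rename only surfaces at runtime.

```python
import json

# In-process: a typed function. Rename a parameter and every caller breaks
# immediately (or at type-check time with mypy).
def charge(user_id: int, amount_cents: int) -> str:
    return f"charged {user_id} {amount_cents}"

# Across a service boundary, the "call" is just a serialized payload.
# Suppose the producer renamed the field to camelCase in a later release:
producer_payload = json.dumps({"user_id": 1, "amountCents": 500})

def consume(raw: str) -> str:
    data = json.loads(raw)
    # The KeyError surfaces in production, not at build time.
    return charge(user_id=data["user_id"], amount_cents=data["amount_cents"])

try:
    consume(producer_payload)
except KeyError as e:
    print(f"runtime failure: missing field {e}")
```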
Another big issue is the service explosion itself. Keeping 30 backend applications up to date and playing nicely with each other is a full-time job: CI pipelines, failover, and backups for every one of them.
The last was the lack of promised benefits. Velocity was great until we got big enough that all the services needed to talk to each other. Then everything ground to a halt. Most of our work was eventually just keeping everything speaking the same language. It's also extremely hard to design something that works when "anything" can fail. When you have just a few services, it's easy to reason about and handle failures of one of them. When you have a monolith, it's really unlikely that some database calls will fail while others succeed. Unlikely enough that you can ignore it in practice. When you have 30+ services, it becomes very likely that calls will randomly fail. The state explosion from dealing with this is real and deadly.
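A hedged sketch of what handling partial failure looks like in code (all step names invented): once a logical "transaction" spans several services, every step needs a compensating undo, because any call can fail halfway through.

```python
from typing import Callable

def run_with_compensation(
    steps: list[tuple[Callable[[], None], Callable[[], None]]]
) -> bool:
    """Run (do, undo) pairs in order; on failure, undo completed steps in reverse."""
    done: list[Callable[[], None]] = []
    for do, undo in steps:
        try:
            do()
            done.append(undo)
        except Exception:
            for u in reversed(done):
                u()
            return False
    return True

# Toy usage: the second "service call" times out, so the first is compensated.
log: list[str] = []

def reserve_inventory(): log.append("reserved")
def unreserve_inventory(): log.append("unreserved")
def charge_card(): raise RuntimeError("payment service timeout")
def refund_card(): log.append("refunded")

ok = run_with_compensation([
    (reserve_inventory, unreserve_inventory),
    (charge_card, refund_card),
])
print(ok, log)  # False ['reserved', 'unreserved']
```

In a monolith with one database, all of this is a single `ROLLBACK`.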
Yeah, if you are going for a microservices architecture, you need at least one person or dedicated team in an oversight / architecture role that keeps the design and growth in check. Primarily that means saying "no" when someone wants to create a new service or open up a new line of communication. It's an exercise in limiting dependencies.
And the easiest way to do that is to not build a microservices architecture; instead (and I hope I'm preaching to the choir here) build a monolith (or "a regular application") and only if you have good numbers and actual problems with scaling and the like do you start considering splitting off a section of your application. If you iterate on that long enough, MAYBE you'll end up with a microservices architecture.
What saved us before was that our forest of code could depend on the database to maintain some sanity. And we leaned on it heavily. Hold a transaction open while 10,000 lines of code and a few N+1 queries do their business? Eh, okay, I guess.
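For illustration, a minimal sketch of that pattern, with sqlite3 standing in for a real RDBMS: one transaction wraps all the work, and a failure anywhere rolls everything back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # one transaction around all of the "business"
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
        # enforce an invariant mid-transaction; raising rolls back everything
        (bal,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = 1"
        ).fetchone()
        if bal < 0:
            raise ValueError("overdraft")
except ValueError:
    pass

# Both updates were rolled back together; the database kept us sane.
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(100,), (0,)]
```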
Maybe we didn't have the discipline to make microservices work. But IMO our engineering team was pretty good compared to others I've seen. All our "traditional" apps chugged along fine during the same period.
I don't think so. This kind of thing comes up constantly with an RDBMS. A new requirement means we need to join thneeds and widgets data together. In a regular database, even NoSQL, this isn't a hard problem.
When the services have their own datastores, well, now they need to talk to each other.
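For contrast, here's how trivial that join is when both tables live in one database (sqlite3 as a stand-in, table names borrowed from the example above):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE widgets (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE thneeds (id INTEGER PRIMARY KEY, widget_id INTEGER, color TEXT);
    INSERT INTO widgets VALUES (1, 'sprocket'), (2, 'gizmo');
    INSERT INTO thneeds VALUES (10, 1, 'red'), (11, 2, 'blue');
""")

# One query. Across two services, this becomes an API call per side,
# pagination, and a hand-rolled in-memory merge.
rows = db.execute("""
    SELECT w.name, t.color
    FROM widgets w JOIN thneeds t ON t.widget_id = w.id
    ORDER BY w.id
""").fetchall()
print(rows)  # [('sprocket', 'red'), ('gizmo', 'blue')]
```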
We actually tried this as well. It never made it out of testing. We ended up with copies of data in many places, which was annoying. We duplicated a lot of work for consuming the same events across multiple services and making sure they updated the "projection" the same way.
However, a much larger problem was the generally bad tooling. Specifically, the data storage requirements for an event stream eclipsed our wildest projections. We're talking many terabytes just on our local test nodes.
We tried to remedy this by "compressing" past events into snapshots but the tooling for this doesn't really exist. It was far too common for a few bad events to get into the stream and cause massive chaos. We couldn't find a reasonable solution to rewind and fix past events, and replays took far too long without reliable snapshots.
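A minimal sketch of that snapshot idea, assuming a toy account-balance stream: fold a prefix of the event log into a snapshot so replay only has to cover the tail.

```python
def apply(state: dict, event: dict) -> dict:
    """Apply one event to a state; toy deposit/withdraw domain."""
    state = dict(state)
    if event["type"] == "deposit":
        state["balance"] = state.get("balance", 0) + event["amount"]
    elif event["type"] == "withdraw":
        state["balance"] = state.get("balance", 0) - event["amount"]
    return state

def snapshot(events: list[dict]) -> dict:
    """'Compress' a prefix of the stream into a single state."""
    state: dict = {}
    for e in events:
        state = apply(state, e)
    return state

old_events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
]
new_events = [{"type": "deposit", "amount": 5}]

snap = snapshot(old_events)   # {'balance': 70} replaces the old prefix
state = snap
for e in new_events:          # replay only the tail on top of the snapshot
    state = apply(state, e)
print(state)  # {'balance': 75}
```

The hard part in practice, as described above, isn't this fold; it's doing it reliably at scale and rewinding when bad events get in.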
In the end I was convinced that the whole event driven approach was just a way of building your own "projection" databases on top of a "commit log" which was the event stream.
Keeping a record of past events also wasn't nearly as useful as we originally believed. We couldn't think of a single worthwhile use for our past event data that we couldn't just duplicate with an "audit" table and some triggers for the data we cared about in a traditional db.
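A sketch of that audit-table-plus-triggers alternative, with sqlite3 standing in for a traditional RDBMS (schema invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE users_audit (
        id INTEGER, old_email TEXT, new_email TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- the trigger records history automatically; no event stream needed
    CREATE TRIGGER users_update_audit AFTER UPDATE ON users
    BEGIN
        INSERT INTO users_audit (id, old_email, new_email)
        VALUES (OLD.id, OLD.email, NEW.email);
    END;
""")

db.execute("INSERT INTO users VALUES (1, 'a@example.com')")
db.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")

print(db.execute("SELECT id, old_email, new_email FROM users_audit").fetchall())
# [(1, 'a@example.com', 'b@example.com')]
```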
Ironically we ended up tailing the commit log of a traditional db to build our projections. Around that time we all decided it was time to go back to normal RPC between services.
I appreciate you sharing this. I'm considering embarking on this approach with my team, and everything you are mentioning is what I was worried about when I first started reading up on the microservices architecture.
Now I'm seriously considering a somewhat hybrid approach: collect all of my domain data in one giant normalized operational data store (using a fairly traditional ETL approach for this piece), and then have separate schemas for my services. The service schemas would hold denormalized objects designed for the functional needs of each service, implemented either as materialized views built off the upstream data store, or possibly with an additional "data pump" approach where activity in the upstream data store triggers some sort of asynchronous process to copy the data into the service schemas.

That way my services would be logically decoupled, in the sense that I could later separate the entire schema for a given service into its own database if needed. But by keeping it all in one database for now, it should make reconciliation and data quality checks easier. Note that I don't have a huge amount of data to worry about (~1-2TB), which could make this feasible.
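A hedged sketch of that per-service-schema idea, using a plain sqlite3 view as a stand-in for a materialized view or async data pump (all table names invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- upstream normalized operational data store
    CREATE TABLE ods_customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ods_orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER
    );

    -- denormalized per-service shape that a hypothetical billing
    -- service reads; a real setup might materialize or pump this
    CREATE VIEW billing_orders AS
    SELECT o.id AS order_id, c.name AS customer_name, o.total
    FROM ods_orders o JOIN ods_customers c ON c.id = o.customer_id;
""")
db.execute("INSERT INTO ods_customers VALUES (1, 'Acme')")
db.execute("INSERT INTO ods_orders VALUES (100, 1, 250)")

print(db.execute("SELECT * FROM billing_orders").fetchall())
# [(100, 'Acme', 250)]
```

Because the service only ever touches `billing_orders`, its schema could later be peeled off into its own database without rewriting the service.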
There are two main approaches to handling "events": event sourcing vs. direct RPC. After our disaster I highly recommend Google's approach: a structured gRPC layer between services with blocking calls. You might think you don't have much data (we didn't either), but when Kafka is firehosing updates to LoginStatus 24/7, data costs get out of control fast.
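Not real gRPC, but a hedged sketch of that "structured RPC with blocking calls" shape: typed request/response objects (like generated protobuf stubs) so schema drift fails loudly at the call site instead of deep inside a consumer. All names here are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoginStatusRequest:
    user_id: int

@dataclass(frozen=True)
class LoginStatusReply:
    user_id: int
    logged_in: bool

def get_login_status(req: LoginStatusRequest) -> LoginStatusReply:
    # A real implementation would make a blocking network call with a
    # deadline; a local stub keeps this sketch self-contained.
    return LoginStatusReply(user_id=req.user_id, logged_in=True)

# Callers construct a typed request; a renamed or missing field is a
# TypeError here, at the call site, not a silent mismatch downstream.
reply = get_login_status(LoginStatusRequest(user_id=42))
print(reply.logged_in)  # True
```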
I'm going against the Martin Fowler grain hard here, but event sourcing in practice is largely a failure. Mostly it's the bad tooling, as I mentioned, but please stay away. It's so bad.
"every service can just subscribe to the data it needs."
Doesn't that imply that each service then has to store any data it receives in these events, potentially leading to a lot of duplication and all of the problems that can come with that (e.g. data stores getting out of sync)?
Yes, that's exactly what it implies. Like I said at the top of this thread, I'm not an expert on this approach (I've done my reading, but haven't yet spent time in the trenches), but my understanding is that you would embrace the duplication and eventual consistency. I do wonder how well it works in practice though, and how much time you would spend running cross-service reconciliation checks to make sure your independent data stores are still in sync.
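A rough sketch of what such a cross-service reconciliation check might look like (data invented): diff two services' copies of the "same" records and report drift.

```python
def reconcile(store_a: dict, store_b: dict) -> dict:
    """Return keys that are missing from one side or disagree in value."""
    drift = {}
    for key in sorted(store_a.keys() | store_b.keys()):
        if store_a.get(key) != store_b.get(key):
            drift[key] = (store_a.get(key), store_b.get(key))
    return drift

# Two services' independent views of order state, drifted apart:
orders_service = {"order-1": "shipped", "order-2": "pending"}
billing_service = {"order-1": "shipped", "order-2": "paid", "order-3": "paid"}

print(reconcile(orders_service, billing_service))
# {'order-2': ('pending', 'paid'), 'order-3': (None, 'paid')}
```

The check itself is simple; the operational question is how often you run it and what you do when it finds drift.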
A microservice should not depend on another microservice! I see the same mistake in plugin and module design patterns.
When you make one service dependent on another service, you add complexity. Some complexity is necessary, but everything (scaling, redundancy, resilience, replacing, rewriting, removing, etc.) will be easier without it.
The problem is that your run-of-the-mill buzzword-driven microservice is basically a collection of FaaS (logout function, login function, ping function, get-user function, update-user function) behind an API gateway. What constitutes a single responsibility is up for careful consideration, but IMHO, microservices as perpetuated by the mainstream buzzword-cowboys high on cloud are very ill-informed and only suitable for very, very large teams with extreme loads.
Services having a single responsibility sounds like good advice, but how do you turn a number of single-responsibility services into a working application? Any process that touches multiple systems becomes a lot more complicated. Single-responsibility services are good advice, but it's too easy for short-sighted developers to obsess over that instead of the bigger picture. Yes, it makes it easier to carve out your segment of an application, and yes, that codebase will be easier to maintain and reason about, but someone has to keep the bigger picture in mind. That's often lacking.
It's not that easy in my experience. They use different databases. Different versions of frameworks. Some written in different languages. We tried to have a "one size fits all" CI pipeline but that fragmented over time.
The overhead was huge compared to "traditional" apps. Just updating a Docker base image was a weeks-long process.
Is that really so bad? At edX all of our services were Django. After the third service was created we built templates in Ansible and cookiecutter to create future services and standardize existing ones. We created Python libraries with common functionality (e.g. auth).
We were a Django shop. Switching to SOA didn’t mean switching languages and frameworks.
If your services were all set up the same, what was the big advantage of having them separate? Wouldn't you get the same scalability from running 10x of the monolith in parallel, with a lot less work?
The primary advantage was time to market. When I started five years ago edX had a monolith that was deployed weekly...after a weekend of manual testing. The organization was not ready to improve that process, so we opted for SOA. By the time the monolith had an improved process—2 years later—we had built about three separate services, all of which could be deployed multiple times per day.
Haha, I see you haven't worked with edX. Basically a lot of services just go down, and the main reason to have them separated is so they don't ALL go down; the insights/metrics service is infamously hard to get up and keep steady.
While acknowledging the problem mentioned, I still believe in microservices, but in my opinion, it needs to be done with simpler tools. For example, next time, I will use firejail instead of docker.
I think that the issue is that microservices _require_ good practices and discipline.
These are attributes that 80% of projects and teams lack, so when they decide to jump onto the microservices bandwagon, the shit hits the fan pretty quickly.