My overall reaction is that this is a great piece for products/teams that have reached significant scale, once the job to be done is too big and complex for one team to own, end-to-end, or there are truly reusable concerns that can be separated from the core product (e.g. auth, observability).
Because interfacing via API is expensive. Writing APIs for others to use productively isn't easy and change management also adds a lot of overhead. And if we're talking about network APIs, there's a ton of distributed systems complexity to account for.
> The problem with the DevOps movement is that it ended up taking “shifting left” to the extreme. In this sense, development teams weren’t so much empowered to deliver software faster; rather, they were over-encumbered with infrastructure tasks that were outside of their expertise.
This. In truth, I think this is a major misinterpretation of DevOps, which is meant to empower devs without loading them down with incidental complexity. But I experienced exactly this misinterpretation at the first place I worked that had embraced DevOps culture.
> Because interfacing via API is expensive. Writing APIs for others to use productively isn't easy and change management also adds a lot of overhead.
I agree in principle, but there are a lot of “unseen coordination/communication” costs that are easily taken for granted.
When I was working in telecom, interfacing with carriers (e.g. T-Mobile, Verizon, etc.), one thing that I noticed was how simple it was to work with those folks: OK, this is the standard XML, those are the endpoints, that’s the list of error codes, the rate limit is X requests per second, a bunch of files will be on this FTP at 5AM on a daily basis, and if you face more than 100ms latency from our side just call this number.
Working with “product” companies without silos, most of the time it’s design by committee: folks that won’t keep the service running wanting a say in our payload, wanting us to change our overly reliable RabbitMQ to use their Kafka.
This is an interesting example, because telco has an actual API standards committee (TM Forum). Telcos have decades of experience and an extremely well-defined (and to a large degree shared/interchangeable) domain model. It's an ideal scenario for APIs.
Meanwhile, your product companies each develop a different product; there's little standardization. API designers have only a vague idea of what the API will be used for and how. Fast evolution is important.
TM Forum provides a bloated format. It has everything and the kitchen sink. This means that something simple like a customer name can be written in multiple places, and different architects will recommend different locations. Two different silos within the same company won't be able to talk to each other because, although they are using the same format, they interpret its usage differently.
Additionally, this is XML/XSD, and different teams will end up using different versions of the standard. They won't be able to interface with each other without additional layers of translation. It is not uncommon for one team to need to load multiple versions of the XSDs because different endpoints use different versions.
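As a toy illustration of that pain, here is a sketch (with made-up element names, not actual TM Forum paths) of the adapter code teams end up writing when the "same" field lives in different places across schema versions:

```python
import xml.etree.ElementTree as ET

# Two hypothetical payloads for the "same" standard: one team nests the
# customer name under customer/name, another under relatedParty/fullName.
DOC_V1 = "<order><customer><name>Ada</name></customer></order>"
DOC_V2 = "<order><relatedParty><fullName>Ada</fullName></relatedParty></order>"

def customer_name(xml_text: str):
    """Probe each known location in order until one matches."""
    root = ET.fromstring(xml_text)
    for path in ("./customer/name", "./relatedParty/fullName"):
        node = root.find(path)
        if node is not None and node.text:
            return node.text
    return None  # neither variant present

print(customer_name(DOC_V1))  # Ada
print(customer_name(DOC_V2))  # Ada
```

Multiply this by every ambiguous field and every XSD version pair in play, and the translation layers add up quickly.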
> When I was working in telecom, interfacing with carriers (e.g. T-Mobile, Verizon, etc.), one thing that I noticed was how simple it was to work with those folks: OK, this is the standard XML, those are the endpoints, that’s the list of error codes, the rate limit is X requests per second, a bunch of files will be on this FTP at 5AM on a daily basis, and if you face more than 100ms latency from our side just call this number.
To me, that probably reflects the maturity of the services the carriers provide. And presumably that there's an explicit customer-producer relationship? These things justify the complexity of maintaining a well curated and operated API.
> Working with “product” companies without silos most of the time it’s design by committee, folks that won’t keep the service running wanting to have a say in our payload wanting us to change our overly reliable RabbitMQ to use their Kafka.
If I understand what you're saying, you've experienced platform people telling you what tech to use, without having real skin in the game for operating your services? If so, that sounds very irritating. To me, a truly silo-less approach would not have that.
To the extent that there are platform teams with a say in architecture, I think they should develop requirements around the external characteristics of the deliverable (performance, cost, observability, contract with other teams, etc) and largely leave the implementation concerns to the people developing and running the service.
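One way to make that concrete, purely as a sketch, is to express the platform team's requirements as data about externally observable characteristics only, and check measurements against them; every name and threshold here is invented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceContract:
    # Externally observable requirements only — nothing here constrains
    # the language, framework, or datastore the owning team picks.
    p99_latency_ms: float
    monthly_cost_usd: float
    error_budget_pct: float  # allowed failed-request percentage

def meets_contract(c: ServiceContract, measured: dict) -> list:
    """Return a list of violations; an empty list means compliant."""
    violations = []
    if measured["p99_latency_ms"] > c.p99_latency_ms:
        violations.append("p99 latency over budget")
    if measured["monthly_cost_usd"] > c.monthly_cost_usd:
        violations.append("cost over budget")
    if measured["error_pct"] > c.error_budget_pct:
        violations.append("error budget exhausted")
    return violations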
> folks that won’t keep the service running wanting to have a say in our payload wanting us to change
This is my single biggest gripe with DevOps. If you’re not going to be fixing it, why on earth do you get a say as to how I build it? It’s nearly _always_ a one-way street, too – when’s the last time Ops successfully ordered Dev to change some specific part of their code (modulo things like, “hey, you really need to add a rate-limiter”)?
IMO, this phenomenon is not specific to DevOps. I've seen this happen when there are official architects or architecture committees.
One of the problems about debating DevOps is that there's no single agreed upon meaning. That probably also impedes its success.
To me, the essence is that there are engineers whose job it is to give product and service devs the ability to do their own operations through software that simplifies basic shared concerns. That does not mean mandating solutions.
> Because interfacing via API is expensive. Writing APIs for others to use productively isn't easy and change management also adds a lot of overhead.
An under appreciated point. Something that is affordable for giant monopoly profit driven FAANGs, but harder in smaller orgs.
Even within my team, the "level of difficulty" of writing an API depends on its user base. If I am the primary user, it's easy. There's one really sharp kid on my team; if he is going to use it, then 2x the difficulty. If I need it to handle the median dev on my team, then 2x again. By the time you get to the below-median dev, add another 2x. So even within my team, the intended user base can change the difficulty from 1x to 8x.
What are some of the pain points? Input validation that drives useful, informative errors; flexibility on inputs, like having sensible defaults to reduce the number of inputs users must pass; performance/scaling; and edge cases.
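A minimal sketch of the first two pain points, with an entirely hypothetical `create_user` endpoint: validate everything up front, report every problem at once, and default the inputs most callers shouldn't have to pass:

```python
def create_user(name, *, role="viewer", quota_mb=512):
    """Validate inputs up front and fail with actionable messages.

    Only `name` is required; `role` and `quota_mb` have sensible defaults.
    """
    errors = []
    if not isinstance(name, str) or not name.strip():
        errors.append("name: must be a non-empty string")
    if role not in {"viewer", "editor", "admin"}:
        errors.append(f"role: got {role!r}, expected one of viewer/editor/admin")
    if not isinstance(quota_mb, int) or quota_mb <= 0:
        errors.append(f"quota_mb: got {quota_mb!r}, expected a positive integer")
    if errors:
        # Collect all problems instead of failing on the first one,
        # so the caller fixes their request in one round trip.
        raise ValueError("create_user rejected input:\n  " + "\n  ".join(errors))
    return {"name": name.strip(), "role": role, "quota_mb": quota_mb}
```

Even this toy shows where the 2x multipliers come from: the validation and defaulting code is already longer than the happy path.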
>> In this sense, development teams weren’t so much empowered to deliver software faster; rather, they were over-encumbered with infrastructure tasks that were outside of their expertise.
And agreed on this point. Amusingly I have had HN thread arguments just this week with DevOps advocates telling me that akshully their job isn't to empower devs, but some sort of "tail that wags that dog" interpretation around devops organizational / standardization / cost / etc.
Yes APIs are expensive and require significant overhead.
But clean, maintained interfaces between separate domains are far LESS expensive than multi-domain/team mashups.
Maintaining separate teams AND well-maintained interfaces surfaces and makes explicit the myriad inter-silo interactions.
Instead of a quick backchannel conversation and a quick hack, it is an explicit conversation and an API adjustment.
Yes, it's more expensive up front, but far cheaper and more powerful in the long run.
So for high-speed prototyping, not so suitable: break the silos, do the backchannel hack, and move on (this version will likely be binned or massively refactored anyway). But when the domain is more stable, modular with solid interfaces is the way to go.
> My overall reaction is that this is a great piece for products/teams that have reached significant scale, once the job to be done is too big and complex for one team to own, end-to-end, or there are truly reusable concerns that can be separated from the core product (e.g. auth, observability).
I'd say it's critical at the small scale too. When you get down to single-person size, this is just what well-factored code is: you write little, reasonably independent pieces that can be learned and thought about without having to think about all the other things. After all, we've all had that experience where you go back and revisit your code after a vacation or something, and it might as well have been written by someone else.
I was on a tiny team not long ago where my teammates kept writing tightly coupled systems, then rewriting everything from scratch every time. It was hell. Our product moved slowly, broke constantly, and we couldn't build on our prior work, and could barely even build on each other's work, so velocity stayed constant (read: slow).
(as a tangent, the communication patterns of remote work seem to make this more important)
Siloing teams, siloing concerns, and writing modular code are all kind of the same thing, just at different scales.
As a coda to this, I think that we are entering a new world where services truly can have minimal operational overhead.
At my last company, we were all in on serverless, with AWS tools like SQS coupling things together. It worked extremely well for keeping the architecture simple to operate and approachable for new people.
But even better, I think we want logical services that interface with each other as though they were in a single process. We want the ability to write code as though it is monolithic (e.g. lexical scope and native-language APIs), while availing ourselves of the advantages of independently deployed services. I believe projects like Temporal point the way. I haven't had the opportunity to use it, but philosophically, I think it's the right direction.
I don't have experience with it myself, but Polylith architecture looks interesting. There you compose services based on shared components. You can start developing it as a monolith, and then extract components to separate services by just changing the interface between components.
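I haven't used Polylith either (it comes from the Clojure world), but the core idea can be sketched in a few lines: callers depend on a component interface, and extracting the component to a service only swaps the implementation behind that interface. Everything below is illustrative, not Polylith's actual API:

```python
from typing import Protocol

class Pricing(Protocol):
    def quote(self, sku: str) -> int: ...

class LocalPricing:
    """In-process component: what the monolith calls directly."""
    PRICES = {"widget": 250}
    def quote(self, sku: str) -> int:
        return self.PRICES[sku]

class RemotePricing:
    """Same interface, backed by a network call once the component is
    extracted into its own service (transport injected, stubbed in tests)."""
    def __init__(self, call):
        self._call = call  # e.g. an HTTP client function
    def quote(self, sku: str) -> int:
        return self._call("GET", f"/quote/{sku}")

def checkout(pricing: Pricing, sku: str) -> int:
    # Caller code is identical whether pricing is local or remote.
    return pricing.quote(sku) + 20  # plus a flat, made-up shipping fee

print(checkout(LocalPricing(), "widget"))  # 270
```

The monolith-first workflow falls out naturally: ship `LocalPricing`, and swap in `RemotePricing` only if and when the component needs to be a separate deployable.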
The problem is, you cannot abstract away a three-order-of-magnitude difference in performance and latency. This problem grows with larger teams, as people don't understand what is happening. Abstractions shorten the learning curve, but they also greatly slow the formation of an accurate mental model of how a system works.
The mental model of using SQS and serverless is extremely simple. You put a message in, get a message out, and have your code run.
This is something I’ve been able to teach teams how to do in a few hours. This is a huge part of the value prop of these services.
Performance and latency have a trade off, sure. But you get teams who have complete ownership over their service. You get services that can scale to millions of requests per minute transparently. Some of the highest scale workloads I know use microservices behind SQS.
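That mental model really is a few lines. Here's a toy in-memory stand-in for the consume loop (not real SQS — no visibility timeouts, no network), where a failed message goes back on the queue for redelivery rather than being lost:

```python
import queue

def handler(msg: str) -> str:
    # Stand-in for the Lambda: transform the message somehow.
    return msg.upper()

def worker(q) -> list:
    """Drain the queue; a failure re-enqueues the message (at-least-once)."""
    results = []
    while not q.empty():
        msg = q.get()
        try:
            results.append(handler(msg))
        except Exception:
            q.put(msg)  # toy redelivery; assumes the handler eventually succeeds
    return results

q = queue.Queue()
for m in ("hello", "world"):
    q.put(m)
print(worker(q))  # ['HELLO', 'WORLD']
```

Real SQS adds visibility timeouts, dead-letter queues, and batch receives on top, but the shape of the loop is the same.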
> You put a message in, get a message out, and have your code run.
In my experience it's more like put a message in, get an `AuthenticationException`, copy/paste some IAM stanzas into CloudFormation and redeploy, put a message in, get a configuration error, add dependency-injected config loaders to constructors, add env vars to CloudFormation and redeploy, put a message in, get a deadlock from the threaded HTTP client due to class loading being single-threaded, refactor config loading to happen after construction but populate a cache, redeploy, put a message in, get a timeout due to a stampede on the cache, get a hefty bill from AWS for spinning up a load of long-lived serverless processes, ...
I have been working mostly with Kafka for five years, and I found SQS to be a little weird to work with. How do you keep an overview of all services that consume a queue? What if you want multiple services to read the same data in the same order? Granted I am very new to AWS.
I think you're really looking to use kinesis rather than SQS.
SNS+SQS make a pub/sub setup, but every queue subscribed to the event is fully separate, and you wouldn't expect to try to couple the different listening queues together.
You could make all the queues SQS FIFO queues and put them all in the same message group, but I think network time could still break the ordering for different subscribers?
The simple way to use SQS is to have a queue for each message type you plan to consume, and treat it as a bucket of work to do, where you can control how fast you pull the work out
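A toy sketch of that fan-out shape (in-memory only, no SNS/SQS involved): each subscribed queue gets its own copy of every published message, and the queues stay fully independent of each other:

```python
from collections import defaultdict

subscriptions = defaultdict(list)  # topic name -> subscribed queue names
queues = defaultdict(list)         # queue name -> delivered messages

def subscribe(topic: str, queue_name: str) -> None:
    subscriptions[topic].append(queue_name)

def publish(topic: str, message: dict) -> None:
    # Every subscriber gets its own copy, as with SNS fanning out to SQS;
    # nothing couples the billing queue's progress to shipping's.
    for qname in subscriptions[topic]:
        queues[qname].append(message)

subscribe("orders", "billing")
subscribe("orders", "shipping")
publish("orders", {"id": 1})
publish("orders", {"id": 2})

# Publish order holds here because there's one in-process producer; with
# many producers and real FIFO queues, ordering holds per message group only.
print(queues["billing"])   # [{'id': 1}, {'id': 2}]
```

Each queue then behaves like the "bucket of work" described above: its consumer pulls at whatever rate it can handle.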
This is one of those utopian engineer fallacies on the same level as reliable networking and fsync. People don't work like computers and don't respond like API endpoints. There will always be backchannels, backburners, pigeonholes, and all the other political games humans play, even with an "API between them".
The problem is always management, and the solution is never "more management", IMO.
I think that depends on the type of service that the team provides. If you have a central team that many other teams interact with, they risk becoming a bottleneck. They may not be interested in maintaining custom APIs for each team interaction and you will need to agree on a contract that all can live with.
Another risk is that the team providing the service also has their own backlog, including work they want to do themselves and requests from other teams. This can cause unwanted dependencies and delays, where managers fight to be prioritized at the expense of others.
Most of us don't have mega-scale problems, though. A tremendous amount of waste has been created by applying FAANG tech and processes to completely different contexts.
Sure, but scaling of an organization is not the same as scaling for traffic in a technical sense. There are so many companies that employ comparable numbers of engineers that are not big tech companies.
That makes sense to me. I think you're right that if you're a large enterprise, it may well make sense to adopt a very API-centric strategy for how teams interface, even if the scale doesn't demand it from a tech perspective.
I'm not saying it can't work, but that there are risks involved.
I have worked for several companies ranging from local startups to global enterprise (not FAANG). Each company tried the silo approach when they migrated to micro services and it caused significant delays and dependencies. They would have been better off if they focused more on larger domain services with fewer external dependencies.
I am open to the idea that Amazon has been able to avoid these problems, but it's clearly not a silver bullet.
In general I have to say I'm sceptical about comparisons with FAANG, because they live in a completely separate part of the technology sector. They have income similar to small countries and can live with inefficiencies that can break a startup.
The problem I see are big companies with several products trying to break down silos between the products to share some infrastructure (be it code, libs, actual cloud infra, support teams, design systems etc) when there is very little overlap between the different products.
All in some grand hope of reducing costs by sharing things. It almost always ends with overly generic solutions that are harder to use, take more people to support, can't be fitted well in most cases, and that everyone involved hates (causing employee attrition).
This is different from having cohesive architecture within a single product.
> Each company tried the silo approach when they migrated to micro services
Doesn't that go without saying? That's literally what micro services is: The siloing of services, just as service is provided in the macro economy, but within the micro economy of a single organization. Without silos, your service is monolithic.
Mostly because someone at a higher level said "Your APIs are not a silo, and if you act like they are you will be terminated".
The communication cost will always be there; the question is how it is implemented. In the case of an API, it tends to reduce communication costs when someone forces all teams at gunpoint to write clear, concise, and well-documented APIs and doesn't allow them to change said APIs without clear, concise, and well-documented rules.
I've worked with teams that communicated via API and started randomly changing shit without proper documentation; without management having their feet held to the fire over it, it's just a new type of silo.
Agreeing means the providing team understands and meets the needs of the consuming team. Teams that can work together to accomplish this wouldn't be called "silos."
However, when two teams work together to create an API, or a process, or some other self-service mechanism, sometimes the API is so good that the teams no longer need to talk to each other. The practices and relationships that enabled communication fade away. Walls go up, silos form, but nobody notices anything wrong, because it seems like efficiency is getting better and better. Over time, though, people start to notice that projects that encounter a need to change the API always fail. The API has become legacy, baggage, a problem.
There may still be somebody on the providing team who remembers that they used to get together in the same room with developers from the consuming team to come up with solutions together, and they'll naively suggest that as a solution, but there are now too many assumptions baked in at the management level for that to be allowed to happen. A change in the working relationship between teams means the managers will fight over what this means for different managers' prestige. Somebody's cheese is going to get moved. Managers gear up to go to war over things like that, so upper management dictates a solution that minimizes inter-manager violence, a solution that carefully circumscribes the kinds and amounts of contact that members of the two teams are allowed to have. Voila: silos with windows, and engineers sitting in their respective windows forlornly waving at each other like lovers separated by their parents.
But this is equivalent to saying “a monolith can never work because it’s highly coupled”. In both cases you need to follow best practices to make things work. API design and alignment with consumers of the API is table stakes.
Unless you work at a place where ops don't write code because "it's the dev's job to write code", so everything is built and done manually; and devs don't have access to any infrastructure and don't bother about it, because "it's the ops' job".
And management of both sides agree with this vision.
"works on my machine" and "the issue must be with the code" are the most used excuses by both sides when something fails.
The ops "api" works by email, but the response delay is usually expressed in weeks.
Most memorable quote from that place: "Yes, a 4-month delay for an answer about your new server might seem long"
I find myself these days on one of these ops teams. The lead time is the same for deploying either a code or infrastructure change, and probably closer to 3-4 months here.
It's not an operational capacity or competency issue for this org; it's the result of hours' worth of "sync" or "review" meetings with no discernible agenda, negotiating maintenance windows, facilitating approvals from a dozen or more parties who don't even comprehend what they're approving, and weeks of manual acceptance testing.
On the other extreme, in past roles at different orgs, I've been on teams doing multiple deployments to production every day, both on the dev and ops sides.
I find it exhausting and soul crushing being completely untrusted because of the mistakes made by people who left the org years before I started.
I have heard similar perspectives on this. That folks are moving away from devops and towards platform engineering. The idea being that a platform team reduces the friction to deploy code by building self-serve APIs, libraries, and infrastructure to be used by dev teams.
Even when in teams practicing devops well I have always known at least one or two people who are great developers but they don’t want to know anything about operating systems, sockets, file descriptors, service level objectives and all that. I always found working with them to be challenging.
I’m very much in the, “you wrote it, you run it,” camp.
While platform engineering sounds great, I’ve also worked with teams trying this, and it has its own trade-offs: as demands on the platform team grow, it can take longer to wait for your change requests to be deployed, and depending on how ownership at the company works, it can be frustrating: you could have fixed it yourself and shipped sooner, but now you have to live within the constraints set for you by the platform team. You also end up with a development culture that has a hard time understanding service performance objectives.
This can be a great thing for some companies for sure. But I haven’t seen a cure-all for siloing teams. Conway’s Law and all.
> I have always known at least one or two people who are great developers but they don’t want to know anything about operating systems, sockets, file descriptors, service level objectives and all that
IMO, those are not great developers. If you know nothing about how your code is running at a low level, you will reach a point at some scale where you’ve caused performance issues, and don’t have the fundamental knowledge necessary to fix it. Skilled or good, sure, but “great” implies mastery.
> I’m very much in the, “you wrote it, you run it,” camp.
As a sibling comment of mine pointed out, these are generally orthogonal skillsets. The main problem I’ve seen with it is that, due to the relative ease of XaaS, dev teams with little to no knowledge of infra can indeed stand up an entire stack, and it will work quite well at first.
Databases, for example, are remarkably fast at small scale, even when your schema is horrible and your queries are sub-optimal. But if you don’t know the fundamentals (my first point), you won’t know that it’s abnormal for a modern RDBMS to take hundreds of milliseconds for a well-written query (assuming it’s not a cold read). You see that your latency is “good enough,” so you move on with the next task.
Then, when the service gets more popular, you hit the limits of the DB, so you vertically scale (if it isn’t already “Serverless” and auto-scaling – barf). It isn’t until the bill is astronomical that someone will bother to ask why it is you need a $10K/month DB.
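A small way to see that moment, using SQLite only because it's self-contained: the query works either way, and is "fast enough" at small scale, but the plan shows a full table scan until someone actually looks:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, email TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(i, f"u{i}@example.com") for i in range(10_000)])

QUERY = "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?"
# The detail column is the last field of each plan row.
before = db.execute(QUERY, ("u42@example.com",)).fetchall()[0][-1]
db.execute("CREATE INDEX idx_email ON users(email)")
after = db.execute(QUERY, ("u42@example.com",)).fetchall()[0][-1]

print(before)  # a full table scan, e.g. "SCAN users"
print(after)   # an index lookup mentioning idx_email
```

The same lookup against a production RDBMS follows the same logic; the difference is that at 10 million rows the missing index shows up as hundreds of milliseconds and a vertically scaled bill, not a benign plan string.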
> as demands on the platform team grow it can take longer to wait for your change requests to be deployed
That sounds like an ops or deployment team, not a platform team. A key feature of a platform should be that the developers choose when a deployment happens.
> A key feature of a platform should be that the developers choose when a deployment happens.
Agreed. When I was on a platform team, we wrote tools to take a process that used to be done by a deployment team (changing a DNS record was a helpdesk ticket) and move it into a self-serve system (PR your desired DNS changes; upon merge, the system deploys them). This kept audit happy, because 'dev' wasn't touching 'prod' in the unfettered way SOC2 people stay up at night worrying about (even though Enron happened because of bad management, not Office Space, but anyways), while still giving devs effective control of when and where they wanted to make production changes, whether relatively ad hoc or as part of a CI/CD pipeline.
Humans could approve the self-service PRs, or if a list of in-code rules had been fulfilled, the PR would be auto approved (and potentially even merged but everyone but us was too afraid to set that part up).
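A sketch of what such in-code approval rules might look like; the record types, zone names, and change format below are all invented for illustration:

```python
def auto_approvable(changes: list) -> bool:
    """Hypothetical rules: skip human review only when every change in the
    PR is a low-risk record type inside a sandboxed zone."""
    SAFE_TYPES = {"CNAME", "TXT"}
    SAFE_ZONE_SUFFIX = ".dev.example.com"
    return all(
        c["type"] in SAFE_TYPES and c["name"].endswith(SAFE_ZONE_SUFFIX)
        for c in changes
    )

# A dev-zone CNAME sails through; an apex A record waits for a human.
print(auto_approvable([{"type": "CNAME", "name": "app.dev.example.com"}]))  # True
print(auto_approvable([{"type": "A", "name": "www.example.com"}]))          # False
```

Keeping the rules in code means the approval policy itself goes through review and has an audit trail, which is much of what keeps the compliance folks comfortable.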
Platform means different things to different people, but it should offer a standardized way of doing things that's well supported, while allowing for customization by developers when needed (with less support when you step off the golden path). It's a combination of software, processes, and management buy-in.
The way I see most "DevOps" teams working is that they're just writing scripts that do whatever was requested, without much thought about how sustainable that is, how company-wide policies can be enforced, or how to retrofit improvements to other codebases... It's all very quick-and-dirty solutions, one after another, until they end up in software engineering madness, with developers complaining that things take too long or break easily while devops engineers complain that developers don't know what they are doing. It's not a productive situation to be in.
I think platform engineering is just about having a systematic approach that gives developers more peace of mind so they can focus on actually coding features while giving the rest of the organization a bit more control points about how things are maintained. It's the 80/20 rule applied to devops, I guess. Just enough centralization.
I'm also very excited about platform engineering and I think it's a natural progression because, frankly, what people call DevOps these days is just a nightmare. God forbid the org has "distributed DevOps" a.k.a. do whatever you want in your team and when it's time to make a global change we will work with >20 different ways of doing something. That will be quick.
> what people call DevOps these days is just a nightmare.
I agree! It's taken on a, "you'll know it when you see it," kind of definition and it's hard to pinpoint what "DevOps" is and whether your organization is practising it.
And so I've often seen it become a veneer for the original "silos" it was meant to break down: those handful of developers who still want nothing to do with managing their code and services in production get to throw code over the wall and someone else gets to hold the pager and keep it running, make it fast, etc.
In other words, the company hires "devops" which becomes a new title for "system administrator," and everything stays the same.
Platform engineering, devops, it's all evolving... but some things never seem to change.
A platform team will eventually, and quite quickly, become a bottleneck, especially if they offer abstractions over infrastructure, PaaS, SaaS, etc.
Such a team is no different from previous admin/infra teams in its operating model: when their offering breaks or falls short, dependent teams wait or are stuck.
A team should be self-sufficient and able to decide for themselves how they cope with what they need to move forward in a sustainable manner. That means they should do everything themselves toward enabling their offering.
I have worked with so called CI/CD teams at least 10 times and it doesn't foster what devops is all about, which is essentially collaboration, mandate and responsibility for your software, inside a smaller team.
It comes down to how you organise and where decisions are placed and carried out, something that reflects centuries of command-and-control organisations, which have been about control rather than enabling autonomy.
I have worked a few places where the platform decision was made at a broad level (AWS, for example), and from there every individual team could do whatever they wanted.
Every software/product team should be able to do everything they need to develop, test, deploy, secure and monitor their things. It works in a lot of places, so of course it can work in a lot of other places.
The problem with "You wrote it, you run it" is they are completely orthogonal skillsets, and you lose the benefits of specialization. It introduces a high amount of context switching, and loses consistency in how services operate.
APIs don't live in a vacuum. Sorry, but you can't just ship a service that exposes an API and expect people to use it. It MUST be documented. And no, your auto-generated Swagger/OpenAPI docs don't cut the mustard.
If, as an executive, you expect the teams you set up to actually be independent, then treat them like Product teams in their own right. Set expectations to write Product-quality documentation, at least as good as what you ship to customers. Hire internal Product managers for those teams, who will learn how internal customers use those APIs and what else they need to solve their problems. Hire internal Marketing to ensure everybody else knows that the APIs exist. Sound ridiculous? What, do you expect your Engineering teams to have these skillsets already? If you don't expect your company's product to succeed without Product and Marketing then why, pray tell, would you expect your internal products to be any different?
Mandating that is a terrible idea but I think giving senior people (and go-getting junior ones) in team A an opportunity to do a 6-12 month rotation in team B, and so on, is a great idea.
I can resonate with the author, given the context of working at scaleups and big companies, where coordination and communication take a very heavy toll on the actual work.
I do not believe in this model of “big collaboration”, with folks swarming around a problem, each one with partial context, trying to fix something that structurally has heavy communication, coordination and technical coupling.
The best work experiences that I had were in teams where we established the APIs, the S3 buckets where the processed batch files would land, and maybe an e-mail list where someone would reply if something changed.
I do not have a straight answer, but one thing that I have seen work is the concept of “good fences”, where folks still act as a team but have clear touch points for problems, and when things go south everyone knows what to do.
I thought this was going to be about siloed products. Like AWS is a silo, because once you use AWS you are effectively locked in, whereas WinterCG and Unix standards make implementors non-siloed.
Oh well, this take would be a lot more spicy if so
Silos are not a problem. Leadership quality is a problem that silos can greatly exacerbate.
Silos exacerbate two problems. The first is that the area between two silos can be poorly owned or lack stewards. The second is that the first manager above two different silos is the de facto resolver of disputes and resource application for and around those silos. This area often has no advocate, because to advocate for resource application to that area is to volunteer yourself. It becomes a blind spot to leadership through systematic neglect. In many cases this leadership can be a checked-out ex-Google CTO who has never run an org under resource constraints and who isn't hungry because they are already well off. Checked-out rest-and-vest early employees who, through the Peter principle, end up in CTO positions can also be extremely dangerous to organizations.
If you have poor leadership, you end up with two silos that don't want extra responsibility, because more responsibility without more resources is a losing prospect. Under bad leadership, anything that is not feature production is not rewarded. The end result is that each silo becomes aligned against other silos, rather than aligned to a business goal.
Defining an API seems aimed primarily at completely removing the area between two silos, thus alleviating the problem of opaque ownership. I agree. Explicit ownership of all artifacts helps an organization run much more effectively.
The real problem is a culture where every individual employee does not feel responsible for business outcomes with proportionate recognition by leadership of taken responsibility for those outcomes.
Here is Admiral Rickover's take on culture:
> Professionalism occurs when individuals act in the best interest of those being served according to objective values and ethical norms, even when an action is perceived to not be in the best interest of the individual or their organization. That is, there are times when professionals must sacrifice their own interest (or that of their organization) to meet the objective values and ethical norms of the profession. Professionals, in this sense, are serving something greater than the bureaucratic organization that employs them.
> If Admiral Rickover had a mantra to shape a professional culture, it would have been, "I am personally responsible."
When leadership does not practice personal responsibility, or engages in blame, that ripples through the entire organization. Silos that don't function well are symptoms of leadership failing to take responsibility: a cultural failure, not necessarily a structural one.
There is a difference between silos and domains. It's great for domains to be separated and independently well defined. The silos I have seen cause problems sit between organizational functions, like "Product" and engineering, or "The Business" and implementation teams.
I have seen managers create task funnels for requests coming from other teams. They assume this has fixed all coordination issues.
Later I see engineers having to reach out to other teams week after week for coordination. Things are not in place, people don't respond, and the coordination "APIs" don't work as documented. I have seen individuals blamed for dependency failures they had no part in.
I rarely see the obvious solution tried: why don't managers spend their days, day in and day out, coordinating work across team/org/corporate boundaries?
Instead of doing coordination, managers seem to spend their time on HR work: vacations, performance calibrations, documenting ways to blame engineers. They create these APIs and think their job is done. Why? How? What a severe deadweight on the company.
I have no comment on this article's specific claims, and I have every reason to believe the author is both intelligent and insightful.
That said, so many times in my professional life I run into the same problem: people think some idea is a truism which broadly applies to all or most situations. The problems faced are seldom deeply understood, and so that truism is misapplied by people attempting to follow the latest best practices.
This seems like a pernicious meta-problem related to profitability and work resources. I.e., people cannot deeply understand every problem, so they are content to make larger errors in a number of places as long as "more work" is getting done. I don't think this problem will ever disappear.
Coming from a job where we transitioned to cross-functional teams: all we did was reorient the silo. The business was unhappy with teams oriented around specific projects, and felt we were struggling to produce quality releases on the roadmap because teams lacked understanding of the big picture and had competing priorities.
With a CFT, we would focus on a vertical slice of whatever systems it took to deliver a feature, and orient our team structure around the right expertise in the systems we’d need to touch. We would also take responsibility for managing our own releases rather than relying on bottleneck-prone maintainers to do it.
This seemed sensible at first; it even gave us a sudden burst of productivity. It soon fell apart, however, as it became incredibly difficult to manage releases across products where multiple teams had competing priorities. Quality quickly plummeted, releasing software got even harder, and the problem compounded at the end of every sprint. We still ended up with a bottleneck of a few experts sorting through the firehose of changes each team was trying to push out. Our workload actually increased, along with stress and frustration.
The issue in this case was a perfect storm of tech debt, poor planning, poor architecture, and simply too many priorities. The business thought that rotating the team structure 180 degrees would magically multiply our productivity, and somehow convinced itself that the work these teams do is completely independent, all while building out a roadmap with very little input from the engineers. Beyond investing in fixing the technical and organizational issues with just producing software, I would rather leadership had focused the entire business on a small set of complementary, or at least non-competing (limited by available resources, of course), priorities so we could each focus on a piece of the puzzle and bring our particular expertise to bear on a coordinated, high-quality solution. I certainly agree that orienting teams around their microservice or whatever was bad, but is orienting around a project/feature really any different? We should have been oriented around a holistic product and the customers' well-being.
This is true for some definition of a silo, but to me the term inherently implies difficulty in interfacing. If you can easily pull information out or push it in, it's not a silo in the sense people complain about.
So to me this is not so much a correction of the definition as a technical resolution. There are several problems, though. The owner of a silo may benefit from it; this is called "lock-in". If the company has any say, the silo owner should be properly incentivized to ensure good citizenship.
This is how it starts. First ChatGPT starts suggesting all these bloggers' "silos are fine as long as they have APIs" articles, and it's silo this, silo that. Soon enough the Overton window has shifted and we're talking about military silos and silos having APIs in the same sentence.
Before you know it, GPTskyNet5 is firing off nukes through poorly secured, poorly thought-out webhooks set up at missile silos by some random defense contractor, because a DoE scrum master needed a promotion.
I'd suggest that silos are necessary as human communication bandwidth is limited. Everyone talking to everyone, all the time doesn't scale. The key is to create silos along natural "fault lines" such that less communication is required.
Nothing wrong with having clear boundaries and different owners across high-quality data sources, but IME "silo" usually comes up in the context of egregious data duplication and ambiguous sources of truth.
I was really hesitant to use UUIDs as primary keys because random UUIDs are basically worst-case clustering performance. I ended up just using a 32-bit Unix time followed by 12 random bytes. Not sure why that isn't an official version.
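The commenter's scheme (a 32-bit Unix-time prefix plus 12 random bytes, 16 bytes total) can be sketched in a few lines. As it happens, UUIDv7 from RFC 9562 is the since-standardized cousin, using a 48-bit millisecond timestamp instead of 32-bit seconds. A minimal sketch; the function name is mine:

```python
import os
import struct
import time
import uuid


def time_ordered_uuid() -> uuid.UUID:
    """16-byte ID: 32-bit big-endian Unix time + 12 random bytes.

    Because the timestamp leads, IDs generated later compare greater
    byte-wise, so B-tree inserts land near the right edge of the index
    instead of on random pages.
    """
    ts = struct.pack(">I", int(time.time()))  # 4 bytes, seconds since epoch
    return uuid.UUID(bytes=ts + os.urandom(12))


ident = time_ordered_uuid()
```

The big-endian pack is what makes byte-wise comparison agree with time order; a little-endian timestamp would scatter inserts again.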
If you insist on using a B-tree index with random UUIDs, that's certainly the case. Many of the schemes used to generate UUIDs in the past also had terrible privacy implications: Office 95 would fill documents with UUIDs generated from MAC addresses and timestamps, so Office documents could be tracked to particular machines, until Microsoft changed this with little fanfare.
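For illustration, the scheme described is essentially version-1 UUIDs, which Python's `uuid.uuid1()` still produces: the trailing 48 bits are the "node" field, historically the machine's MAC address (modern implementations may substitute a random multicast address instead):

```python
import uuid

u = uuid.uuid1()  # version 1: 60-bit timestamp + clock sequence + node

# The trailing 48 bits are the node field. When a real MAC address lands
# here, every document stamped with such a UUID identifies the machine
# that produced it, which is the Office 95 problem described above.
print(f"node: {u.node:012x}")
```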
An API is actually not good enough. It's the minimum you could possibly have to allow communication. So the silos can "work together" through that API, but problems then occur due to a lack of understanding of how each silo actually works under the hood.
Imagine a bunch of microservices built by different teams. They just send each other their OpenAPI specs and some URIs. So they start calling each other's APIs. Everything seems to work fine.
But wait. Are there limits on this API? How many calls, or how much data, can I send? What's the SLA on these transactions? What happens to the data: how is it stored, processed, backed up? If I send data X to service Y, do I know service Z will get the same data? If one of these services goes down, does everything go down, and is that team staffed for 24/7 support? Do they even know what to do when things are down? When everything does go down, how do these silos know which thing was the cause and whom to alert to fix it? Does fixing it take multiple silos?
All of that and much, much more, is deeper knowledge related to the entire system, which is the inter-relation of all these silos from top to bottom and sideways. The API doesn't solve these problems or answer the questions. The API only tells you how to do one thing.
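To make that concrete, here's a hedged sketch of what a calling team ends up writing anyway: the spec defines the endpoint's shape, but the timeout, retry count, and backoff encode guesses about the other team's availability that no OpenAPI document answers (all numbers and the function name below are illustrative, not from any real SLA):

```python
import time
import urllib.request


def call_with_budget(url: str, timeout_s: float = 2.0, retries: int = 3) -> bytes:
    """Call a dependency with our operational assumptions made explicit.

    The API spec tells us how to make the call; it does not tell us how
    long to wait, how often to retry, or when to give up. Those numbers
    are ours to own, and they are guesses until the other team documents
    its limits and SLA.
    """
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except OSError:  # URLError and socket timeouts both derive from OSError
            if attempt == retries - 1:
                raise
            time.sleep(0.1 * 2**attempt)  # exponential backoff between tries
```

None of this defensive scaffolding appears in either team's API contract, which is the gap the comment is pointing at.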
The premise of silos is the idea that you don't need to know anything about the rest of the world beyond some tiny bit of information. Well, reality says otherwise.
DevOps is not about "merging teams", but communication and collaboration between teams. They should understand each other well, or at least make it much easier to discover the right information in order to improve outcomes. You absolutely have to have specialized teams where people have domain knowledge. But you also need to provide the tools and practices that enable very different teams to work together to build things right and solve problems quickly.
Say you're building cars. A dealership's mechanics notice a belt keeps rubbing on a cable or hose. The assembly people need to be notified quickly to check whether it's an assembly problem; if not, it goes to the mechanical engineers to address a potential design flaw. All of that needs to happen as soon as the problem is noticed, because cars are shipping every day: the longer that loop takes to complete, the more bad cars ship. By improving the loop between all these different groups, you improve business outcomes. But "an API" isn't going to do that.
That's why the idea of a "DevOps Engineer" is wrong. This isn't an engineering problem. This is a business process problem. The communication between teams is not an API, it is really organizational structure and practice. Engineers noticed the problem, and wanted to fix it, but they failed to use the language of management. So instead people slapped "Engineer" on the concept and everyone got confused.
Because interfacing via API is expensive. Writing APIs for others to use productively isn't easy and change management also adds a lot of overhead. And if we're talking about network APIs, there's a ton of distributed systems complexity to account for.
> The problem with the DevOps movement is that it ended up taking “shifting left” to the extreme. In this sense, development teams weren’t so much empowered to deliver software faster; rather, they were over-encumbered with infrastructure tasks that were outside of their expertise.
This. In truth, I think this is a major misinterpretation of DevOps, which is meant to empower devs without loading them down with incidental complexity. But I experienced exactly this misinterpretation at the first place I worked that had embraced DevOps culture.