When Your Deployment Platform Becomes the Problem: Evaluating Overengineering
Every platform team wants to build something elegant. The result is often a deployment system that works perfectly for the team that built it and confuses everyone else. Distinguishing between necessary complexity and overengineering is a core skill for senior infrastructure engineers.
This post provides a framework for evaluating deployment platforms and identifying when the cure has become worse than the disease.
The Pattern
You inherit or are asked to evaluate a deployment system. The documentation is extensive. There are custom tools, proprietary CLIs, bespoke configuration formats, and multiple layers of abstraction. The team that built it speaks about it with pride.
But something feels off. Simple changes require multiple pull requests across different repositories. New engineers take months to become productive. When deployments fail, debugging requires understanding three or four different systems.
The question is whether this complexity is justified or whether the platform has become its own problem.
Red Flags to Watch For
Custom Solutions for Solved Problems
Every major CI/CD platform provides environment variables, secrets management, approval gates, and audit logging. When a team builds custom tooling to replicate these features, ask why.
Common justifications that often do not hold up:
"We needed more control over the approval workflow." GitHub Environments, GitLab protected environments, and similar features handle approval gates with full audit trails. Custom approval systems add maintenance burden without proportional benefit.
"We needed centralized configuration management." Tools like Helm, Kustomize, or even simple per-environment YAML files handle this without custom merge logic.
"We needed artifact immutability." Tagged Docker images and standard release conventions provide this. SHA pinning is good practice, but you do not need a custom "freeze" mechanism to achieve it.
The legitimate cases for custom tooling are rare: unusual compliance requirements, genuinely novel deployment patterns, or scale that breaks standard tools. Most organizations are not Netflix or Google.
Configuration Merge Complexity
A three-layer configuration system in which Service Profiles merge with Environment Profiles, which in turn merge with Bounded Profiles, sounds elegant in documentation. In practice:
The debugging story becomes painful. When a value is wrong in production, you trace through three files in different repositories to understand what set it. Was it the developer's service config? The platform team's environment defaults? The per-service-per-environment overrides?
Type coercion introduces subtle bugs. If one layer specifies `cpu: "512"` (string) and another specifies `cpu: 512` (integer), the merge behavior depends on implementation details that developers should not need to understand.
IDE support breaks. Developers see schema validation errors for fields that will be provided by other layers. When the tooling tells developers to "ignore the errors," that is a design smell.
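Here is a deliberately simplified sketch of why this hurts, assuming a last-writer-wins merge over three hypothetical profiles (the names and values are illustrative). Even with provenance tracking bolted on, both the value of cpu and its type depend on merge order:

```python
# Three hypothetical layers, standing in for Service / Environment / Bounded profiles.
service_profile = {"cpu": "512", "memory": 1024}      # a developer wrote cpu as a string
environment_profile = {"cpu": 512, "replicas": 3}     # the platform default is an integer
override_profile = {"replicas": 5}


def merge_with_provenance(layers):
    """Last-writer-wins merge that also records which layer set each key."""
    merged, provenance = {}, {}
    for name, layer in layers:
        for key, value in layer.items():
            merged[key] = value
            provenance[key] = name
    return merged, provenance


config, origin = merge_with_provenance([
    ("service", service_profile),
    ("environment", environment_profile),
    ("override", override_profile),
])

print(config)  # {'cpu': 512, 'memory': 1024, 'replicas': 5}
print(origin)  # {'cpu': 'environment', 'memory': 'service', 'replicas': 'override'}

# Reverse the layer order and cpu becomes the string "512": code that expects an
# integer then misbehaves silently, e.g. config["cpu"] * 2 yields "512512".
```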
Most organizations get by fine with straightforward per-environment configuration files. The conceptual elegance of layered merging rarely survives contact with operational reality.
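By contrast, the boring alternative is one complete file per environment, loaded as-is. A minimal sketch, assuming a hypothetical config/ directory and PyYAML:

```python
import yaml  # PyYAML


def load_config(environment: str) -> dict:
    """Load the complete configuration for one environment from a single file."""
    # Hypothetical layout: config/development.yaml, config/staging.yaml, config/production.yaml
    with open(f"config/{environment}.yaml") as f:
        return yaml.safe_load(f)


# No merge order to reason about: if a production value is wrong,
# it is wrong in config/production.yaml and nowhere else.
config = load_config("production")
print(config["cpu"], config["replicas"])
```

The cost is some duplication across files; the benefit is that every value has exactly one source.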
Environment Proliferation
When you see seven environments (dev2, test4, test1, test3, test2, sandbox1, prod1), ask what each one is for. Often the answer reveals organizational dysfunction rather than technical requirements.
Common patterns:
"Test1 and test3 are for the Test stage, test4 is for integration testing, test2 is deprecated but still running." This is not architecture; this is archaeology.
"We have separate UAT and Sandbox environments for different approval workflows." Unless there is a compliance requirement, this is process complexity masquerading as infrastructure.
"Developers need isolated environments." Fair—but you can achieve isolation with namespaces, feature branches, or ephemeral environments rather than permanent infrastructure with dedicated configuration.
The baseline should be three environments: development, staging, production. Additional environments require explicit justification tied to business or compliance requirements.
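If developer isolation is the genuine requirement, a thin wrapper around standard primitives usually covers it. The sketch below assumes Kubernetes with kubectl on the PATH and an entirely hypothetical preview- naming convention; the point is that an ephemeral namespace per branch does not require permanent environments with dedicated configuration.

```python
import re
import subprocess
import sys


def preview_namespace(branch: str) -> str:
    """Derive a Kubernetes-safe namespace name from a branch name."""
    slug = re.sub(r"[^a-z0-9-]", "-", branch.lower())
    return ("preview-" + slug)[:63].strip("-")  # namespace names max out at 63 chars


if __name__ == "__main__":
    branch = sys.argv[1] if len(sys.argv) > 1 else "main"
    namespace = preview_namespace(branch)

    # Idempotent create: render the namespace manifest, then apply it
    # (creates it the first time, no-ops on reruns).
    manifest = subprocess.run(
        ["kubectl", "create", "namespace", namespace, "--dry-run=client", "-o", "yaml"],
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, check=True, text=True)

    print(f"Deploy the branch build into {namespace}; delete it when the branch merges.")
```

Tearing the namespace down when the branch merges keeps the permanent environment list at three.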
Dependency on Tribal Knowledge
The ultimate test: how long does it take a competent engineer unfamiliar with the system to ship a change to production?
If the answer involves reading 15 documents, understanding three custom CLIs, knowing which Slack channels to ask in, and learning unwritten conventions, the system has failed its primary purpose.
Manual recovery procedures are especially revealing. If fixing a failed deployment requires you to:
- Clone three repositories
- Set up a Python virtual environment
- Install a custom CLI
- Generate merged configuration files locally
- Copy files to specific locations in another repository
- Run Terraform manually
- Update a Release Manager UI to reflect what you did
...then the platform has created operational burden, not reduced it.
Questions to Ask When Evaluating
What is the bus factor?
Who built this, and are they still here? Custom Backstage plugins, proprietary CLIs, and bespoke configuration systems represent ongoing maintenance commitments. If the team that built them leaves, who maintains them?
What is the onboarding story?
How long does it take a new engineer to become productive? If the answer is months rather than days, the abstraction is not working.
What happens when production breaks at 2 AM?
How many systems does someone need to check? How many tools do they need to use? The debugging surface area often reveals hidden complexity.
What is the migration path?
If this system becomes unmaintainable, how hard is it to move to something else? Lock-in to custom tooling is a real cost.
Could you achieve 90% of this with standard tools?
GitHub Actions or GitLab CI with environment protection rules. Helm or Kustomize for templating. Standard audit logging. Tagged releases for immutability. If the answer is yes, the custom tooling needs strong justification.
When Complexity Is Justified
Not all complex systems are overengineered. Legitimate complexity drivers include:
Regulatory requirements. Financial services, healthcare, and government contexts sometimes mandate specific approval workflows, audit granularity, or environment isolation that standard tools cannot provide.
Genuine scale. Hundreds of services with dozens of teams may need centralized configuration management and automated guardrails that smaller organizations do not.
Unusual deployment patterns. Edge computing, multi-region active-active, or hardware-integrated systems may require custom orchestration.
The key is whether the complexity addresses a real problem that simpler solutions cannot solve, or whether it addresses problems the team imagined they might have someday.
The Conversation to Have
When evaluating a complex deployment system, the goal is not to criticize the people who built it. Platform teams usually have good intentions and are solving real problems as they understand them.
The productive conversation is about total cost of ownership:
"This system works well for the team that built it. What happens when they move on? How long does onboarding take? What is the debugging story when things fail?"
Most overengineered platforms can be incrementally simplified. The first step is acknowledging that complexity has costs and that those costs compound over time.
Conclusion
Senior engineers add value not by building complex systems but by knowing when complexity is necessary and when it is not. The deployment platform that requires a hundred pages of documentation, three custom tools, and months of onboarding is not necessarily more capable than one that uses standard CI/CD features with a few well-chosen conventions.
When evaluating infrastructure, ask what problems the complexity actually solves and whether simpler approaches could achieve similar outcomes. The best systems are the ones that new engineers can understand quickly and that fail in obvious, debuggable ways.