
The Art of Right-Sizing: A Framework for Infrastructure Cost Decisions

February 2025

Cloud cost optimization articles typically provide the same advice: use reserved instances, turn off unused resources, right-size your instances. This is correct but incomplete. It treats cost optimization as a technical exercise when it is fundamentally a risk management problem.

Senior engineers understand that the cheapest option is not always the right option. The right option is the one that balances cost, risk, operational overhead, and business constraints. This post provides a framework for making those decisions.

The Real Problem

A client runs a PHP application on a c5.xlarge instance. The application uses 5% CPU. The obvious advice is to downsize aggressively—move to a t3.medium and save 75%.

But the application license is tied to the host. Testing is not possible because it is a production system. The client cannot afford downtime.

Now the problem is different. The question is not "what is the cheapest instance?" but "what is the appropriate risk-adjusted choice given these constraints?"

A Decision Framework

Step 1: Understand the Constraints

Before optimizing anything, enumerate what you cannot change:

Technical constraints:

  • Can you test changes before production? If not, you need conservative choices.
  • Are there licensing restrictions? Some software is licensed per-core or per-host.
  • Does the workload require specific CPU features? ARM instances are cheaper but not universally compatible.
  • What are the performance characteristics? Burstable instances work for variable workloads but not for sustained high CPU (a measurement sketch follows these lists).

Business constraints:

  • What is the cost of downtime? A $100/month savings means nothing if it causes a $10,000 outage.
  • Who owns this decision? Sometimes the cheapest technical choice is politically expensive.
  • What is the change management process? Some organizations cannot make infrastructure changes quickly.

Operational constraints:

  • Who will monitor this after the change? Aggressive optimization requires active monitoring.
  • What is the rollback plan? If something goes wrong, how quickly can you revert?
  • Do you have the expertise to manage the alternative? Spot instances are cheap but require understanding interruption handling.
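
Enumerating constraints usually starts with measurement: a point-in-time "5% CPU" reading says little about peaks or nightly batch jobs. Below is a minimal sketch, assuming boto3 credentials are configured, that pulls two weeks of CPU history from CloudWatch; the instance ID is a placeholder.

    # Pull average and peak CPU utilization before assuming a workload is oversized.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def cpu_summary(instance_id: str, days: int = 14) -> dict:
        """Average and peak CPUUtilization over the lookback window (hourly datapoints)."""
        end = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=end - timedelta(days=days),
            EndTime=end,
            Period=3600,
            Statistics=["Average", "Maximum"],
        )
        points = resp["Datapoints"]
        return {
            "avg": sum(p["Average"] for p in points) / len(points) if points else None,
            "peak": max((p["Maximum"] for p in points), default=None),
        }

    print(cpu_summary("i-0123456789abcdef0"))

Peak utilization matters as much as the average: it determines whether a burstable instance would exhaust its CPU credits under real load.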

Step 2: Categorize the Workload

Not all workloads should be optimized the same way:

Production, customer-facing, revenue-generating:
Optimize conservatively. The cost of failure exceeds the cost of over-provisioning. Maintain headroom. Use reserved capacity for predictable workloads.

Production, internal, non-critical:
Moderate optimization. Right-size based on actual usage. Consider spot instances with proper fallback. Accept some risk of degraded performance.

Development and testing:
Aggressive optimization. Use spot instances. Implement start/stop scheduling (turn off at night and on weekends). Accept that environments may occasionally be unavailable.

Temporary or experimental:
Maximum optimization. On-demand only (no commitments). Delete aggressively when done.
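
These categories become actionable when they are encoded somewhere automation can read them, typically as tags. A minimal sketch of a tag-driven policy table follows; the cost-category tag key and the field names are illustrative conventions, not an AWS feature.

    # Map a workload category (carried as a tag) to an optimization posture.
    # The "cost-category" tag key and these field names are illustrative conventions.
    OPTIMIZATION_POLICY = {
        "prod-critical": {"allow_spot": False, "allow_commitments": True,  "schedule": None,            "headroom": 0.40},
        "prod-internal": {"allow_spot": True,  "allow_commitments": True,  "schedule": None,            "headroom": 0.20},
        "dev-test":      {"allow_spot": True,  "allow_commitments": False, "schedule": "08-20 Mon-Fri", "headroom": 0.0},
        "experimental":  {"allow_spot": False, "allow_commitments": False, "schedule": None,            "headroom": 0.0},
    }

    def policy_for(tags: dict) -> dict:
        # Untagged workloads get the most conservative posture by default.
        return OPTIMIZATION_POLICY.get(tags.get("cost-category"), OPTIMIZATION_POLICY["prod-critical"])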

Step 3: Calculate the Risk-Adjusted Cost

Raw instance pricing is misleading. Consider:

The cost of monitoring:
Aggressive optimization requires active monitoring. CPU credit balances, spot interruption rates, and utilization patterns all need attention. If no one is watching, conservative over-provisioning is cheaper than the eventual outage.

The cost of incidents:
If right-sizing causes one production incident per year, what does that incident cost? Include engineering time, customer impact, reputation damage, and any SLA penalties.

The cost of complexity:
Spot instances with fallback to on-demand, mixed instance fleets, and savings plan portfolios all require expertise to manage. That expertise has a cost, whether it is hiring, training, or consultant fees.

The cost of lock-in:
Reserved instances and savings plans provide significant discounts but reduce flexibility. If your workload might change, the discount may not be worth the commitment.
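
These factors can be rolled into a rough expected-cost comparison. The sketch below is a back-of-the-envelope model; the incident probability, incident cost, monitoring effort, and hourly rate are hypothetical numbers to replace with your own estimates.

    # Risk-adjusted monthly cost comparison. All inputs are hypothetical estimates.
    def risk_adjusted_monthly_cost(
        instance_cost: float,        # raw monthly instance price
        incident_probability: float, # chance of a sizing-related incident in a given month
        incident_cost: float,        # engineering time + customer impact + SLA penalties
        monitoring_hours: float,     # extra monthly monitoring effort this choice requires
        hourly_rate: float = 100.0,  # loaded engineering cost per hour (assumption)
    ) -> float:
        return (
            instance_cost
            + incident_probability * incident_cost
            + monitoring_hours * hourly_rate
        )

    # Aggressive right-sizing: cheap instance, some incident risk, real monitoring load.
    aggressive = risk_adjusted_monthly_cost(120, incident_probability=0.05, incident_cost=10_000, monitoring_hours=2)
    # Conservative over-provisioning: pricier instance, negligible risk, minimal monitoring.
    conservative = risk_adjusted_monthly_cost(480, incident_probability=0.005, incident_cost=10_000, monitoring_hours=0.5)

    print(f"aggressive: ${aggressive:,.0f}/mo, conservative: ${conservative:,.0f}/mo")

With these made-up inputs, the instance that costs four times as much is still the cheaper choice once risk and attention are priced in. The point is the shape of the calculation, not the specific numbers.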

Practical Patterns

The Conservative Migration

When you cannot test and downtime is unacceptable:

  1. Provision the new, smaller instance alongside the existing one
  2. Use weighted DNS or load balancer rules to shift traffic gradually (10%, then 25%, then 50%, then 100%)
  3. Monitor aggressively during the transition
  4. Keep the old instance running for immediate rollback
  5. Only decommission the old instance after a stability period

This approach costs more during the transition but eliminates the risk of a hard cutover.
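
When the shift happens at the DNS layer, weighted records make each step explicit and reversible. A minimal sketch using boto3 and Route 53 weighted routing; the hosted zone ID, record name, and addresses are placeholders, and the same idea applies to weighted target groups behind a load balancer.

    # Shift a weighted Route 53 record set gradually toward the new instance.
    # Hosted zone ID, record name, and IPs are placeholders.
    import boto3

    route53 = boto3.client("route53")

    def set_weight(zone_id: str, name: str, identifier: str, ip: str, weight: int) -> None:
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": name,
                        "Type": "A",
                        "SetIdentifier": identifier,   # distinguishes old vs. new instance
                        "Weight": weight,
                        "TTL": 60,                     # short TTL so shifts take effect quickly
                        "ResourceRecords": [{"Value": ip}],
                    },
                }]
            },
        )

    # Move from 90/10 toward 0/100 in observed, reversible steps.
    for old_w, new_w in [(90, 10), (75, 25), (50, 50), (0, 100)]:
        set_weight("Z123EXAMPLE", "app.example.com", "old-c5-xlarge", "203.0.113.10", old_w)
        set_weight("Z123EXAMPLE", "app.example.com", "new-t3-medium", "203.0.113.20", new_w)
        # ...monitor error rates and latency here before taking the next step.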

The Graduated Commitment

When you are unsure about long-term workload patterns:

  1. Start with on-demand instances
  2. After 3 months of stable usage, purchase 1-year no-upfront reserved instances or compute savings plans
  3. After 12 months, evaluate whether 3-year commitments make sense
  4. Never commit more than 70% of your baseline—leave room for optimization and change

This approach leaves money on the table compared to immediate 3-year commitments but protects against workload changes.
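
The 70% ceiling translates directly into a coverage number. A trivial sketch with a hypothetical baseline; in practice the baseline comes from at least three months of billing data.

    # How much steady-state usage to cover with commitments under a 70% ceiling.
    # The baseline figure is hypothetical; derive yours from billing history.
    BASELINE_HOURLY_SPEND = 12.40   # observed steady-state on-demand spend ($/hour)
    COMMIT_CEILING = 0.70           # never commit more than 70% of baseline

    commit_target = BASELINE_HOURLY_SPEND * COMMIT_CEILING
    print(f"Commit up to ${commit_target:.2f}/hour; leave the rest on-demand for flexibility.")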

The Scheduling Pattern

For development environments and internal tools:

  1. Implement start/stop schedules based on working hours
  2. Use EventBridge cron rules that trigger a small Lambda function to automate the schedules (see the sketch below)
  3. Provide a self-service mechanism for engineers to extend running hours when needed
  4. Track exceptions to identify environments that should run longer by default

A typical schedule (8 AM to 8 PM weekdays) reduces costs by approximately 65% compared to 24/7 operation.
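
A minimal version of the scheduler is a single Lambda handler invoked by two EventBridge cron rules, one passing a stop action in the evening and one passing a start action in the morning. The auto-schedule tag key and value below are illustrative conventions; pagination and error handling are omitted.

    # Lambda handler that stops or starts EC2 instances tagged for scheduling.
    # Wire two EventBridge cron rules to it (e.g. 20:00 weekdays -> stop, 08:00 -> start),
    # each passing a constant JSON input such as {"action": "stop"}.
    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        action = event.get("action", "stop")   # each EventBridge rule passes its own action
        states = ["running"] if action == "stop" else ["stopped"]
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag:auto-schedule", "Values": ["office-hours"]},
                {"Name": "instance-state-name", "Values": states},
            ]
        )
        instance_ids = [
            i["InstanceId"]
            for r in resp["Reservations"]
            for i in r["Instances"]
        ]
        if not instance_ids:
            return {"action": action, "instances": []}
        if action == "stop":
            ec2.stop_instances(InstanceIds=instance_ids)
        else:
            ec2.start_instances(InstanceIds=instance_ids)
        return {"action": action, "instances": instance_ids}

The self-service mechanism in step 3 can be as simple as letting engineers change or remove the tag for the night, with exceptions reviewed later.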

The Spot Strategy

For fault-tolerant workloads:

  1. Never use spot for stateful workloads without careful architecture
  2. Diversify across instance types and availability zones to reduce interruption risk
  3. Implement graceful shutdown handling (use the 2-minute interruption warning; see the sketch after this list)
  4. Always have on-demand fallback capacity defined
  5. Monitor spot pricing trends—when spot approaches on-demand pricing, switch to on-demand
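
Graceful shutdown handling in item 3 usually means watching the instance metadata service for the interruption notice. A minimal sketch using IMDSv2 follows; drain_and_deregister() is a placeholder for whatever your workload needs to do inside its two-minute window.

    # Poll the EC2 instance metadata service (IMDSv2) for a spot interruption notice
    # and begin draining when one appears.
    import time
    import urllib.error
    import urllib.request

    METADATA = "http://169.254.169.254/latest"

    def imds_token() -> str:
        req = urllib.request.Request(
            f"{METADATA}/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        )
        return urllib.request.urlopen(req, timeout=2).read().decode()

    def interruption_pending(token: str) -> bool:
        req = urllib.request.Request(
            f"{METADATA}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        try:
            urllib.request.urlopen(req, timeout=2)
            return True          # 200 response: a stop/terminate notice has been issued
        except urllib.error.HTTPError:
            return False         # 404: no interruption is currently scheduled

    def drain_and_deregister() -> None:
        """Placeholder: stop accepting new work, flush state, deregister from the LB."""

    def main() -> None:
        while True:
            if interruption_pending(imds_token()):
                drain_and_deregister()
                break
            time.sleep(5)        # the notice arrives roughly 2 minutes before interruption

    main()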

What Senior Engineers Do Differently

Junior engineers optimize for the metric they are given. If the goal is to reduce costs, they reduce costs. If that causes an outage, the goal was wrong.

Senior engineers optimize for outcomes. They understand that:

The goal is not minimum cost. The goal is appropriate cost. A system that costs $1,000/month and never fails is often better than one that costs $500/month and fails quarterly.

Cost optimization is ongoing, not one-time. Workloads change. Pricing changes. What was optimal six months ago may not be optimal today. Build review cadences into your process.

Visibility precedes optimization. You cannot optimize what you cannot measure. Before cutting costs, implement tagging, cost allocation, and monitoring. Many organizations find that simply knowing where money goes changes behavior enough to reduce costs.

Organizational context matters. The technically optimal solution may not be the organizationally optimal solution. A 20% savings that requires ongoing manual intervention may not be worth it for a team that is already stretched thin.
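
The visibility point is the easiest to automate. Once resources carry a cost allocation tag, the Cost Explorer API can break spend down along it. A minimal sketch with boto3; the team tag key and the date range are placeholders, and the tag must already be activated as a cost allocation tag in the billing console before it appears here.

    # Monthly spend grouped by a cost-allocation tag via Cost Explorer.
    import boto3

    ce = boto3.client("ce")

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )

    for group in resp["ResultsByTime"][0]["Groups"]:
        tag_value = group["Keys"][0]                      # e.g. "team$payments"
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"{tag_value}: ${float(amount):,.2f}")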

The Questions to Ask

Before making any infrastructure cost decision, work through these questions:

  1. What is the blast radius if this goes wrong?
  2. Can we test this change before production?
  3. Who will monitor this after implementation?
  4. What is the rollback plan and how long does it take?
  5. What constraints limit our options?
  6. Is this a one-time optimization or does it require ongoing management?
  7. What is the total cost of ownership, including operational overhead?

If you cannot answer these questions, you are not ready to make the change.

Conclusion

Cost optimization is risk management in disguise. The senior engineer's job is not to find the cheapest option but to find the appropriate option given constraints, risk tolerance, and operational capacity.

Sometimes that means aggressive right-sizing. Sometimes it means deliberately over-provisioning to buy peace of mind. The skill is knowing which approach fits which situation—and having the judgment to choose appropriately when the answer is not obvious.
