How we find cost anomalies in every AWS account we audit

Most companies don't have a cloud cost problem. They have a visibility problem, an ownership problem, and a timing problem. When we get access to a new AWS account, we almost always find money being wasted that nobody can explain. This post walks through the process we run: what we fix before we look at anything, which AWS tools actually matter, and four patterns that show up in almost every audit.

Most companies do not have a cloud cost problem.

They have a visibility problem. And an ownership problem. And a timing problem.

When we get access to a new AWS account, we almost always find money being wasted. Not small amounts. Real spend that nobody can explain.

Sometimes it is a few hundred dollars a month. Sometimes it is thousands. In one case, it was over ten thousand a month from a single issue.

And almost every time, the response is the same.

"We didn't know this was happening."

That is the real problem.

Cost anomalies are not rare. They happen all the time. What is rare is a system that catches them early and forces someone to act.

This is how we approach it in every audit.


What we mean by anomaly

An anomaly is not just a high number. It is a change that nobody expected.

A jump in EC2 usage overnight.

A database that slowly gets more expensive every week.

Traffic leaving your system when it should not.

Resources that are still running even though nobody uses them.

The number itself does not matter as much as the behavior. If something changed and nobody knows why, that is an anomaly.


Why most teams never see them

Most teams do not ignore cost. They just look at it too late.

They check the bill at the end of the month.

They look at totals instead of trends.

They assume someone else is watching it.

By the time someone notices, the money is already gone.

Cloud cost needs to be treated the same way you treat production systems. You would not check logs once a month and hope everything is fine.


The part nobody likes to talk about

Tools are not the main problem.

You can turn on all the dashboards in the world and still miss anomalies.

The real issue is that most teams cannot answer three simple questions:

What changed?

Who owns it?

Why did it change?

If those answers are not obvious, anomaly detection will fail.


What we fix before we look at anything

Before we open a single dashboard, we check the basics. If these are not in place, everything else becomes harder.

Ownership

Every cost needs an owner. Not a team name. Not a shared inbox. A real person or a clearly defined service owner.

If nobody owns it, nobody fixes it.

Tagging

At a minimum, every resource should tell you three things:

  • What it belongs to
  • Which environment it is in
  • Who is responsible for it

When this is missing, you end up chasing costs manually across services.
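A check like this is easy to script. As a rough sketch (the tag keys and resource shapes here are illustrative, not a standard, and in practice the data would come from your cloud inventory):

```python
# Minimal tag-completeness check. The required keys and the resource
# dicts are invented examples; adapt them to your own tagging standard.
REQUIRED_TAGS = {"Project", "Environment", "Owner"}

def untagged(resources):
    """Return (resource_id, missing_keys) for each incomplete resource."""
    problems = []
    for r in resources:
        missing = REQUIRED_TAGS - set(r.get("Tags", {}))
        if missing:
            problems.append((r["Id"], sorted(missing)))
    return problems

resources = [
    {"Id": "i-0abc", "Tags": {"Project": "api", "Environment": "prod", "Owner": "jane"}},
    {"Id": "i-0def", "Tags": {"Project": "api"}},
]
print(untagged(resources))  # only the second instance is flagged
```

Run on a schedule, this turns "tagging is missing" from a vague complaint into a daily list with names on it.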

Cost data

We enable detailed cost reporting: the Cost and Usage Report with daily or hourly granularity, stored in S3. Without this, you are working with summaries, not real signals.
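What "real signals" means in practice: line items you can roll up yourself. A minimal sketch, where the field names are simplified stand-ins for CUR columns and the rows are invented (in a real account you would query the CUR via Athena rather than loop in Python):

```python
from collections import defaultdict

# Roll CUR-style line items up into daily totals per service.
# "usage_start" and "cost" stand in for the CUR's usage-start-date
# and unblended-cost columns; the rows below are made up.
def daily_cost(line_items):
    totals = defaultdict(float)
    for item in line_items:
        day = item["usage_start"][:10]  # keep only the date part
        totals[(day, item["service"])] += item["cost"]
    return dict(totals)

rows = [
    {"usage_start": "2024-05-01T00:00:00Z", "service": "AmazonEC2", "cost": 12.50},
    {"usage_start": "2024-05-01T13:00:00Z", "service": "AmazonEC2", "cost": 7.25},
    {"usage_start": "2024-05-02T02:00:00Z", "service": "AmazonRDS", "cost": 30.00},
]
print(daily_cost(rows))
```

A monthly summary hides the shape of this data. Daily or hourly line items are what let you see a change the day it happens.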


The tools we use in AWS

We keep this simple. The same setup works in almost every account:

  • AWS Cost Explorer
  • AWS Cost Anomaly Detection
  • AWS Budgets
  • Cost and Usage Report (CUR)
  • Amazon Athena

Nothing fancy. The difference is not the tools. It is how they are used.

One practical note: AWS Cost Anomaly Detection's immediate alerts go to an SNS topic, not directly to humans; only the daily and weekly summary options can email people. You need to wire that topic up to an email alias, a Slack webhook, or whatever channel your team actually reads. The default "turn it on and walk away" configuration alerts no one.


How we actually find anomalies

This is the process we follow every time.

Step 1: Look for change, not totals

We start in Cost Explorer. Compare yesterday to the day before. Compare last week to the week before. Sort by difference.

You are not looking for the biggest service. You are looking for what moved.
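The "sort by difference" step is trivial to script. A minimal sketch with made-up numbers (in practice the two dictionaries would come from Cost Explorer's daily cost data):

```python
# Rank services by absolute day-over-day change, not by total spend.
# The figures below are invented to show why this matters.
def biggest_movers(yesterday, today):
    services = set(yesterday) | set(today)
    deltas = {s: today.get(s, 0.0) - yesterday.get(s, 0.0) for s in services}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

yesterday = {"EC2": 410.0, "RDS": 120.0, "S3": 35.0}
today     = {"EC2": 415.0, "RDS": 190.0, "S3": 34.0}
print(biggest_movers(yesterday, today))  # RDS moved most, not EC2
```

EC2 is the biggest line on the bill, but RDS is the one that moved. Sorting by delta surfaces that immediately.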

Step 2: Narrow it down quickly

Once something stands out, we isolate it: which service changed, which usage type increased. This usually takes a few minutes.

Step 3: Map it to something real

Now we connect cost to resources: which instance, which database, which environment. If tagging is good, this is fast. If not, this is where time gets wasted.

Step 4: Ask the only question that matters

Was this expected?

That question solves most of the problem. If the answer is yes, we move on. If the answer is no, we dig deeper.


What we actually find in real audits

Four patterns show up again and again.

The forgotten environment (~$4,000/month)

One account had a staging environment running for months. Nobody was using it. It had multiple instances, a database, and storage. Everything was left on after a testing cycle.

No alert. No owner. No review process. It was found only because we compared weekly cost and saw a flat line that should not exist.

The slow database creep (a few thousand/month)

Another case was less obvious. The database cost was increasing slowly every week. No spike. No alert. Just a steady climb.

When we looked closer, storage was growing and IO usage was increasing. Old data was never cleaned up. The team never noticed because nothing broke. Over time, that turned into a few thousand dollars a month in unnecessary spend.

The data transfer surprise (several thousand/month)

Costs suddenly increased without any obvious change in traffic. It turned out a service was routing requests in a way that caused unnecessary outbound traffic.

Nothing in the application changed from a feature perspective. But the architecture created a hidden cost. That alone was several thousand dollars per month.

The idle compute case (recurring monthly)

This one shows up everywhere. Instances running at low utilization for weeks. CPU barely moving. Nobody shuts them down because nothing is technically broken.

But they keep costing money.
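Finding these is mechanical once you have utilization data. A sketch, assuming CPU samples already pulled from your monitoring (the threshold and numbers here are invented, and real data would come from CloudWatch):

```python
# Flag instances whose CPU never rose above a threshold over the
# whole window. Threshold and samples are illustrative only.
IDLE_THRESHOLD = 5.0  # percent average CPU

def idle_instances(cpu_samples, threshold=IDLE_THRESHOLD):
    """cpu_samples: {instance_id: [daily average CPU %, ...]}"""
    return [
        iid for iid, samples in cpu_samples.items()
        if samples and max(samples) < threshold
    ]

samples = {
    "i-busy": [42.0, 55.1, 38.9],
    "i-idle": [1.2, 0.8, 1.5],
}
print(idle_instances(samples))  # only the idle one is flagged
```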

When anomaly detection is working, the behavior of the team changes. People stop reacting to bills. They start asking questions earlier. Ownership becomes clear. Small issues get fixed before they become large ones.


How we make it continuous

Finding problems once is not enough. You need something that runs all the time.

Alerts that matter

We configure Cost Anomaly Detection with account-level and service-level monitors. Alerts go to real channels where people see them, not to SNS topics nobody is subscribed to.

Weekly review

Once a week, someone looks at what increased, what is new, and what looks different. This takes less than thirty minutes.

Clear ownership

Every service has an owner. If something spikes, we know exactly who to ask.

Basic metrics

We track a few simple things:

  • How long it takes to detect an anomaly
  • How long it takes to fix it
  • How much of the environment is properly tagged

These tell you if your process is working.
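None of these metrics need tooling to start with. A sketch with invented timestamps and counts, just to make the definitions concrete:

```python
from datetime import datetime

# Compute the three metrics from raw facts. All values are made up.
def hours_between(start, end):
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600

# Anomaly began at 02:00, alert fired at 08:00, fix shipped at 11:30.
time_to_detect = hours_between("2024-05-01 02:00", "2024-05-01 08:00")
time_to_fix    = hours_between("2024-05-01 08:00", "2024-05-01 11:30")

tagged, total = 182, 200
tag_coverage = 100 * tagged / total  # percent of resources fully tagged

print(time_to_detect, time_to_fix, tag_coverage)  # 6.0 3.5 91.0
```

If time-to-detect is measured in weeks, the alerting is broken. If tag coverage is low, every investigation is slower than it should be.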

Automation over time

Once patterns repeat, we automate: shutdown schedules for non-production, tag enforcement, guardrails for new resources. This reduces noise and prevents the same issues from coming back.
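The shutdown-schedule piece can start as a single rule. A sketch of the decision logic (the environments, hours, and weekday convention here are assumptions, and a real version would run on a scheduler and call the cloud API to stop instances):

```python
from datetime import datetime

# Non-production shutdown guardrail: should this instance be running
# right now? Schedule values are illustrative.
WORK_HOURS = range(8, 20)  # 08:00-19:59, local to the team

def should_run(environment, when):
    if environment == "prod":
        return True  # production never gets auto-stopped
    # non-prod runs only on weekdays during working hours
    return when.weekday() < 5 and when.hour in WORK_HOURS

print(should_run("staging", datetime(2024, 5, 4, 23, 0)))  # Saturday night: False
```

This one rule alone would have caught the forgotten staging environment months earlier.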


Cost anomalies are not rare events. They are normal. What is rare is catching them early.

If you build a system where cost is visible, owned, and reviewed regularly, you stop being surprised.

And once you stop being surprised, you start being in control.

Think your AWS bill has money hidden in it?

Free 30-minute call: tell us what you're spending and what doesn't feel right. We'll share which of these patterns to look for first and what's worth investigating. No pressure to hire us. If what you need is a one-week audit, we'll say so. If it's a 30-minute conversation, that works too.

Book a 30-min call