How we find cost anomalies in every AWS account we audit
Most companies do not have a cloud cost problem.
They have a visibility problem. And an ownership problem. And a timing problem.
When we get access to a new AWS account, we almost always find money being wasted. Not small amounts. Real spend that nobody can explain.
Sometimes it is a few hundred dollars a month. Sometimes it is thousands. In one case, it was over ten thousand a month from a single issue.
And almost every time, the response is the same.
"We didn't know this was happening."
That is the real problem.
Cost anomalies are not rare. They happen all the time. What is rare is a system that catches them early and forces someone to act.
This is how we approach it in every audit.
What we mean by anomaly
An anomaly is not just a high number. It is a change that nobody expected.
- A jump in EC2 usage overnight.
- A database that slowly gets more expensive every week.
- Traffic leaving your system when it should not.
- Resources that are still running even though nobody uses them.
The number itself does not matter as much as the behavior. If something changed and nobody knows why, that is an anomaly.
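To make that concrete, here is a toy version of the idea in Python: flag any day that breaks sharply from its own recent baseline. The numbers and threshold are illustrative, not what AWS Cost Anomaly Detection does internally.

```python
def flag_anomalies(daily_costs, window=7, threshold=1.5):
    """Flag days whose cost exceeds the trailing-window average
    by more than `threshold` times. Toy heuristic for illustration."""
    anomalies = []
    for i in range(window, len(daily_costs)):
        baseline = sum(daily_costs[i - window:i]) / window
        if baseline > 0 and daily_costs[i] > baseline * threshold:
            anomalies.append(i)
    return anomalies

# Seven flat days, then a spike on day 7
costs = [100, 100, 100, 100, 100, 100, 100, 300]
print(flag_anomalies(costs))  # → [7]
```

The point is not the math. Any baseline works; what matters is that something compares today to yesterday automatically.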
Why most teams never see them
Most teams do not ignore cost. They just look at it too late.
They check the bill at the end of the month.
They look at totals instead of trends.
They assume someone else is watching it.
By the time someone notices, the money is already gone.
Cloud cost needs to be treated the same way you treat production systems. You would not check logs once a month and hope everything is fine.
The part nobody likes to talk about
Tools are not the main problem.
You can turn on all the dashboards in the world and still miss anomalies.
The real issue is that most teams cannot answer three simple questions:
What changed?
Who owns it?
Why did it change?
If those answers are not obvious, anomaly detection will fail.
What we fix before we look at anything
Before we open a single dashboard, we check the basics. If these are not in place, everything else becomes harder.
Ownership
Every cost needs an owner. Not a team name. Not a shared inbox. A real person or a clearly defined service owner.
If nobody owns it, nobody fixes it.
Tagging
At a minimum, every resource should tell you three things:
- What it belongs to
- Which environment it is in
- Who is responsible for it
When this is missing, you end up chasing costs manually across services.
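A minimal script makes this auditable. The tag keys below are examples; swap in whatever convention your team uses.

```python
REQUIRED_TAGS = {"project", "environment", "owner"}  # example keys, not a standard

def untagged_resources(resources):
    """Return (resource_id, missing_tags) for every resource that is
    missing one or more required tag keys.
    `resources` maps resource IDs to their tag dicts."""
    report = []
    for resource_id, tags in resources.items():
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            report.append((resource_id, sorted(missing)))
    return report

# Invented inventory for illustration
inventory = {
    "i-0abc": {"project": "checkout", "environment": "prod", "owner": "dana"},
    "i-0def": {"project": "checkout"},
}
print(untagged_resources(inventory))  # → [('i-0def', ['environment', 'owner'])]
```

Run something like this against a resource export on a schedule and the tagging gap stops being invisible.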
Cost data
We enable detailed cost reporting: the Cost and Usage Report with daily or hourly granularity, stored in S3. Without this, you are working with summaries, not real signals.
The tools we use in AWS
We keep this simple. The same setup works in almost every account:
- AWS Cost Explorer
- AWS Cost Anomaly Detection
- AWS Budgets
- Cost and Usage Report (CUR)
- Amazon Athena
Nothing fancy. The difference is not the tools. It is how they are used.
One practical note: AWS Cost Anomaly Detection only notifies the subscribers you configure. Alerts can go straight to individual email addresses, but for a Slack channel or any shared tool you have to route through an SNS topic and wire it to a webhook yourself. A monitor with no subscription attached alerts nobody.
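Forwarding to Slack usually ends up as a small function in a Lambda subscribed to the topic. Here is a sketch of just the formatting step; the alert field names are illustrative, so inspect the actual message your subscription delivers before relying on them.

```python
import json

def slack_payload(alert):
    """Turn an anomaly alert into a Slack incoming-webhook payload.
    The `alert` field names (account, service, impact) are assumed
    for illustration, not the exact SNS message schema."""
    text = (
        f":warning: Cost anomaly in {alert['account']}: "
        f"{alert['service']} is ~${alert['impact']:.2f} above expected."
    )
    return json.dumps({"text": text})

print(slack_payload({"account": "prod", "service": "AmazonEC2", "impact": 412.5}))
```

A Lambda would POST that JSON to your webhook URL. The mechanics are trivial; the decision that matters is which channel people actually read.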
How we actually find anomalies
This is the process we follow every time.
Step 1: Look for change, not totals
We start in Cost Explorer. Compare yesterday to the day before. Compare last week to the week before. Sort by difference.
You are not looking for the biggest service. You are looking for what moved.
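Cost Explorer does this comparison in the console, but the same idea is easy to script against exported cost data. A minimal sketch, with example service costs:

```python
def biggest_movers(yesterday, today, top=3):
    """Sort services by absolute day-over-day cost change.
    Inputs map service name -> daily cost in dollars."""
    services = set(yesterday) | set(today)
    deltas = {
        s: today.get(s, 0.0) - yesterday.get(s, 0.0)
        for s in services
    }
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

# Invented numbers: EC2 is the biggest line item, but RDS is what moved
before = {"AmazonEC2": 900.0, "AmazonRDS": 300.0, "AmazonS3": 50.0}
after = {"AmazonEC2": 905.0, "AmazonRDS": 470.0, "AmazonS3": 49.0}
print(biggest_movers(before, after))
# → [('AmazonRDS', 170.0), ('AmazonEC2', 5.0), ('AmazonS3', -1.0)]
```

Note that EC2 tops the bill but RDS tops this list. That is the whole point of sorting by change instead of totals.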
Step 2: Narrow it down quickly
Once something stands out, we isolate it: which service changed, which usage type increased. This usually takes a few minutes.
Step 3: Map it to something real
Now we connect cost to resources: which instance, which database, which environment. If tagging is good, this is fast. If not, this is where time gets wasted.
Step 4: Ask the only question that matters
Was this expected?
That question solves most of the problem. If the answer is yes, we move on. If the answer is no, we dig deeper.
What we actually find in real audits
Four patterns show up again and again.
The forgotten environment
One account had a staging environment running for months. Nobody was using it. It had multiple instances, a database, and storage. Everything was left on after a testing cycle.
No alert. No owner. No review process. It was found only because we compared weekly costs and saw a flat line of recurring spend from an environment that should have cost nothing.
The slow database creep
Another case was less obvious. The database cost was increasing slowly every week. No spike. No alert. Just a steady climb.
When we looked closer, storage was growing and I/O usage was increasing. Old data was never cleaned up. The team never noticed because nothing broke. Over time, that turned into a few thousand dollars a month in unnecessary spend.
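This pattern is easy to catch in code precisely because spike-based alerts miss it: look for consecutive weekly increases instead of a single jump. A sketch with made-up numbers:

```python
def steady_climb(weekly_costs, min_weeks=4):
    """True if cost rose every week for at least `min_weeks`
    consecutive weeks -- the 'slow creep' pattern that a
    spike-based alert never fires on."""
    streak = 0
    for prev, cur in zip(weekly_costs, weekly_costs[1:]):
        streak = streak + 1 if cur > prev else 0
        if streak >= min_weeks:
            return True
    return False

print(steady_climb([200, 210, 222, 235, 251]))  # → True  (slow creep)
print(steady_climb([200, 600, 200, 200, 200]))  # → False (one-off spike)
```

A ~5% weekly climb roughly doubles in a quarter, which is why this deserves its own check.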
The data transfer surprise
Costs suddenly increased without any obvious change in traffic. It turned out a service was routing requests in a way that caused unnecessary outbound traffic.
Nothing in the application changed from a feature perspective. But the architecture created a hidden cost. That alone was several thousand dollars per month.
The idle compute case
This one shows up everywhere. Instances running at low utilization for weeks. CPU barely moving. Nobody shuts them down because nothing is technically broken.
But they keep costing money.
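Flagging these is trivial once you have utilization data, for example average CPU from CloudWatch. A sketch with invented instance IDs and samples:

```python
def idle_instances(cpu_samples, threshold=5.0):
    """Flag instances whose average CPU stays below `threshold`
    percent. `cpu_samples` maps instance ID -> a list of
    utilization datapoints (e.g. two weeks of CloudWatch values)."""
    return [
        instance_id
        for instance_id, samples in cpu_samples.items()
        if samples and sum(samples) / len(samples) < threshold
    ]

metrics = {
    "i-busy": [40.0, 55.0, 38.0],
    "i-idle": [1.2, 0.8, 1.5],
}
print(idle_instances(metrics))  # → ['i-idle']
```

The hard part is never the detection. It is getting someone to own the decision to shut them down.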
When anomaly detection is working, the behavior of the team changes. People stop reacting to bills. They start asking questions earlier. Ownership becomes clear. Small issues get fixed before they become large ones.
How we make it continuous
Finding problems once is not enough. You need something that runs all the time.
Alerts that matter
We configure Cost Anomaly Detection with account-level and service-level monitors. Alerts go to real channels where people see them, not to SNS topics nobody is subscribed to.
Weekly review
Once a week, someone looks at what increased, what is new, and what looks different. This takes less than thirty minutes.
Clear ownership
Every service has an owner. If something spikes, we know exactly who to ask.
Basic metrics
We track a few simple things:
- How long it takes to detect an anomaly
- How long it takes to fix it
- How much of the environment is properly tagged
These tell you if your process is working.
Automation over time
Once patterns repeat, we automate: shutdown schedules for non-production, tag enforcement, guardrails for new resources. This reduces noise and prevents the same issues from coming back.
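As one example, the shutdown schedule reduces to selecting instances by environment tag. A sketch; the tag values are assumptions, and a real scheduled job would hand the result to the EC2 stop API:

```python
def stop_candidates(instances, non_prod=("dev", "staging", "test")):
    """Instances eligible for an off-hours shutdown, chosen by their
    'environment' tag. Tag key and values are example conventions;
    a real job would pass this list to ec2.stop_instances."""
    return [
        instance_id
        for instance_id, tags in instances.items()
        if tags.get("environment") in non_prod
    ]

# Invented fleet for illustration
fleet = {
    "i-api-prod": {"environment": "prod"},
    "i-api-stg": {"environment": "staging"},
}
print(stop_candidates(fleet))  # → ['i-api-stg']
```

Notice the dependency: this automation only works because the tagging basics from earlier are in place.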
Cost anomalies are not rare events. They are normal. What is rare is catching them early.
If you build a system where cost is visible, owned, and reviewed regularly, you stop being surprised.
And once you stop being surprised, you start being in control.
Think your AWS bill has money hidden in it?
Free 30-minute call: tell us what you're spending and what doesn't feel right. We'll share which of these patterns to look for first and what's worth investigating. No pressure to hire us. If what you need is a one-week audit, we'll say so. If it's a 30-minute conversation, that works too.
Book a 30-min call →