Wednesday, November 28, 2018

AWS: Some Tips for Avoiding Those "Holy Bill" Moments

Cloud is awesome: almost-100% availability, near-zero maintenance, pay-as-you-go pricing and, above all, infinite scalability.

But the last two can easily bite you back, turning that awesomeness into a billing nightmare.

And occasionally you see stories like:

Within a week we accumulated a bill close to $10K.

Holy Bill!

Here are a few tips we learned from our not-so-smooth journey of building the world's first serverless IDE - tips that could help others avoid some "interesting" pitfalls.

Careful with that config!

One thing we learned was to never underestimate the power of a configuration.

If you read the article linked above, you would have noticed that it was a simple misconfiguration: a CloudTrail logging config that was writing logs to one of the buckets it was already monitoring.

You could certainly come up with more elaborate and creative examples of creating "service loops" yielding billing black-holes, but the idea is simple: AWS is only as intelligent as the person who configures it.

Infinite loop

(Well, in the above case it was one of my colleagues who configured it, and I was the one who validated it; so you can stop here if you feel like it ;) )

So, when you're about to submit a new config update, try to rethink the consequences. You won't regret it.

It's S3, not your attic.

AWS has estimated that 7% of cloud billing is wasted on "unused" storage - space taken up by content of no practical use: obsolete bundles, temporary uploads, old hosted content, and the like.

Life in a bucket

However, cleaning things up is easier said than done. It is far easier to forget about an abandoned file than to keep track of it and delete it when the time comes.

Probably for this very reason, S3 provides lifecycle configurations - time-based, automated cleanup scheduling. You can simply say "delete this if it is older than 7 days", and it will be gone after 7 days.

This is an ideal way to keep temporary storage (build artifacts, one-time shares etc.) in check, hands-free.

Like the daily garbage truck.

Lifecycle configs can also become handy when you want to delete a huge volume of files from your bucket; rather than deleting individual files (which in itself would incur API costs - while deletes are free, listing is not!), you can simply set up a lifecycle config rule to expire everything in 1 day. Sit back and relax, while S3 does the job for you!

{
    "Rules": [
        {
            "Status": "Enabled",
            "Prefix": "",
            "Expiration": {
                "Days": 1
            }
        }
    ]
}
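If you prefer the CLI over the console, a rule like the above can be applied with aws s3api (a minimal sketch - my-bucket and lifecycle.json are placeholders for your bucket name and the file holding the rules; note that the newer lifecycle API prefers a Filter element, as in the next example, over the rule-level Prefix):

aws s3api put-bucket-lifecycle-configuration \
        --bucket my-bucket \
        --lifecycle-configuration file://lifecycle.json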

Alternatively you can move the no-longer-needed-but-not-quite-ready-to-let-go stuff into Glacier, for a fraction of the storage cost; say, for stuff under the subpath archived:

{
    "Rules": [
        {
            "Filter": {
                "Prefix": "archived"
            },
            "Status": "Enabled",
            "Transitions": [
                {
                    "Days": 1,
                    "StorageClass": "GLACIER"
                }
            ]
        }
    ]
}

But before you do that...

Ouch, it's versioned!

(Inspired by true events.)

I put up a lifecycle config to delete about 3GB of bucket access logs (millions of files, obviously), and thought everything was good - until, a month later, I got the same S3 bill as the previous month :(

Turns out that the bucket had versioning enabled, so a delete does not really delete the object; it merely adds a delete marker, while the older versions linger on (and keep getting billed).

So with versioning enabled, you need to explicitly tell the S3 lifecycle logic to expire the noncurrent (older) object versions, and to clean up the expired object delete markers, in order to completely get rid of the "deleted" content and the associated delete markers.
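A rule along these lines should do the trick (a minimal sketch, assuming the whole bucket is fair game - hence the empty prefix - and that one day of grace is enough):

{
    "Rules": [
        {
            "Status": "Enabled",
            "Filter": {
                "Prefix": ""
            },
            "Expiration": {
                "ExpiredObjectDeleteMarker": true
            },
            "NoncurrentVersionExpiration": {
                "NoncurrentDays": 1
            }
        }
    ]
}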

So much for "simple" storage service ;)

CloudWatch is your pal

Whenever you want to find out the total sizes occupied by your buckets, just iterate through your AWS/S3 CloudWatch Metrics namespace. There's no way - surprise, surprise - to check bucket size natively from S3; even the S3 dashboard relies on CloudWatch, so why shouldn't you?

Here's a quick snippet to view everything (uses aws-cli and bc, on bash):

# CloudWatch publishes BucketSizeBytes once a day, so query yesterday's datapoint
yesterday=$(date -d @$(($(date +%s) - 86400)) +%F)
for bucket in $(aws s3api list-buckets --query 'Buckets[*].Name' --output text); do
        # average standard-storage size of the bucket, in bytes
        size=$(aws cloudwatch get-metric-statistics --namespace AWS/S3 --start-time ${yesterday}T00:00:00 --end-time $(date +%F)T00:00:00 --period 86400 --metric-name BucketSizeBytes --dimensions Name=StorageType,Value=StandardStorage Name=BucketName,Value=$bucket --statistics Average --output text --query 'Datapoints[0].Average')
        if [ "$size" = "None" ]; then size=0; fi
        # print the size in MB, alongside the bucket name
        printf "%8.3f  %s\n" $(echo $size/1048576 | bc -l) $bucket
done

EC2: sweep the garbage, plug the holes

EC2 makes it trivial to manage your virtual machines - compute, storage and networking. However, its simplicity also means that it can leave a trail of unnoticed garbage and billing leaks.

Pick your instance type

There's a plethora of settings when creating a new instance. Unless there are specific performance requirements, picking a T2-class instance type with Elastic Block Store (EBS)-backed storage and 2-4 GB of RAM would suffice for most needs.

Despite being free tier-eligible, t2.micro can be a PITA if your server could receive compute- or memory-intensive loads at some point; in those cases t2.micro tends to simply freeze (probably due to running out of CPU credits), causing more trouble than it's worth.
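If you suspect this is what's happening, you can peek at the instance's remaining CPU credits via CloudWatch (a rough sketch; the instance ID below is a placeholder):

# CPU credit balance of a t2 instance over the last few hours
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
        --metric-name CPUCreditBalance \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --start-time $(date -u -d '3 hours ago' +%FT%TZ) --end-time $(date -u +%FT%TZ) \
        --period 300 --statistics Average --output table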

Clean up AMIs and snapshots

We tend to take periodic snapshots of our EC2 instances as backups. Some of these get made into Amazon Machine Images (AMIs) for reuse or for sharing with other AWS users.

We easily forget about the other snapshots.

While snapshots don't get billed for their full volume sizes, they can add up to significant garbage over time. So it is important to periodically visit and clean up your EC2 snapshots tab.

Moreover, creating new AMIs would usually mean that older ones become obsolete; they can be "deregistered" from the AMIs tab as well.

But...

Who's the culprit - AMI or snapshot?

The actual charges are on snapshots, not on AMIs themselves.

And it gets tricky because deregistering an AMI does not automatically delete the corresponding snapshot.

You usually have to copy the AMI ID, go to snapshots, look for the ID in the description field, and nuke the matching snapshot. Or, if you are brave (and lazy), select and delete all snapshots; AWS will prevent you from deleting the ones that are being used by an AMI.
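If you'd rather script the cleanup, something along these lines should work (just a sketch - the AMI ID is a placeholder, and you'd obviously want to double-check what you're about to delete):

# find the snapshot(s) backing the AMI, deregister it, then delete the snapshots
ami=ami-0123456789abcdef0
snaps=$(aws ec2 describe-images --image-ids $ami \
        --query 'Images[0].BlockDeviceMappings[].Ebs.SnapshotId' --output text)
aws ec2 deregister-image --image-id $ami
for snap in $snaps; do aws ec2 delete-snapshot --snapshot-id $snap; done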

Likewise, for instances and volumes

Compute is billed while an EC2 instance is running; but its storage volume is billed all the time - right up to deletion.

Volumes usually get nuked when you terminate an instance; however, if you've played around with volume attachment settings, there's a chance that detached volumes are left behind in your account. Although not attached to an instance, these still occupy space; and so AWS charges for them.

Again, simply go to the volumes tab, select the volumes in "available" state, and hit delete to get rid of them for good.
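The same check can be done from the CLI as well (a sketch; review the list before deleting anything):

# list EBS volumes that are not attached to any instance
aws ec2 describe-volumes --filters Name=status,Values=available \
        --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table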

Tag your EC2 stuff: instances, volumes, snapshots, AMIs and whatnot

Tag 'em

It's very easy to forget what state the instance was in at the time that snapshot was made; or the purpose of that running (or stopped) instance which nobody seems to take ownership of.

Naming and tagging can help avoid unpleasant surprises ("Why on earth did you delete that last month's prod snapshot?!"); and also help you quickly decide what to toss ("We already have an 11-05 master snapshot, so just delete everything older than that").
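Tagging is a one-liner from the CLI too (the resource IDs and tag values here are just placeholders):

aws ec2 create-tags \
        --resources i-0123456789abcdef0 vol-0123456789abcdef0 snap-0123456789abcdef0 \
        --tags Key=Project,Value=my-project Key=Owner,Value=me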

You stop using, and we start billing!

Sometimes, the AWS Lords work in mysterious ways.

For example, Elastic IP Addresses (EIPs) are free as long as they are attached to a running instance. But they start getting charged by the hour as soon as the instance is stopped, or if they end up in a "detached" state (not attached to a running instance) in some way.

Some prior knowledge about the service you're about to sign up for can prevent nasty surprises of this sort. A quick look at the pricing page, or a quick Google search, can make all the difference.
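For instance, a quick way to spot Elastic IPs that are currently not associated with anything - and hence being billed - could be (a sketch, using the CLI):

# Elastic IPs with no association - the ones quietly racking up charges
aws ec2 describe-addresses \
        --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' --output text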

Pay-per-use vs pay-per-allocation

Many AWS services follow one or both of the above patterns. The former is trivial (you simply pay for the time/resources you actually use, and enjoy a zero bill for the rest of the time) and hard to miss; but the latter can be a bit obscure and quite easily go unnoticed.

Consider EC2: you mainly pay for instance runtime but you also pay for the storage (volumes, snapshots, AMIs) and network allocations (like inactive Elastic IPs) even if your instance has been stopped for months.

There are many more examples, especially in the serverless domain (which we ourselves are incidentally more familiar with), where each building block of your architecture adds a bit more to your cost.

Meanwhile, some services silently set up their own monitoring, backup and other "utility" entities. These, although (probably!) meant to do good, can quietly seep into your bill.

Such auto-created entities are among the usual culprits in our own AWS bills; there are certainly better examples, but you get the point.

CloudWatch (yeah, again)

Many services already—or can be configured to—report usage metrics to CloudWatch. Hence, with some domain knowledge of which metric maps into which billing component (e.g. S3 storage cost is represented by the summation of the BucketSizeBytes metric across all entries of the AWS/S3 namespace), you can build a complete billing and monitoring solution around CloudWatch Metrics (or delegate the job to a third-party service like DataDog).

CloudWatch itself is mostly free, and its metrics have automatic summarization and retention mechanisms, so you don't have to worry about overwhelming it with age-old garbage - or getting overwhelmed by over-the-limit capacity bills.

The Billing API

Although AWS does have a dedicated Billing Dashboard, logging in and checking it every single day is not something you would add to your agenda (at least not for API/CLI minds like you and me).

Luckily, AWS offers a billing API whereby you can obtain a fairly granular view of your current outstanding bill, over any preferred time period - broken down by services or actual API operations.

The catch is that this API is not free: each invocation costs you $0.01. Of course that is negligible - considering the risk of having to pay several dozens, or even hundreds or thousands of dollars in some cases, it is worth having a $0.30/month billing monitor (one call per day) to track down any anomalies before it's too late.
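For example, with the Cost Explorer flavour of the billing API, a daily per-service cost breakdown could look something like this (a sketch; the date range is arbitrary):

# daily cost for the current month, broken down by service
aws ce get-cost-and-usage \
        --time-period Start=2018-11-01,End=2018-11-28 \
        --granularity DAILY \
        --metrics UnblendedCost \
        --group-by Type=DIMENSION,Key=SERVICE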

Food for thought: with headless Chrome support now available on Google Cloud Functions, one might be able to set up a serverless workflow that logs into the AWS dashboard and checks the bill for you. Something to try out during free time (if some ingenious soul hasn't already hacked it together).

Billing alerts

Strangely (or perhaps not ;)), AWS doesn't offer a way to set a hard limit on billing, despite the numerous user requests and disturbing incident reports all over the web. Instead, they offer alerts for various billing "levels": you can subscribe for notifications like "bill at x% of the limit" and "limit exceeded", via email or SNS (handy for automation via Lambda!).

My advice: this is a must-have for every AWS account. If we had had one in place, we could have saved thousands of dollars to date.
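If you prefer to script it, a basic billing alarm can be set up via CloudWatch as well (a sketch - billing metrics live in us-east-1, and the SNS topic ARN and the $100 threshold below are placeholders):

# requires "Receive Billing Alerts" to be enabled in the account's billing preferences
aws cloudwatch put-metric-alarm --region us-east-1 \
        --alarm-name monthly-bill-over-100-usd \
        --namespace AWS/Billing --metric-name EstimatedCharges \
        --dimensions Name=Currency,Value=USD \
        --statistic Maximum --period 21600 --evaluation-periods 1 \
        --threshold 100 --comparison-operator GreaterThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts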

Organizational accounts

If you want to delegate AWS access to third parties (testing teams, contract-basis devs, demo users etc.), it might be a good idea to create a sub-account by converting your root account into an AWS organization with consolidated billing enabled.

(While it is possible to do almost the same using an IAM user, it will not provide resource isolation; everything would be stuffed in the same account, and painstakingly complex IAM policies may be required to isolate entities across users.)
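From the CLI, the basic flow looks roughly like this (a sketch; the email and account name are placeholders, and the new account's email must be unique across AWS):

# turn the current (paying) account into an organization, with all features enabled
aws organizations create-organization --feature-set ALL

# spin up a member (sub) account under consolidated billing
aws organizations create-account \
        --email dev-team@example.com --account-name "dev-sandbox"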

Our CEO and colleague Asankha has written about this quite comprehensively so I'm gonna stop at that.

And finally: Monitor. Monitor. Monitor.

No need to emphasize this - my endless ramblings above should have already conveyed its importance.

So, good luck with that!
