Tuesday, May 28, 2019

AWS Lambda Event Source Mappings: bringing your triggers to order from chaos

Event-driven: it's the new style. (ShutterStock)

Recently we introduced two new AWS Lambda event sources (trigger types) for your serverless projects on Sigma cloud IDE: SQS queues and DynamoDB Streams. (Yup, AWS introduced them months ago; but we're still a tiny team, caught up in a thousand and one other things as well!)

While developing support for these triggers, I noticed a common (and yeah, pretty obvious) pattern in Lambda event source trigger configurations, which I felt was worth sharing.

Why AWS Lambda triggers are messed up

Lambda - or rather AWS - has a rather peculiar and disorganized trigger architecture, to put it lightly. For different trigger types, you have to set up configurations all over the place: targets for CloudWatch Events rules, integrations for API Gateway endpoints, notification configurations for S3 bucket events, and the like. Quite a mess, compared to other platforms like GCP where you can configure everything in one place: the "trigger" config of the actual target function.

Configs. Configs. All over the place.

If you have used infrastructure-as-code (IaC) services like CloudFormation (CF) or Terraform (TF), you would already know what I mean. You need mappings, linkages, permissions and other bells and whistles all over the place to get even a simple HTTP URL working. (SAM does simplify this a bit, but it comes with its own set of limitations - and we have tried our best to avoid such complexities in our Sigma IDE.)

Maybe this is to be expected, given the diversity of services offered by AWS, and their timeline (Lambda, after all, is just a four-year-old kid). AWS surely had to pull some crazy hacks to support triggering Lambdas from so many diverse services; hence the confusing, scattered configurations.

Event Source Mappings: light at the end of the tunnel?

Event Source Mappings: light at the end of the tunnel (ShutterStock)

Luckily, the more recently introduced stream-type triggers (SQS queues, DynamoDB Streams, Kinesis streams) follow a common pattern:

  • an event source mapping, which tells the Lambda service which source to read from and which function to invoke; and
  • the permissions that let the two talk to each other: the function's execution role being allowed to read from (and acknowledge) the source, and the source service being allowed to invoke the function.

This way, you know exactly where you should configure the trigger, and how you should allow the Lambda to consume the event stream.

No more jumping around.
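
In plain-CLI terms, the whole trigger boils down to something like this rough sketch (the account ID and names are placeholders), provided the function's execution role already carries the necessary queue-read permissions:

aws lambda create-event-source-mapping \
    --function-name tikjs \
    --event-source-arn arn:aws:sqs:us-east-1:123456789012:q \
    --batch-size 10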

This is quite convenient when you are working with an IaC tool like CloudFormation:

{
  ...

    // event source (SQS queue)

    "sqsq": {
      "Type": "AWS::SQS::Queue",
      "Properties": {
        "DelaySeconds": 0,
        "MaximumMessageSize": 262144,
        "MessageRetentionPeriod": 345600,
        "QueueName": "q",
        "ReceiveMessageWaitTimeSeconds": 0,
        "VisibilityTimeout": 30
      }
    },

    // event target (Lambda function)

    "tikjs": {
      "Type": "AWS::Lambda::Function",
      "Properties": {
        "FunctionName": "tikjs",
        "Description": "Invokes functions defined in \
tik/js.js in project tik. Generated by Sigma.",
        ...
      }
    },

    // function execution role, which allows the Lambda service
    // to read from SQS and delete consumed messages

    "tikjsExecutionRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "ManagedPolicyArns": [
          "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
        ],
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Action": [
                "sts:AssumeRole"
              ],
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "lambda.amazonaws.com"
                ]
              }
            }
          ]
        },
        "Policies": [
          {
            "PolicyName": "tikjsPolicy",
            "PolicyDocument": {
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": [
                    "sqs:GetQueueAttributes",
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage"
                  ],
                  "Resource": {
                    "Fn::GetAtt": [
                      "sqsq",
                      "Arn"
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    },

    // the actual event source mapping (SQS queue -> Lambda)

    "sqsqTriggertikjs0": {
      "Type": "AWS::Lambda::EventSourceMapping",
      "Properties": {
        "BatchSize": "10",
        "EventSourceArn": {
          "Fn::GetAtt": [
            "sqsq",
            "Arn"
          ]
        },
        "FunctionName": {
          "Ref": "tikjs"
        }
      }
    },

    // grants permission for SQS service to invoke the Lambda
    // when messages are available in our queue

    "sqsqPermissiontikjs": {
      "Type": "AWS::Lambda::Permission",
      "Properties": {
        "Action": "lambda:InvokeFunction",
        "FunctionName": {
          "Ref": "tikjs"
        },
        "SourceArn": {
          "Fn::GetAtt": [
            "sqsq",
            "Arn"
          ]
        },
        "Principal": "sqs.amazonaws.com"
      }
    }

  ...
}

(In fact, that was the whole reason/purpose of this post.)

Tip: You do not need to worry about this whole IaC/CloudFormation thingy - or writing lengthy JSON/YAML - if you go with a fully automated resource management tool like the SLAppForge Sigma serverless cloud IDE.

But... are Event Source Mappings ready for the big game?

Ready for the Big Game? (Wikipedia)

They sure look promising, but it seems event source mappings need a bit more maturity before we can use them in fully automated, production environments.

You cannot update an event source mapping via IaC.

For example, even more than four years after their inception, event source mappings cannot be updated after being created via an IaC tool like CloudFormation or Serverless Framework. This causes serious trouble: if you change the mapping configuration, you need to manually delete the old mapping and deploy the new one. Get it right the first time, or you'll have to run through a hectic manual cleanup to get the whole thing working again. So much for automation!

The event source arn (aaa) and function (bbb) provided mapping already exists. Please update or delete the existing mapping...
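
Until that changes, the manual cleanup boils down to hunting down the conflicting mapping and removing it yourself before redeploying - a rough sketch with the plain CLI (the function name is a placeholder):

# find the UUID of the existing mapping for the function...
aws lambda list-event-source-mappings --function-name tikjs \
    --query 'EventSourceMappings[*].[UUID,EventSourceArn]' --output text

# ...then delete it, so the next deployment can recreate it cleanly
aws lambda delete-event-source-mapping --uuid <uuid-from-above>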

Polling? Sounds old-school.

There are other, less-evident problems as well; for one, event source mappings are driven by polling mechanisms. If your source is an SQS queue, the Lambda service will keep polling it until the next message arrives. While this is fully out of your hands, it does mean that you pay for the polling. Plus, as a dev, I don't feel that polling exactly fits into the event-driven, serverless paradigm. Sure, everything boils down to polling in the end, but still...

In closing: why not just try out event source mappings?

Event Source Mappings FTW! (AWS docs)

Ready or not, looks like event source mappings are here to stay. With the growing popularity of data streaming (Kinesis), queue-driven distributed processing and coordination (SQS) and event ledgers (DynamoDB Streams), they will become ever more popular as time passes.

You can try out how event source mappings work, via numerous means: the AWS console, aws-cli, CloudFormation, Serverless Framework, and the easy-as-pie graphical IDE SLAppForge Sigma.

Easily manage your event source mappings - with just a drag-n-drop!

In Sigma IDE you can simply drag-n-drop an event source (SQS queue, DynamoDB table or Kinesis stream) on to the event variable of your Lambda function code. Sigma will pop-up a dialog with available mapping configurations, so you can easily configure the source mapping behavior. You can even configure an entirely new source (queue, table or stream) instead of using an existing one, right there within the pop-up.

Sigma is the shiny new thing for serverless.

When deployed, Sigma will auto-generate all necessary configurations and permissions for your new event source, and publish them to AWS for you. It's all fully managed, fully automated and fully transparent.

Enough talk. Let's start!

Greasemonkey Script Recovery: getting the Great Monkey back on its feet

GreaseMonkey - the Greatest Script Monkey of All Time! (techspot.com)

You just lost all your Greasemonkey scripts? And the complete control of your browser, painfully built over several days of work? Looking all over the place for a Greasemonkey recovery technique posted by some tech magician?

Don't worry, I have been there too.

Damn you, BSOD! You Ate My Monkey! 😱

My OS - Windows 10 - crashed (that infamous ":(" BSOD) and restarted itself, and when I was back, I could see only a blank square where the Great Monkey used to be:

The Monkey was here.

My natural instinct was to think that my copy of GM had outlived its max life (I have a notorious track record of running with obsolete extensions); so I just went ahead and updated the extension (all the way from April 2018).

And then:

The Monkey says: '   '

Well said. That's all I needed to hear.

Since I'm running Firefox Nightly, I didn't have the luxury of putting the blame on someone else:

Firefox Nightly: the 'insane' build from last night! (Twitter)

Nightly is experimental and may be unstable. - Dear ol' Mozilla

No Greasemonkey. Helpless. Powerless. In a world of criminals... 🐱‍💻

To champion the cause of the innocent, the helpless, the powerless, in a world of criminals who operate above the law. (GreenWiseDesign)

So, Knight Rider quotes aside, what are my options? Rewrite all my life's work?

Naah. Not yet.

The scripts are probably still there; for whatever reason, GM cannot read them now.

Hunting for the lost GM scripts

Earlier, GM scripts were saved in plaintext somewhere inside the Firefox user profile directory; but not anymore, it seems.

So I picked up the 61413404gyreekansoem.sqlite file from {FireFox profile home}/storage/default/moz-extension+++d2345083-0b49-4052-ac13-2cefd89be9b5/idb/, opened it quick-n-dirty in a text editor, and searched for some of my old script fragments.

Phew, they are there! But they seem to be fragmented; possibly because it's SQLite.

sqlite3.exe to the rescue

So I fired up Android SDK's SQLite tool (I didn't have anything else lying around), and loaded the file:

sqlite> .open 'C:\Users\janaka\AppData\Roaming\Mozilla\Firefox\Profiles\a4And0w1D.default\storage\default\moz-extension+++d2345083-0b49-4052-ac13-2cefd89be9b5\idb\61413404gyreekansoem.sqlite'

sqlite> .dump

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE database( name TEXT PRIMARY KEY, ...
...
INSERT INTO object_data VALUES(1,X'30313137653167...00001300ffff');
INSERT INTO object_data VALUES(1,X'30313965386367...00001300ffff');
...

Looks good! So why can't GM read this?

Probably their data model changed so dramatically that my DB file no longer makes any sense?

Let's reboot Greasemonkey ⚡

The FF browser console was also spewing out several errors at load time; I didn't bother delving into the sources, but it looked like relaunching GM with the existing DB was a no-go.

So I moved the DB file to a safe location, deleted the whole extension data directory (C:\Users\janaka\AppData\Roaming\Mozilla\Firefox\Profiles\a4And0w1D.default\storage\default\moz-extension+++d2345083-0b49-4052-ac13-2cefd89be9b5\) and restarted FF.

(Correction: I did back up the whole extension data directory, but it is probably not necessary.)

Nice, the Great Monkey is back!

GreaseMonkey clean startup: at least The Monkey is back - but... but...

FF also opened up the Greasemonkey startup page indicating that it treated the whole thing as a clean install. The extension data directory was also back, with a fresh DB file.

Greasemonkey script migration: first attempt

So, the million-dollar question: how the heck am I gonna load my scripts back to the new DB?

First, let's see if the schemas have actually changed:

Old DB:

CREATE TABLE database( name TEXT PRIMARY KEY, origin TEXT NOT NULL, version INTEGER NOT NULL DEFAULT 0, last_vacuum_time INTEGER NOT NULL DEFAULT 0, last_analyze_time INTEGER NOT NULL DEFAULT 0, last_vacuum_size INTEGER NOT NULL DEFAULT 0) WITHOUT ROWID;
CREATE TABLE object_store( id INTEGER PRIMARY KEY, auto_increment INTEGER NOT NULL DEFAULT 0, name TEXT NOT NULL, key_path TEXT);
CREATE TABLE object_store_index( id INTEGER PRIMARY KEY, object_store_id INTEGER NOT NULL, name TEXT NOT NULL, key_path TEXT NOT NULL, unique_index INTEGER NOT NULL, multientry INTEGER NOT NULL, locale TEXT, is_auto_locale BOOLEAN NOT NULL, FOREIGN KEY (object_store_id) REFERENCES object_store(id) );

// skimmin...

CREATE INDEX index_data_value_locale_index ON index_data (index_id, value_locale, object_data_key, value) WHERE value_locale IS NOT NULL;
CREATE TABLE unique_index_data( index_id INTEGER NOT NULL, value BLOB NOT NULL, object_store_id INTEGER NOT NULL, object_data_key BLOB NOT NULL, value_locale BLOB, PRIMARY KEY (index_id, value), FOREIGN KEY (index_id) REFERENCES object_store_index(id) , FOREIGN KEY (object_store_id, object_data_key) REFERENCES object_data(object_store_id, key) ) WITHOUT ROWID;

// skim...

CREATE TRIGGER object_data_delete_trigger AFTER DELETE ON object_data FOR EACH ROW WHEN OLD.file_ids IS NOT NULL BEGIN SELECT update_refcount(OLD.file_ids, NULL); END;
CREATE TRIGGER file_update_trigger AFTER UPDATE ON file FOR EACH ROW WHEN NEW.refcount = 0 BEGIN DELETE FROM file WHERE id = OLD.id; END;

New DB:

CREATE TABLE database( name TEXT PRIMARY KEY, origin TEXT NOT NULL, version INTEGER NOT NULL DEFAULT 0, last_vacuum_time INTEGER NOT NULL DEFAULT 0, last_analyze_time INTEGER NOT NULL DEFAULT 0, last_vacuum_size INTEGER NOT NULL DEFAULT 0) WITHOUT ROWID;
CREATE TABLE object_store( id INTEGER PRIMARY KEY, auto_increment INTEGER NOT NULL DEFAULT 0, name TEXT NOT NULL, key_path TEXT);

// skim...

CREATE TABLE unique_index_data( index_id INTEGER NOT NULL, value BLOB NOT NULL, object_store_id INTEGER NOT NULL, object_data_key BLOB NOT NULL, value_locale BLOB, PRIMARY KEY (index_id, value), FOREIGN KEY (index_id) REFERENCES object_store_index(id) , FOREIGN KEY (object_store_id, object_data_key) REFERENCES object_data(object_store_id, key) ) WITHOUT ROWID;

// skim...

CREATE TRIGGER object_data_delete_trigger AFTER DELETE ON object_data FOR EACH ROW WHEN OLD.file_ids IS NOT NULL BEGIN SELECT update_refcount(OLD.file_ids, NULL); END;
CREATE TRIGGER file_update_trigger AFTER UPDATE ON file FOR EACH ROW WHEN NEW.refcount = 0 BEGIN DELETE FROM file WHERE id = OLD.id; END;

Well, looks like both are identical!

So theoretically, I should be able to dump data from the old DB (as SQL) and insert 'em to the new DB.

no such function: update_refcount? WTF?

Old DB:

sqlite> .dump

PRAGMA foreign_keys=OFF;
...
INSERT INTO object_data VALUES(1,X'3031...ffff');
...

New DB:

sqlite> INSERT INTO object_data VALUES(1,X'3031...ffff');

Error: no such function: update_refcount

Bummer.

Is FF actually using non-standard SQLite? I didn't wait to find out.

GreaseMonkey, hang in there buddy... I WILL fix ya! (me.me)

Greasemonkey Import/Export, to the rescue!

If you haven't seen or used it before, GM comes with two handy Export a backup and Import a backup options; they basically allow you to export all your scripts as a zip file, and later import them back.

And, from earlier, you would have guessed that our "barrier to entry" is most probably the SQLite trigger statement for AFTER INSERT ON object_data:

CREATE TRIGGER object_data_insert_trigger AFTER INSERT ON object_data FOR EACH ROW WHEN NEW.file_ids IS NOT NULL BEGIN SELECT update_refcount(NULL, NEW.file_ids); END;

So, what if I:

  • drop that trigger,
  • import the old data dump into the new DB,
  • fire up the fox,
  • use Greasemonkey's Export a backup feature to export (download) the recovered scripts,
  • shut down the fox,
  • delete the new DB,
  • restart the fox (so it creates a fresh DB), and
  • import the zip file back-up?

Yay! that worked! 🎉

The second step caused me a bit of confusion. I was copying the dumped data right off the SQLite screen and pasting it into the SQLite console of the new DB; but it appeared that some lines were too long for the console input, and those inserts were causing cascading failures. Hard to figure out when you copy a huge chunk with several inserts, paste and run them, and all - except the first few - fail miserably.

So I created a SQL file out of the dumped data, and imported the file via the .read command:

Old DB:

sqlite> .output 'C:\Windows\Temp\dump.sql'
sqlite> .dump
sqlite> .quit

Cleanup:

Now, open dump.sql and remove the DDL statements; CREATE TABLE, CREATE INDEX, CREATE TRIGGER, etc.

New DB:

sqlite> DROP TRIGGER object_data_insert_trigger;
DROP TRIGGER object_data_update_trigger;
DROP TRIGGER object_data_delete_trigger;
DROP TRIGGER file_update_trigger;

sqlite> .read 'C:\Windows\Temp\dump.sql'

All else went exactly as planned.

Now, now, not so fast! Calm down... Take a deep breath. It works. (CrazyFunnyPhotos)

Welcome back, O Great Monkey! 🙏

So, my Greasemonkey set up is back online, running as smoothly as ever!

Final words: if you ever hit an issue with GM:

  • don't panic.
  • back up your extension data directory right away.
  • start from scratch (get the monkey running), and
  • carefully feed the monkey (yeah, Greasemonkey) with the old configs again.

Monday, May 20, 2019

How I made AWS CLI 300% faster! [FULL disclosure]

Yeah yeah, it's "highly experimental" and all, but still it's 3 times faster than simply running aws bla bla bla, the "plain" aws-cli.

Full speed ahead - with aws-cli! (sashkin7)

And yes, it won't always be that fast, especially if you run the AWS CLI only about once a fortnight. But it will certainly have a clear impact once you start batching up your AWS CLI calls: maybe routine account checks/cleanups, maybe extracting tons of CloudWatch Metrics records, or maybe a totally different, unheard-of use case.

Whatever it is, I guess it would be useful for the masses some day.

Plus, as one of the authors and maintainers of the world's first serverless IDE, I have certainly had several chances to put it to good use!

The problem: why AWS CLI is "slow" for me

(Let's just call it "CLI", shall we?)

It actually has nothing to do with the CLI itself; rather, it's the fact that each CLI invocation is a completely new program execution cycle.

This means that, for every single command:

  • a fresh Python interpreter has to start up,
  • the awscli and botocore modules - command tables, service models and all - get imported and initialized from scratch, and
  • your config and credential files get re-read and re-parsed.

But, as usual, the highest impact comes via network I/O:

  • the CLI has to create an API client from scratch (the previous one was lost when the command execution completed)
  • since the network connection to AWS is managed by the client, this means that each command creates (and then destroys) a fresh TCP connection to the AWS endpoint; which involves a DNS lookup as well (although later lookups may be served from the system cache)
  • since AWS APIs almost always use SSL, every new connection results in a full SSL handshake (client hello, server hello, server cert, yada yada yada)

Now, assume you have 20 CloudWatch Log Groups to be deleted. Since the Logs API does not offer a bulk deletion option, the cheapest way to do this would be to run a simple shell script - looping aws logs delete-log-group over all groups:

for i in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
    aws logs delete-log-group --log-group-name $i
done

This would run the CLI 20 times (21 to be precise, if you count the initial list API call); meaning that all of the above overhead is incurred on every single run. Clearly a waste of time and resources, since we knew from the start that the same endpoint was going to be invoked in every one of those runs.

Maybe it's just repetition, after all. (MemeGenerator)

Try scaling this up to hundreds or thousands of batched operations; and see where it takes you!

And no, aws-shell does not cut it.

Not yet, at least.

Leaving aside the nice and cozy REPL interface (interactive user prompt), handy autocompletion, syntax coloring and inline docs, aws-shell does not give you any performance advantage over aws-cli. Every command in the shell is executed in a new AWS CLI instance - with parsers, command hierarchies, API specs and, more importantly, API clients getting recreated for every command.

Skeptical? Peek at the aws-shell sources; or better still, fire up Wireshark (or tcpdump if you dare), run a few commands in the shell REPL, and see how each command initializes a fresh SSL channel from scratch.

The proposal: what can we do?

Obviously, the CLI itself cannot do much about this. It's a simple program, and whatever improvements we make won't last beyond the next invocation. The OS would rudely wipe them and start the next CLI with a clean slate; unless we use some spooky (and rather discouraged) memory persistence magic to serialize and reload the CLI's state. Even then, the other OS-level stuff (network sockets etc.) would be gone, and our effort would be pretty much fruitless.

If we are going to make any impactful changes, we need to make the CLI stateful; a long-running process.

The d(a)emon

In the OS world, this usually means setting up a daemon - a background process that waits for and processes events like user commands. (A popular example is MySQL, with its mysql-server daemon and mysql-client packages.)

In our case, we don't want a fully-fledged "managed" daemon - like a system service. For example, there's no point in starting our daemon before we actually start making our CLI calls; also, if our daemon dies, there's no point in having the OS restart it right away, since we cannot recover the lost state anyway.

So we have a simple plan:

  • break the CLI into a "client" and daemon
  • every time we run the CLI,
    • check for the presence of the daemon, and
    • spawn the daemon if it is not already running

This way, if the daemon dies, the next CLI invocation will auto-start it. Nothing to worry, nothing to manage.

Our fast AWS CLI daemon - it's all in a subprocess!

It is easy to handle the daemon spawn without the trouble of maintaining a second program or script: simply use subprocess.Popen to launch another instance of the same program, and instruct it to run the daemon's code path rather than the client's.

Enough talk; show me the code!

Enough talk; let's fight! (KFP, YouTube)

Here you go:

#!/usr/bin/python

import os
import sys
import tempfile
import psutil
import subprocess

rd = tempfile.gettempdir() + "/awsr_rd"
wr = tempfile.gettempdir() + "/awsr_wr"


# client mode: hand the command line over to the daemon via the request pipe,
# then block on the response pipe and print whatever it sends back
def run_client():
	out = open(rd, "w")
	out.write(" ".join(sys.argv))
	out.write("\n")
	out.close()

	inp = open(wr, "r")
	result = inp.read()
	inp.close()

	sys.stdout.write(result)


# daemon mode: keep a single CLIDriver (and its cached API client) alive,
# serving commands read off the request pipe one at a time
def run_daemon():
	from awscli.clidriver import CLIOperationCaller, LOG, create_clidriver, HISTORY_RECORDER

	def patchedInit(self, session):
		self._session = session
		self._client = None

	def patchedInvoke(self, service_name, operation_name, parameters, parsed_globals):
		if self._client is None:
			LOG.debug("Creating new %s client" % service_name)
			self._client = self._session.create_client(
				service_name, region_name=parsed_globals.region,
				endpoint_url=parsed_globals.endpoint_url,
				verify=parsed_globals.verify_ssl)
		client = self._client

		response = self._make_client_call(
			client, operation_name, parameters, parsed_globals)
		self._display_response(operation_name, response, parsed_globals)
		return 0

	CLIOperationCaller.__init__ = patchedInit
	CLIOperationCaller.invoke = patchedInvoke

	driver = create_clidriver()
	while True:
		inp = open(rd, "r")
		args = inp.read()[:-1].split(" ")[1:]
		inp.close()

		if len(args) > 0 and args[0] == "exit":
			sys.exit(0)

		sys.stdout = open(wr, "w")
		rc = driver.main(args)

		HISTORY_RECORDER.record('CLI_RC', rc, 'CLI')
		sys.stdout.close()


if __name__ == "__main__":
	if not os.access(rd, os.R_OK | os.W_OK):
		os.mkfifo(rd)
	if not os.access(wr, os.R_OK | os.W_OK):
		os.mkfifo(wr)

	# fork if awsr daemon is not already running
	ps = psutil.process_iter(attrs=["cmdline"])
	procs = 0
	for p in ps:
		cmd = p.info["cmdline"]
		if cmd and len(cmd) > 1 and cmd[0].endswith("python") and cmd[1] == sys.argv[0]:  # cmdline can be None/empty for inaccessible processes
			procs += 1
	if procs < 2:
		sys.stderr.write("Forking new awsr background process\n")
		with open(os.devnull, 'r+b', 0) as DEVNULL:
			# new instance will see env var, and run itself as daemon
			p = subprocess.Popen(sys.argv, stdin=DEVNULL, stdout=DEVNULL, stderr=DEVNULL, close_fds=True, env={"AWSR_DAEMON": "True"})
			run_client()

	elif os.environ.get("AWSR_DAEMON") == "True":
		run_daemon()
	else:
		run_client()

Yep, just 89 lines of rather primitive code - of course it's also on GitHub, in case you were wondering.

Some statistics - if you're still not buying it

"Lies, damn lies and statistics", they say. But sometimes, statistics can do wonders when you are trying to prove a point.

As you would understand, our new REPL really shines when there are more and more individual invocations (API calls); so that's what we would compare.

S3 API: s3api, not s3

Let's upload some files (via put-object):

date

for file in $(find -type f -name "*.sha1"); do
    aws s3api put-object --acl public-read --body $file --bucket target.bucket.name --key base/path/$file
done

date

  • Bucket region: us-east-1
  • File type: fixed-length checksums
  • File size: 40 bytes each
  • Additional: public-read ACL

Uploading 70 such files via aws s3api put-object takes:

  • 4 minutes 35 seconds
  • 473.5 KB data (319.5 KB downlink + 154 KB uplink)
  • 70 DNS lookups + SSL handshakes (one for each file)

In comparison, uploading 72 files via awsr s3api put-object takes:

  • 1 minute 28 seconds
  • 115.5 KB data (43.5 KB downlink + 72 KB uplink)
  • 1 DNS lookup + SSL handshake for the whole operation

A 320% improvement on latency (or 420%, if you consider bandwidth).

If you feel like it, watch the outputs (stdout) of the two runs - real-time. You would notice how awsr shows a low and consistent latency from the second output onwards; while the plain aws shows almost the same latency between every output pair - apparently because almost everything gets re-initialized for each call.

If you monitor your network interface (say, with Wireshark), you will see the real deal: aws continuously makes DNS queries and SSL handshakes, while awsr just makes one every minute or so.
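
If you don't have Wireshark handy, a rough equivalent on Linux (interface and filter choices are up to you) is:

sudo tcpdump -ni any 'udp port 53 or tcp port 443'

- every burst of port-443 traffic preceded by a fresh DNS query on port 53 is another connection (and SSL handshake) being set up from scratch.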

Counterargument #1: If your files are all in one place or directory hierarchy, you could just use aws s3 cp or aws s3 sync in one go. These will be as performant as awsr, if not more. However in my case, I wanted to pick 'n' choose only a subset of files in the hierarchy; and there was no easy way of doing that with the aws command alone.

Counterargument #2: If you want to upload to multiple buckets, you will have to batch up the calls bucket-wise (us-east-1 first, ap-southeast-2 next, etc.); and kill awsr after each batch - more on that later.

CloudWatch logs

Our serverless IDE Sigma generates quite a lot of CloudWatch Logs - especially when our QA battalion is testing it. To keep things tidy, I prefer to occasionally clean up these logs, via aws logs delete-log-group.

date

for i in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
    echo $i
    aws logs delete-log-group --log-group-name $i
done

date

Cleaning up 172 such log groups on us-east-1, via plain aws, takes:

  • 5 minutes 44 seconds
  • 1.51 MB bandwidth (1133 KB downlink, 381 KB uplink)
  • 173 (1 + 172) DNS lookups + SSL handshakes; one for each log group, plus one for the initial listing

In contrast, deleting 252 groups via our new REPL awsr takes just:

  • 2 minutes 41 seconds
  • 382 KB bandwidth (177 KB downlink, 205 KB uplink)
  • 4 DNS lookups + SSL handshakes (about 1 in each 60 seconds)

This time, a 310% improvement on latency; or 580% on bandwidth.

CloudWatch metrics

I use this script to occasionally check the sizes of our S3 buckets - to track down and remove any garbage; playing the "scavenger" role:

Okay, maybe not that much, but... (CartoonStock)

for bucket in `awsr s3api list-buckets --query 'Buckets[*].Name' --output text`; do
    size=$(awsr cloudwatch get-metric-statistics --namespace AWS/S3 \
        --start-time $(date -d @$((($(date +%s)-86400))) +%F)T00:00:00 --end-time $(date +%F)T00:00:00 \
        --period 86400 --metric-name BucketSizeBytes \
        --dimensions Name=StorageType,Value=StandardStorage Name=BucketName,Value=$bucket \
        --statistics Average --output text --query 'Datapoints[0].Average')
    if [ "$size" = "None" ]; then size=0; fi
    printf "%8.3f  %s\n" $(echo $size/1048576 | bc -l) $bucket
done

Checking 45 buckets via aws (45+1 API calls to the same CloudWatch API endpoint), takes:

94 seconds

Checking 61 buckets (62 API calls) via awsr, takes:

44 seconds

A 288% improvement.

The catch

There are many; more unknowns than knowns, in fact:

  • The REPL depends on serial communication via pipes; so you cannot run things in parallel - e.g. invoke several commands and wait for all of them to complete. (This, however, should not affect any internal parallelizations of aws-cli itself.)
  • awsr may start acting up, if you cancel or terminate an already running command - also a side-effect of using pipes.
  • awsr reuses internal client objects across invocations (sessions), so it is, let's say, "sticky"; it "remembers" - and does not allow you to override - the profile, region etc. across invocations. In order to start working with a new configuration, you should:
    • terminate the existing daemon:
      kill $(ps -ef -C /usr/bin/python | grep -v grep | grep awsr | awk '{print $2}')
    • in case the daemon was in the middle of processing a command when it was brutally massacred, delete the pipes /tmp/awsr_rd and /tmp/awsr_wr
    • run a new awsr with the correct profile (--profile), region (--region) etc.
  • awsr cannot produce interactive output - at least not yet - as it simply reads/writes from/to each pipe exactly once in a single invocation. So commands like ec2 wait and cloudformation deploy will not work as expected.
  • Currently the pipes only capture standard input and standard output; so, unless you initially launched awsr in the current console/tty, you won't be seeing any error messages (written to standard error) being generated by the underlying AWS API call/command.
  • Some extensions like s3 don't seem to benefit from the caching - even when invoked against the same bucket. It needs further investigation. (Luckily, s3api works fine - as we saw earlier.)

Bonus: hands-on AWS CLI fast automation example, FTW!

I run this occasionally to clean up our AWS accounts of old logs and build data. If you are curious, replace the awsr occurrences with aws (and remove the daemon-killing magic), and witness the difference in speed!

Caution: If there are ongoing CodeBuild builds, the last step may keep on looping - possibly even indefinitely, if the build is stuck in BUILD_IN_PROGRESS status. If you run this from a fully automated context, you may need to enhance the script to handle such cases as well.

for p in araProfile meProfile podiProfile thadiProfile ; do
    for r in us-east-1 us-east-2 us-west-1 us-west-2 ca-central-1 eu-west-1 eu-west-2 eu-central-1 \
        ap-northeast-1 ap-northeast-2 ap-southeast-1 ap-southeast-2 sa-east-1 ap-south-1 ; do

        # profile and region changed, so kill any existing daemon before starting
        arg="--profile $p --region $r"
        kill $(ps -ef -C /usr/bin/python | grep -v grep | grep awsr | awk '{print $2}')
        rm /tmp/awsr_rd /tmp/awsr_wr

        # log groups
        for i in $(awsr $arg logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
            echo $i
            awsr $arg logs delete-log-group --log-group-name $i
        done

        # CodeBuild projects
        for i in $(awsr $arg codebuild list-projects --query 'projects[*]' --output text); do
            echo $i
            awsr $arg codebuild delete-project --name $i
        done

        # CodeBuild builds; strangely these don't get deleted when we delete the parent project...
        while true; do
            builds=$(awsr $arg codebuild list-builds --query 'ids[*]' --output text --no-paginate)
            if [[ $builds = "" ]]; then break; fi
            awsr $arg codebuild batch-delete-builds --ids $builds
        done

    done
done

Automation FTW! (CartoonStock)

In closing: so, there it is!

Feel free to install and try out awsr; after all there's just one file, with less than a hundred lines of code!

Although I cannot make any guarantees, I'll try to eventually hunt down and fix the gaping holes and shortcomings; and any other issues that you or I come across along the way.

Over to you, soldier/beta user!

Five quick tips to tame your S3 bill on AWS - once and for all!

Yeah, S3 is awesome.

At least till you get the bill.

Oh boy. Is that for me? (DepositPhotos)

But it could remain awesome (I mean, for the next month onwards) if you have it by its reins.

Throughout our existence - at AdroitLogic, hosting all our integration software distros, client bundles and temporary customer uploads; and at SLAppForge, where S3 is an integral part of hosting the world's first-ever "serverless" serverless IDE (yeah, you read that right) and one of its developers' most-used services - we have come across quite a few "oops" and "aha" moments with S3.

And I'll try to brain-dump a few of them, right here.

Okay, what does it bill for?

No point in discussing anything, without a clear idea of that one.

AWS S3 (Braze Marketing)

For most use cases, S3 bills you for:

  • storage; e.g. $0.03 per GB-month (each gigabyte retained throughout the last month) in the us-east-1 region
  • data transfer (into and out of the original storage region)
  • API calls: listing, writes, reads, etc.

(Well, it's not quite as simple as that, but those three would cover 80% of it.)
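
To put the storage part in perspective: 200 GB of forgotten backups, at that $0.03-per-GB-month rate, quietly adds about $6 to every monthly bill - before any transfer or API call charges.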

Glacier is often good enough.

Unless you have read the docs, you might not be aware that S3 was (and is) a key-value store, meaning it was designed to provide fast reads and writes for frequently-accessed content.

Coincidentally (or perhaps not?), S3's hierarchical storage structure ended up crazily similar to a filesystem, so people (hehe, like ourselves) started using it as file storage - hosting websites, build artifacts, software distributions, and the like.

And backups, of all sorts.

In reality, content like backups is not frequently accessed - in fact never accessed, in most cases - and even less frequently (ideally never) updated.

Moving these backups to the low-cost alternative Amazon Glacier could save you a considerable amount of $$. Maybe right from the start, or maybe once they become obsolete (if you cannot make up your mind to delete them).

AWS Glacier (CloudBerry)

Once "Glaciered", you can temporarily "promote" or "restore" your stuff back to S3 Standard Storage if you happen to need to access them later; however, don't forget that this restoration itself could take a few hours.

And remember that there's no going back - once a glacier, always a glacier! The only way you could get that thing out of Glacier is by deleting it.

Even so, Glacier is still worth it.

CloudFront

S3 is also a popular choice for hosting static websites (another side effect of that FS-like hierarchy). In such cases, you can reduce the overhead on S3, enhance your site loading speed, and shave off a good portion of your bill, by adding a CloudFront distribution in front.

CloudFront as a CDN has many wheels and knobs that can provide quite a bit of advanced website hosting customizations. It integrates with other origins like EC2, MediaStore/MediaPackage, and third-party HTTP servers as well; and you can get a free AWS-issued SSL cert too, via the AWS Certificate Manager!

AWS CloudFront (CloudAcademy)

Yuck, that bucket stinks.

S3's role as a filestore often means that you eventually end up with gigabytes of stale content. Stuff that you once needed but no longer do: temporary uploads, old customer bundles, backed-up website content, diagnostics of previous deployments, and the like.

Lifecycle policies: to the rescue!

Luckily, if you know your content well enough (what you need, and for how long you need it), you can simply set up a lifecycle policy to delete or archive this content as soon as it becomes obsolete.

For example, a policy that deletes all content older than 30 days (from the date of creation) under the /logs path of a bucket would look like this:

{
  "Status": "Enabled",
  "Expiration": {
    "Days": 1
  },
  "Prefix": "logs"
}
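
If you'd rather script it than click through the console, a rule like that can be attached with the CLI too - a minimal sketch with a placeholder bucket name (note the extra Rules wrapper and Filter element that the API expects):

aws s3api put-bucket-lifecycle-configuration --bucket your-bucket-name \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "expire-old-logs",
        "Status": "Enabled",
        "Filter": { "Prefix": "logs" },
        "Expiration": { "Days": 30 }
      }]
    }'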

Another policy could archive files older than 90 days under the /archived subpath to Glacier (or to a cheaper storage class like Standard-IA), instead of deleting them:

{
  "Status": "Enabled",
  "Transitions": [
    {
      "Days": 90,
      "StorageClass": "GLACIER"
    }
  ],
  "Prefix": "archived"
}

With the latter, for example, you can check your bucket manually and move any "no-longer-used-but-should-not-be-deleted-either" stuff under /archived, and it will be automatically archived 90 days later. If you ever want to reuse something from /archived, just move it back out to remove it from the archival schedule.

Free (as in beer) bulk operations!

Lifecycle policies don't cost you anything extra - except for the actual S3 operations, same as when using the API. Just set one up, sit back and relax while S3 does the work for you. Handy when you have a huge bucket or subpath to move or get rid of, but doing it via the S3 console or CLI is going to take ages.

In fact this happened to us once, while we were issuing Free demo AWS keys (yay!) for trying out our serverless IDE Sigma. Some idiot had set up request logging for a certain S3 bucket; and by the time we found out, the destination (logs) bucket had 3+ GB of logs, amounting to millions of records. Any attempt to delete the bucket via the S3 console or CLI would drag on for hours; as S3 requires you to delete every single object before the bucket can be removed. So I just set up a lifecycle config with an empty Prefix, to get rid of everything; free of charge (DELETE API calls are free; thanks Amazon!) and no need to keep my computer running for hours on end!

But wait, there's more to that story...

Watch out! That bucket is versioned!

Bucket versioning (AWS docs)

I checked that logs bucket a month later (when I noticed our S3 bill was the same as the last month); and guess what - all that garbage was still sitting there!

Investigating further, we found that our idiot had also enabled versioning for the bucket, so deletion was just an illusion. Under versioning, S3 just puts up a delete marker without really deleting the object content; obviously so that the deleted version can be recovered later. And you keep on getting charged for storage, just like before.

And things just got more complicated!

By now, all log records in that darn bucket had delete markers, so manual deletion was even more difficult; you need to retrieve the version ID for each delete marker and explicitly ask S3 to delete each of them as well!

Luckily, S3 lifecycle rules offer a way to discard expired delete markers - so setting up a new lifecycle config would do the trick (free of charge again):
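
(A minimal sketch of such a rule - the one-day retention below is just an illustration; it expires noncurrent object versions, after which S3 discards the orphaned delete markers on its own.)

{
  "Status": "Enabled",
  "Prefix": "",
  "Expiration": {
    "ExpiredObjectDeleteMarker": true
  },
  "NoncurrentVersionExpiration": {
    "NoncurrentDays": 1
  }
}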

So remember: if your bucket is versioned, be aware that whatever you do, the original version would be left behind; and only a DeleteObject API call with VersionId can help you get rid of that version - or a lifecycle rule, if there are too many of them.
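
For a handful of objects you can do that by hand (placeholder names again):

aws s3api list-object-versions --bucket your-bucket-name --prefix logs/ \
    --query 'Versions[*].[Key,VersionId]' --output text

aws s3api delete-object --bucket your-bucket-name --key logs/somefile --version-id <version-id-from-above>

- but for millions of records, a lifecycle rule like the one above is the only sane option.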

Checking your buckets without hurting your bill

S3 is quite dumb - no offence meant - in some respects; or am I stepping on AWS's clever and obscure billing tactics?

Unless you're hosting lots and lots of frequently-accessed content - such as customer-facing stuff (software bundles, downloadable documents etc.) - your S3 bill will mostly be made up of storage charges, as opposed to API calls (reads/writes etc.).

The catch is that, if you try to find the sizes of your S3 buckets, AWS will again charge you for the API calls involved. Not to put the blame on them, but it doesn't make much sense - having to spend more in order to assess your spending?

Schrodinger's Cat - go read up your Physics/Chemistry 101!

ListBucket: the saviour/culprit

Given that the ListObjects API is limited to 1000 entries per run, this "size-checking" cost can be significant for buckets with several million entries - not uncommon if, for example, you have request logging or CloudTrail trails enabled.
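
A quick back-of-the-envelope: at roughly $0.005 per 1,000 LIST requests, sizing up a 10-million-object bucket takes 10,000 ListObjects calls - about 5 cents per check, per bucket - every single time you (or your monitoring script) ask.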

The s3 ls --summarize command of aws-cli, the s3cmd du command, and even the Get Size operation of the native console all use ListObjects in the background; so you run the risk of spending a few cents - maybe even a few dollars - just for checking the size of your bucket.

The GUI has it, so...

The S3 console has a graphical view of the historical size of each bucket (volume and object count). This is good enough for manual checks, but not much use when you have several dozen buckets, or when you want to automate things.

It comes from CloudWatch!

S3 automatically reports daily stats (BucketSizeBytes and NumberOfObjects) to CloudWatch Metrics, from where you can query them like so:

aws cloudwatch get-metric-statistics \
    --namespace AWS/S3 \
    --start-time 2019-04-01T00:00:00 \
    --end-time 2019-05-01T00:00:00 \
    --period 86400 \
    --metric-name BucketSizeBytes \
    --dimensions Name=StorageType,Value=StandardStorage Name=BucketName,Value={your-bucket-name-here} \
    --statistics Average

In fact the S3 Console also uses CloudWatch Metrics for those fancy graphs; and now, with the raw data in hand, you can do much, much more!

A megabyte-scale usage summary of all your buckets, perhaps?

for bucket in `aws s3api list-buckets --query 'Buckets[*].Name' --output text`; do
    size=$(aws cloudwatch get-metric-statistics \
        --namespace AWS/S3 \
        --start-time $(date -d @$((($(date +%s)-86400))) +%F)T00:00:00 \
        --end-time $(date +%F)T00:00:00 \
        --period 86400 \
        --metric-name BucketSizeBytes --dimensions Name=StorageType,Value=StandardStorage Name=BucketName,Value=$bucket \
        --statistics Average \
        --output text --query 'Datapoints[0].Average')
    if [ "$size" = "None" ]; then size=0; fi
    printf "%8.3f  %s\n" $(echo $size/1048576 | bc -l) $bucket
done

AWS CloudWatch (LogicMonitor)

Empty buckets don't report anything, hence the special logic for None.

Data transfer counts!

S3 also charges you for actual data transfer out of the original AWS region where the bucket is located.

Host compute in the same region

If you intend to read from/write to S3 from within AWS (e.g. EC2 or Lambda), try to keep the S3 bucket and the compute instance in the same AWS region. S3 does not charge for data transfers within the same region, so you can save bandwidth costs big time. (This applies to some other AWS services as well.)

Matchmaking with CloudFront

CloudFront is global, and it has to pull updates from S3 periodically. Co-locating your S3 bucket with the region where most of your client traffic is observed could reduce cross-region transfer costs, and result in good savings on both ends.

Gzip, and others from the bag of tricks

User agents (a fancy word for browsers and other HTTP clients) accessing your content are usually aware of content encoding. You can gain good savings - as well as speed and performance improvements - by compressing your S3 artifacts (if they are not already compressed).

When uploading compressed content, remember to specify the Content-Encoding (via the --content-encoding flag) so that the caller will know it is compressed:

aws s3 cp --acl public-read --content-encoding gzip {your-local-file} s3://{your-s3-bucket}/upload/path

If you're using CloudFront or another CDN in front of S3, it will be able to compress some content for you - but generally only the well-known types (.js, .css etc.). A good example is the .js.map and .css.map files commonly found in debug-mode webapps: while CloudFront would compress the .js, it would serve the .js.map file uncompressed. By storing the .js.map file compressed on the S3 side, storage and transfer overhead can drop dramatically - often by 5-10 times.
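
For instance (file, path and bucket names below are placeholders), you could pre-compress the map file and upload the gzipped copy under the original key:

gzip -9 app.js.map    # produces app.js.map.gz in place of the original

aws s3 cp --acl public-read --content-encoding gzip \
    --content-type application/json \
    app.js.map.gz s3://your-s3-bucket/static/app.js.map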

gzip FTW! (ICanHasCheezeBurger)

In closing

In the end, it all boils down to identifying hot spots in your S3 bill - whether storage, API calls or transfers, and on which buckets. Once you have the facts, you can employ your S3-fu to chop down the bill.

And the results would often be more impressive than you thought.