Yeah yeah, it's "highly experimental" and all, but still it's 3 times faster than simply running aws bla bla bla, the "plain" aws-cli.
And yes, it won't always be that fast, especially if you run AWS CLI only about once a fortnight. But it will certainly have a clear impact once you start batching up your AWS CLI calls; maybe routine account checks/cleanups, maybe extracting tons of CloudWatch Metrics records; or maybe a totally different, unheard-of use case.
Whatever it is, I guess it would be useful for the masses some day.
Plus, as one of the authors and maintainers of the world's first serverless IDE, I have certainly had several chances to put it to good use!
The problem: why AWS CLI is "slow" for me
(Let's just call it "CLI", shall we?)
It's actually nothing to do with the CLI itself; rather it's the fact that each CLI invocation is a completely new program execution cycle.
This means:
- python (and ultimately the OS) has to load the binaries, configs, boto API definitions and so forth;
- the CLI has to initialize itself: load all supported command definitions, prepare parsers, generate API client classes and so forth
But, as usual, the highest impact comes via network I/O:
- the CLI has to create an API client from scratch (the previous one was lost when the command execution completed)
- since the network connection to AWS is managed by the client, this means that each command creates (and then destroys) a fresh TCP connection to the AWS endpoint; which involves a DNS lookup as well (although later lookups may be served from the system cache)
- since AWS APIs almost always use SSL, every new connection results in a full SSL handshake (client hello, server hello, server cert, yada yada yada)
Now, assume you have 20 CloudWatch Log Groups to be deleted. Since the Logs API does not offer a bulk deletion option, the cheapest way to do this would be to run a simple shell script - looping aws logs delete-log-group over all groups:

for i in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
    aws logs delete-log-group --log-group-name $i
done

This would run the CLI 20 times (21 to be precise, if you count the initial list API call); meaning that all of the above will run 20 times. Clearly a waste of time and resources, since we were quite clear that the same endpoint was going to be invoked in all those runs.
Try scaling this up to hundreds or thousands of batched operations; and see where it takes you!
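For contrast, here is a rough sketch of the same cleanup done from a single long-running Python process, where one client object serves every call. It uses boto3 rather than the CLI, which is an assumption made purely for illustration (awsr itself does not work this way); the function name delete_all_log_groups is hypothetical. The client is passed in, so the only real API calls involved are get_paginator, paginate and delete_log_group.

```python
def delete_all_log_groups(logs_client):
    """Delete every log group through one client object.

    A single client typically keeps its connection open between calls,
    so the DNS lookup and SSL handshake are not repeated per log group.
    """
    # Collect the names first, so nothing is deleted while still paginating.
    names = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            names.append(group["logGroupName"])

    for name in names:
        logs_client.delete_log_group(logGroupName=name)
    return names


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   delete_all_log_groups(boto3.client("logs", region_name="us-east-1"))
```

This is the behaviour we would like the CLI to have across invocations: pay the setup cost once, then reuse the client.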
And no, aws-shell does not cut it.
Not yet, at least.
Leaving aside the nice and cozy REPL interface (interactive user prompt), handy autocompletion, syntax coloring and inline docs, aws-shell does not give you any performance advantage over aws-cli. Every command in the shell is executed in a new AWS CLI instance - with parsers, command hierarchies, API specs and, more importantly, API clients getting recreated for every command.
Skeptical? Peek at the aws-shell sources; or better still, fire up Wireshark (or tcpdump if you dare), run a few commands in the shell REPL, and see how each command initializes a fresh SSL channel from scratch.
The proposal: what can we do?
Obviously, the CLI itself can do pretty much nothing about it. It's a simple program, and whatever improvements we make won't last until the next invocation. The OS would rudely wipe them and start the next CLI with a clean slate; unless we use some spooky (and rather discouraged) memory persistence magic to serialize and reload the CLI's state. Even then, the other OS-level stuff (network sockets etc.) will be gone, and our effort would be pretty much fruitless.
If we are going to make any impactful changes, we need to make the CLI stateful; a long-running process.
The d(a)emon
In the OS world, this usually means setting up a daemon - a background process that waits for and processes events like user commands. (A popular example is MySQL, with its mysql-server daemon and mysql-client packages.)
In our case, we don't want a fully-fledged "managed" daemon - like a system service. For example, there's no point in starting our daemon before we actually start making our CLI calls; also, if our daemon dies, there's no point in restarting it right away, since we cannot recover the lost state anyway.
So we have a simple plan:
- break the CLI into a "client" and daemon
- every time we run the CLI,
- check for the presence of the daemon, and
- spawn the daemon if it is not already running
This way, if the daemon dies, the next CLI invocation will auto-start it. Nothing to worry about, nothing to manage.
Our fast AWS CLI daemon - it's all in a subprocess!
It is easy to handle the daemon spawn without having the trouble of maintaining a second program or script; simply use subprocess.Popen to launch another instance of the program, and instruct it to run the daemon's code path, rather than the client's.
Enough talk; show me the code!
Here you go:
#!/usr/bin/python

import os
import sys
import tempfile
import psutil
import subprocess

rd = tempfile.gettempdir() + "/awsr_rd"
wr = tempfile.gettempdir() + "/awsr_wr"


def run_client():
    out = open(rd, "w")
    out.write(" ".join(sys.argv))
    out.write("\n")
    out.close()

    inp = open(wr, "r")
    result = inp.read()
    inp.close()
    sys.stdout.write(result)


def run_daemon():
    from awscli.clidriver import CLIOperationCaller, LOG, create_clidriver, HISTORY_RECORDER

    def patchedInit(self, session):
        self._session = session
        self._client = None

    def patchedInvoke(self, service_name, operation_name, parameters, parsed_globals):
        if self._client is None:
            LOG.debug("Creating new %s client" % service_name)
            self._client = self._session.create_client(
                service_name, region_name=parsed_globals.region,
                endpoint_url=parsed_globals.endpoint_url,
                verify=parsed_globals.verify_ssl)
        client = self._client

        response = self._make_client_call(
            client, operation_name, parameters, parsed_globals)
        self._display_response(operation_name, response, parsed_globals)
        return 0

    CLIOperationCaller.__init__ = patchedInit
    CLIOperationCaller.invoke = patchedInvoke

    driver = create_clidriver()
    while True:
        inp = open(rd, "r")
        args = inp.read()[:-1].split(" ")[1:]
        inp.close()

        if len(args) > 0 and args[0] == "exit":
            sys.exit(0)

        sys.stdout = open(wr, "w")
        rc = driver.main(args)

        HISTORY_RECORDER.record('CLI_RC', rc, 'CLI')
        sys.stdout.close()


if __name__ == "__main__":
    if not os.access(rd, os.R_OK | os.W_OK):
        os.mkfifo(rd)
    if not os.access(wr, os.R_OK | os.W_OK):
        os.mkfifo(wr)

    # fork if awsr daemon is not already running
    ps = psutil.process_iter(attrs=["cmdline"])
    procs = 0
    for p in ps:
        cmd = p.info["cmdline"]
        if len(cmd) > 1 and cmd[0].endswith("python") and cmd[1] == sys.argv[0]:
            procs += 1
    if procs < 2:
        sys.stderr.write("Forking new awsr background process\n")
        with open(os.devnull, 'r+b', 0) as DEVNULL:
            # new instance will see env var, and run itself as daemon
            p = subprocess.Popen(sys.argv, stdin=DEVNULL, stdout=DEVNULL,
                                 stderr=DEVNULL, close_fds=True, env={"AWSR_DAEMON": "True"})
        run_client()

    elif os.environ.get("AWSR_DAEMON") == "True":
        run_daemon()
    else:
        run_client()

Yep, just 89 lines of rather primitive code - of course it's also on GitHub, in case you were wondering.
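If the two-pipe handshake between client and daemon is hard to picture, the following stripped-down sketch isolates it. It is only an illustration, not part of awsr: the names serve_once, call and demo are made up, the "daemon" is just a thread, and the handler simply echoes the arguments back instead of running the CLI. It needs a POSIX system, since it relies on os.mkfifo.

```python
import os
import tempfile
import threading


def serve_once(rd, wr, handler):
    """Daemon side: read one request from rd, write one reply to wr."""
    # open() blocks here until a client opens the pipe for writing.
    with open(rd, "r") as inp:
        args = inp.read().strip().split(" ")
    # ...and blocks here until the client opens the reply pipe for reading.
    with open(wr, "w") as out:
        out.write(handler(args))


def call(rd, wr, args):
    """Client side: send one request, then wait for the whole reply."""
    with open(rd, "w") as out:
        out.write(" ".join(args) + "\n")
    with open(wr, "r") as inp:
        return inp.read()


def demo():
    workdir = tempfile.mkdtemp()
    rd = os.path.join(workdir, "demo_rd")
    wr = os.path.join(workdir, "demo_wr")
    os.mkfifo(rd)
    os.mkfifo(wr)
    try:
        server = threading.Thread(
            target=serve_once,
            args=(rd, wr, lambda args: "got: " + ",".join(args)))
        server.start()
        reply = call(rd, wr, ["logs", "describe-log-groups"])
        server.join()
        return reply
    finally:
        os.remove(rd)
        os.remove(wr)
```

Each side reads or writes a pipe exactly once per request, which is why everything is strictly serial.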
Some statistics - if you're still not buying it
"Lies, damn lies and statistics", they say. But sometimes, statistics can do wonders when you are trying to prove a point.
As you would understand, our new REPL really shines when there are more and more individual invocations (API calls); so that's what we would compare.
S3 API: s3api, not s3
Let's upload some files (via put-object):

date
for file in $(find -type f -name "*.sha1"); do
    aws s3api put-object --acl public-read --body $file --bucket target.bucket.name --key base/path/
done
date

- Bucket region: us-east-1
- File type: fixed-length checksums
- File size: 40 bytes each
- Additional: public-read ACL
Uploading 70 such files via aws s3api put-object takes:
- 4 minutes 35 seconds
- 473.5 KB data (319.5 KB downlink + 154 KB uplink)
- 70 DNS lookups + SSL handshakes (one for each file)
In comparison, uploading 72 files via awsr s3api put-object takes:
- 1 minute 28 seconds
- 115.5 KB data (43.5 KB downlink + 72 KB uplink)
- 1 DNS lookup + SSL handshake for the whole operation
Per file, that works out to roughly 3.2 times lower latency (about 3.9 seconds versus 1.2 seconds per upload); or 4.2 times less bandwidth.
If you feel like it, watch the outputs (stdout) of the two runs - real-time. You would notice how awsr shows a low and consistent latency from the second output onwards; while the plain aws shows almost the same latency between every output pair - apparently because almost everything gets re-initialized for each call.
If you monitor your network interface (say, with Wireshark), you will see the real deal: aws continuously makes DNS queries and SSL handshakes, while awsr just makes one every minute or so.
Counterargument #1: If your files are all in one place or directory hierarchy, you could just use aws s3 cp or aws s3 sync in one go. These will be as performant as awsr, if not more. However in my case, I wanted to pick 'n' choose only a subset of files in the hierarchy; and there was no easy way of doing that with the aws command alone.
Counterargument #2: If you want to upload to multiple buckets, you will have to batch up the calls bucket-wise (us-east-1 first, ap-southeast-2 next, etc.); and kill awsr after each batch - more on that later.
CloudWatch logs
Our serverless IDE Sigma generates quite a lot of CloudWatch Logs - especially when our QA battalion is testing it. To keep things tidy, I prefer to occasionally clean up these logs, via aws logs delete-log-group.

date
for i in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
    echo $i
    aws logs delete-log-group --log-group-name $i
done
date

Cleaning up 172 such log groups on us-east-1, via plain aws, takes:
- 5 minutes 44 seconds
- 1.51 MB bandwidth (1133 KB downlink, 381 KB uplink)
- 173 (1 + 172) DNS lookups + SSL handshakes; one for each log group, plus one for the initial listing
In contrast, deleting 252 groups via our new REPL awsr takes just:
- 2 minutes 41 seconds
- 382 KB bandwidth (177 KB downlink, 205 KB uplink)
- 4 DNS lookups + SSL handshakes (about 1 in each 60 seconds)
This time, per log group, about 3.1 times lower latency; or 5.8 times less bandwidth.
CloudWatch metrics
I use this script to occasionally check the sizes of our S3 buckets - to track down and remove any garbage; playing the "scavenger" role:

for bucket in `awsr s3api list-buckets --query 'Buckets[*].Name' --output text`; do
    size=$(awsr cloudwatch get-metric-statistics --namespace AWS/S3 \
        --start-time $(date -d @$((($(date +%s)-86400))) +%F)T00:00:00 --end-time $(date +%F)T00:00:00 \
        --period 86400 --metric-name BucketSizeBytes \
        --dimensions Name=StorageType,Value=StandardStorage Name=BucketName,Value=$bucket \
        --statistics Average --output text --query 'Datapoints[0].Average')
    if [ $size = "None" ]; then size=0; fi
    printf "%8.3f %s\n" $(echo $size/1048576 | bc -l) $bucket
done

Checking 45 buckets via aws (45+1 API calls to the same CloudWatch API endpoint), takes:
- 94 seconds
Checking 61 buckets (62 API calls) via awsr, takes:
- 44 seconds
Per API call, that is about 2.9 times faster.
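Since the aws and awsr runs above never processed the same number of items, the comparison only makes sense per item. The small sketch below redoes that arithmetic from the figures quoted in this post; the helper name per_item_speedup is made up for the purpose, and the item counts (files, log groups, API calls) are my reading of which denominators the quoted ratios are based on.

```python
def per_item_speedup(slow_seconds, slow_items, fast_seconds, fast_items):
    """Ratio of time-per-item between the slow run and the fast run."""
    slow_each = float(slow_seconds) / slow_items
    fast_each = float(fast_seconds) / fast_items
    return slow_each / fast_each


runs = [
    # (label, aws seconds, aws items, awsr seconds, awsr items)
    ("s3api put-object", 4 * 60 + 35, 70, 1 * 60 + 28, 72),
    ("logs delete-log-group", 5 * 60 + 44, 172, 2 * 60 + 41, 252),
    ("cloudwatch get-metric-statistics", 94, 46, 44, 62),
]

for label, slow_s, slow_n, fast_s, fast_n in runs:
    ratio = per_item_speedup(slow_s, slow_n, fast_s, fast_n)
    print("%-34s %.2fx faster per item" % (label, ratio))
```

The three ratios come out at roughly 3.2, 3.1 and 2.9, which is where the "3 times faster" claim at the top comes from.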
The catch
There are many; more unknowns than knowns, in fact:
- The REPL depends on serial communication via pipes; so you cannot run things in parallel - e.g. invoke several commands and wait for all of them to complete. (This, however, should not affect any internal parallelizations of aws-cli itself.)
- awsr may start acting up, if you cancel or terminate an already running command - also a side-effect of using pipes.
- awsr reuses internal client objects across invocations (sessions), so it is, let's say, "sticky"; it "remembers" - and does not allow you to override - the profile, region etc. across invocations. In order to start working with a new configuration, you should:
  - terminate the existing daemon:
    kill $(ps -ef -C /usr/bin/python | grep -v grep | grep awsr | awk '{print $2}')
  - in case the daemon was processing a command when it was brutally massacred, delete the pipes /tmp/awsr_rd and /tmp/awsr_wr
  - run a new awsr with the correct profile (--profile), region (--region) etc.
- awsr cannot produce interactive output - at least not yet - as it simply reads/writes from/to each pipe exactly once in a single invocation. So commands like ec2 wait and cloudformation deploy will not work as you expected.
- Currently the pipes only capture standard input and standard output; so, unless you initially launched awsr in the current console/tty, you won't be seeing any error messages (written to standard error) being generated by the underlying AWS API call/command.
- Some extensions like s3 don't seem to benefit from the caching - even when invoked against the same bucket. It needs further investigation. (Luckily, s3api works fine - as we saw earlier.)
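On the "sticky" client problem: one possible direction, which awsr does not implement today, would be to keep one client per (service, profile, region) combination instead of a single cached client. The sketch below shows only the caching idea; ClientCache and its factory argument are hypothetical names, and wiring this into the patched CLIOperationCaller is left open.

```python
class ClientCache(object):
    """Keep one client per (service, profile, region) combination.

    `factory` is any callable that builds a client for that combination;
    it is injected so the caching logic stays independent of the AWS SDK.
    """

    def __init__(self, factory):
        self._factory = factory
        self._clients = {}

    def get(self, service_name, profile=None, region=None):
        key = (service_name, profile, region)
        if key not in self._clients:
            # First request for this combination: build and remember it.
            self._clients[key] = self._factory(service_name, profile, region)
        return self._clients[key]
```

With something like this in place, switching region or profile would create a second client rather than requiring the daemon to be killed; whether the rest of the CLI driver tolerates that is a separate question.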
Bonus: hands-on AWS CLI fast automation example, FTW!
I run this occasionally to clean up our AWS accounts of old logs and build data. If you are curious, replace the awsr occurrences with aws (and remove the daemon-killing magic), and witness the difference in speed!
Caution: If there are ongoing CodeBuild builds, the last step may keep on looping - possibly even indefinitely, if the build is stuck in BUILD_IN_PROGRESS status. If you run this from a fully automated context, you may need to enhance the script to handle such cases as well.

for p in araProfile meProfile podiProfile thadiProfile ; do
    for r in us-east-1 us-east-2 us-west-1 us-west-2 ca-central-1 eu-west-1 eu-west-2 eu-central-1 \
        ap-northeast-1 ap-northeast-2 ap-southeast-1 ap-southeast-2 sa-east-1 ap-south-1 ; do

        # profile and region changed, so kill any existing daemon before starting
        arg="--profile $p --region $r"
        kill $(ps -ef -C /usr/bin/python | grep -v grep | grep awsr | awk '{print $2}')
        rm /tmp/awsr_rd /tmp/awsr_wr

        # log groups
        for i in $(awsr $arg logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
            echo $i
            awsr $arg logs delete-log-group --log-group-name $i
        done

        # CodeBuild projects
        for i in $(awsr $arg codebuild list-projects --query 'projects[*]' --output text); do
            echo $i
            awsr $arg codebuild delete-project --name $i
        done

        # CodeBuild builds; strangely these don't get deleted when we delete the parent project...
        while true; do
            builds=$(awsr $arg codebuild list-builds --query 'ids[*]' --output text --no-paginate)
            if [[ $builds = "" ]]; then break; fi
            awsr $arg codebuild batch-delete-builds --ids $builds
        done
    done
done

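For the automated case mentioned in the caution above, the simplest guard is to cap the number of delete rounds. Here is a minimal sketch of that idea in Python; the function name and the two callables (standing in for the list-builds and batch-delete-builds calls) are hypothetical, and the default cap of 10 rounds is an arbitrary choice.

```python
def delete_builds_bounded(list_builds, batch_delete, max_rounds=10):
    """Delete builds in rounds, giving up after max_rounds.

    Returns True if the build list became empty, or False if builds were
    still being listed when the cap was hit (e.g. one stuck in progress).
    """
    for _ in range(max_rounds):
        ids = list_builds()
        if not ids:
            return True
        batch_delete(ids)
    # One last look, so a successful final round is not reported as failure.
    return not list_builds()
```

A False result is the cue to log the leftover build IDs and move on to the next region, instead of looping forever.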
In closing: so, there it is!
Feel free to install and try out awsr; after all there's just one file, with fewer than a hundred lines of code!
Although I cannot make any guarantees, I'll try to eventually hunt down and fix the gaping holes and shortcomings; and any other issues that you or I come across along the way.
Over to you, soldier/beta user!