Sunday, December 16, 2018

Change the tempo of your audio track with WaveShop

If you ever want to edit audio on Windows, the free, GPL-licensed WaveShop would come up near the top of the list.

However, if you are trying to edit the tempo of your audio track using WaveShop, you would quickly notice that there is no such option (effect or plugin).

But it is possible (at least for MP3s) in a somewhat obscure way: changing the sampling rate.

I'm not an expert on acoustics, music or MP3; but a quick googling shows that the sampling rate is the number of samples of audio carried per second.

Thus, if you reduce your sampling rate to 80% of the original (say 44100 Hz → 35280 Hz), your MP3 player will think that the samples in your file were taken at 28.34-microsecond intervals instead of 22.67 - a 25% "stretch" of each sample interval.

So your track will actually play 20% slower (at 80% of the original speed), taking 2 min 5 sec if it was originally 1 min 40 sec. Exactly what we need!

To use WaveShop to change your audio track's tempo in this manner:

  • open your track in WaveShop.
  • go to Audio → Format....
  • based on the tempo change you need (say x%), compute and enter the new Sample Rate value as below; note that if you want to slow down, x would be negative! (There's a quick sketch of the arithmetic after these steps.)

new sample rate = current sampling rate × (x/100 + 1)

  • click OK.
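For example, slowing down by 20% (x = -20) takes 44100 Hz down to 35280 Hz. Here's a trivial Python sketch of the same arithmetic (the function name is mine, purely for illustration):

def new_sample_rate(current_rate_hz, tempo_change_percent):
    # tempo_change_percent is negative to slow down, positive to speed up
    return int(round(current_rate_hz * (1 + tempo_change_percent / 100.0)))

print(new_sample_rate(44100, -20))  # 35280 -> plays 20% slower
print(new_sample_rate(44100, 25))   # 55125 -> plays 25% faster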

SEO: Add meta descriptions to your AsciiBinder-powered website

Despite the importance of meta descriptions, the native AsciiBinder distribution does not add meta description tags to your output web pages.

However, Asciidoctor (which AsciiBinder uses internally) does return meta tag content as attributes in its parse output.

Patch AsciiBinder to propagate meta descriptions to your template

Apply the following patch to ASCII_BINDER_HOME/lib/ascii_binder/engine.rb to propagate these meta values into the output rendering logic:

--- /var/lib/gems/2.3.0/gems/ascii_binder-0.1.15.1/lib/ascii_binder/engine.rb
+++ /var/lib/gems/2.3.0/gems/ascii_binder-0.1.15.1/lib/ascii_binder/engine.rb
@@ -516,6 +516,13 @@
         :repo_path         => topic.repo_path,
       }

+      description = doc.attr 'description'
+      if description and description.length > 0
+          page_args['description'] = description
+      else
+          log_warn("!!! #{topic.repo_path} has no meta description!")
+      end
       full_file_text = page(page_args)
       File.write(preview_path,full_file_text)
     end

Quick 'n' dirty way to locate engine.rb

An easy way to locate engine.rb is to introduce a syntax error; just type an extra character somewhere in the middle of one of the templating (.erb) scripts in your source project. Now when you run the build, AsciiBinder will print out an error stacktrace containing the locations of each file that it tried to invoke.

...
/var/lib/gems/2.3.0/gems/tilt-2.0.8/lib/tilt/template.rb:274:in `class_eval': /mnt/c/Users/janaka/code/website/_templates/page.html.erb:2: syntax error, unexpected '\n', expecting => (SyntaxError)
...                               ^
        from /var/lib/gems/2.3.0/gems/tilt-2.0.8/lib/tilt/template.rb:274:in `compile_template_method'
...
        from /var/lib/gems/2.3.0/gems/tilt-2.0.8/lib/tilt/template.rb:109:in `render'
        from /var/lib/gems/2.3.0/gems/ascii_binder-0.1.15.1/lib/ascii_binder/template_renderer.rb:20:in `render'
        from /var/lib/gems/2.3.0/gems/ascii_binder-0.1.15.1/lib/ascii_binder/engine.rb:198:in `page'
        from /var/lib/gems/2.3.0/gems/ascii_binder-0.1.15.1/lib/ascii_binder/engine.rb:526:in `configure_and_generate_page'
...
        from /usr/local/bin/asciibinder:23:in `<main>'

The line numbers may differ based on your platform and AsciiBinder version, in which case you would need to manually add the above lines to your AsciiBinder engine.rb; right before the line full_file_text = page(page_args).

Update your template with the meta description text

In the source docs repository, pass the received meta content through to the template file; e.g. for _templates/page.html.erb:

<%= render(("_templates/_page_custom.html.erb"),
           :distro_key => distro_key,
...
           :description => defined?(description).nil? ? [topic_title, subgroup_title, group_title].compact.join(': ') : description,
...

And use it in your template (_page_custom.html.erb):

...
    <meta charset="utf-8">
    <meta content="IE=edge" http-equiv="X-UA-Compatible">
    <meta content="width=device-width, initial-scale=1.0" name="viewport">
    <meta content="<%= description %>" name="description">
...

Bonus: a warning to never forget your custom meta descriptions in AsciiBinder

If the meta description is missing, the above would default it to a concatenation of the section, subsection and topic names; however it is highly recommended to provide a good meta description via the :description: tag on each AsciiDoc source file.

With the above patch, AsciiBinder will print a warning for each file that does not have a custom meta description:

...
INFO:   - overview/terminology.adoc
WARN: !!! overview/terminology.adoc has no meta description!
INFO:   - overview/faq.adoc
WARN: !!! overview/faq.adoc has no meta description!
...

Feel free to modify the above patch to process any other desired meta tags as well, such as keywords.

Wednesday, November 28, 2018

AWS: Some Tips for Avoiding Those "Holy Bill" Moments

Cloud is awesome: almost-100% availability, near-zero maintenance, pay-as-you-go, and above all, infinitely scalable.

But the last two can easily bite you back, turning that awesomeness into a billing nightmare.

And occasionally you see stories like:

Within a week we accumulated a bill close to $10K.

Holy Bill!

And here I unveil a few tips that we learned from our not-so-smooth journey of building the world's first serverless IDE, which could help others avoid some "interesting" pitfalls.

Careful with that config!

One thing we learned was to never underestimate the power of a configuration.

If you read the above linked article you would have noticed that it was a simple misconfiguration: a CloudTrail logging config that was writing logs to one of the buckets it was already monitoring.

You could certainly come up with more elaborate and creative examples of creating "service loops" yielding billing black-holes, but the idea is simple: AWS is only as intelligent as the person who configures it.

Infinite loop

(Well, in the above case it was one of my colleagues who configured it, and I was the one who validated it; so you can stop here if you feel like it ;) )

So, when you're about to submit a new config update, try to rethink the consequences. You won't regret it.

It's S3, not your attic.

AWS has estimated that 7% of cloud billing is wasted on "unused" storage - space taken up by content of no practical use: obsolete bundles, temporary uploads, old hostings, and the like.

Life in a bucket

However, it is true that cleaning things up is easier said than done. It is way easier to forget about an abandoned file than to keep track of it and delete it when the time comes.

Probably for the same reason, S3 has provided lifecycle configurations - time-based automated cleanup scheduling. You can simply say "delete this if it is older than 7 days", and it will be gone in 7 days.

This is an ideal way to keep temporary storage (build artifacts, one-time shares etc.) in check, hands-free.

Like the daily garbage truck.

Lifecycle configs can also become handy when you want to delete a huge volume of files from your bucket; rather than deleting individual files (which in itself would incur API costs - while deletes are free, listing is not!), you can simply set up a lifecycle config rule to expire everything in 1 day. Sit back and relax, while S3 does the job for you!

{
    "Rules": [
        {
            "Status": "Enabled",
            "Prefix": "",
            "Expiration": {
                "Days": 1
            }
        }
    ]
}

Alternatively you can move the no-longer-needed-but-not-quite-ready-to-let-go stuff into Glacier, for a fraction of the storage cost; say, for stuff under the subpath archived:

{
    "Rules": [
        {
            "Filter": {
                "Prefix": "archived"
            },
            "Status": "Enabled",
            "Transitions": [
                {
                    "Days": 1,
                    "StorageClass": "GLACIER"
                }
            ]
        }
    ]
}
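Either of these rule sets can also be applied from code; for instance, here's a minimal boto3 sketch of the Glacier transition above (the bucket name is hypothetical, and note that this call replaces the bucket's entire lifecycle configuration):

import boto3

s3 = boto3.client("s3")

# transition everything under "archived" to Glacier after 1 day
s3.put_bucket_lifecycle_configuration(
    Bucket="my-awesome-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": "archived"},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
            }
        ]
    },
)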

But before you do that...

Ouch, it's versioned!

(Inspired by true events.)

I put up a lifecycle config to delete about 3GB of bucket access logs (millions of files, obviously), and thought everything was good - until, a month later, I got the same S3 bill as the previous month :(

Turns out that the bucket had had versioning enabled, so a deletion does not really delete the object; it just adds a delete marker while the old version lives on.

So with versioning enabled, you need to explicitly tell the S3 lifecycle logic to:

  • expire the noncurrent (older) object versions, and
  • clean up any expired object delete markers left behind,

in order to completely get rid of the "deleted" content and the associated delete markers.
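In lifecycle-rule terms, those two translate to the NoncurrentVersionExpiration and ExpiredObjectDeleteMarker settings (the latter cannot be combined with a Days-based Expiration in the same rule). A boto3-style sketch of such a cleanup rule, to sit alongside the expiration rule shown earlier:

# extra rule for a *versioned* bucket, to go in the same "Rules" list as before
versioned_cleanup_rule = {
    "ID": "really-delete",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    # remove the noncurrent ("deleted") versions a day after they become noncurrent
    "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
    # and drop delete markers once no versions remain behind them
    "Expiration": {"ExpiredObjectDeleteMarker": True},
}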

So much for "simple" storage service ;)

CloudWatch is your pal

Whenever you want to find out the total sizes occupied by your buckets, just iterate through your AWS/S3 CloudWatch Metrics namespace. There's no way - surprise, surprise - to check bucket size natively from S3; even the S3 dashboard relies on CloudWatch, so why not you?

Quick snippet to view everything? (uses aws-cli and bc on bash)

yesterday=$(date -d @$((($(date +%s)-86400))) +%F)
for bucket in `aws s3api list-buckets --query 'Buckets[*].Name' --output text`; do
        size=$(aws cloudwatch get-metric-statistics --namespace AWS/S3 --start-time ${yesterday}T00:00:00 --end-time $(date +%F)T00:00:00 --period 86400 --metric-name BucketSizeBytes --dimensions Name=StorageType,Value=StandardStorage Name=BucketName,Value=$bucket --statistics Average --output text --query 'Datapoints[0].Average')
        if [ $size = "None" ]; then size=0; fi
        printf "%8.3f  %s\n" $(echo $size/1048576 | bc -l) $bucket
done

EC2: sweep the garbage, plug the holes

EC2 makes it trivial to manage your virtual machines - compute, storage and networking. However, its simplicity also means that it can leave a trail of unnoticed garbage and billing leaks.

EC2

Pick your instance type

There's a plethora of settings when creating a new instance. Unless there are specific performance requirements, picking a T2-class instance type with Elastic Block Store (EBS)-backed storage and 2-4 GB of RAM would suffice for most needs.

Despite being free tier-eligible, t2.micro can be a PITA if your server could receive compute- or memory-intensive loads at some point; in these cases t2.micro tends to simply freeze (probably something to do with running out of CPU credits?), causing more trouble than it's worth.

Clean up AMIs and snapshots

We habitually tend to take periodic snapshots of our EC2 instances as backups. Some of these are made into Machine Images (AMIs) for reuse or sharing with other AWS users.

We easily forget about the other snapshots.

While snapshots don't get billed for their full volume sizes, they can add up to significant garbage over time. So it is important to periodically visit and clean up your EC2 snapshots tab.

Moreover, creating new AMIs would usually mean that older ones become obsolete; they can be "deregistered" from the AMIs tab as well.

But...

Who's the culprit - AMI or snapshot?

The actual charges are on snapshots, not on AMIs themselves.

And it gets tricky because deregistering an AMI does not automatically delete the corresponding snapshot.

You usually have to copy the AMI ID, go to snapshots, look for the ID in the description field, and nuke the matching snapshot. Or, if you are brave (and lazy), select and delete all snapshots; AWS will prevent you from deleting the ones that are being used by an AMI.

Likewise, for instances and volumes

Compute is billed while an EC2 instance is running; but its storage volume is billed all the time - right up to deletion.

Volumes usually get nuked when you terminate an instance; however, if you've played around with volume attachment settings, there's a chance that detached volumes are left behind in your account. Although not attached to an instance, these still occupy space; and so AWS charges for them.

Again, simply go to the volumes tab, select the volumes in "available" state, and hit delete to get rid of them for good.
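If you'd rather hunt these down from code (say, as a periodic check), here's a minimal boto3 sketch; the delete call is left commented out on purpose:

import boto3

ec2 = boto3.client("ec2")

# volumes in the "available" state are attached to nothing - yet still billed
pages = ec2.get_paginator("describe_volumes").paginate(
    Filters=[{"Name": "status", "Values": ["available"]}])
for page in pages:
    for vol in page["Volumes"]:
        print(vol["VolumeId"], vol["Size"], "GiB, created", vol["CreateTime"])
        # ec2.delete_volume(VolumeId=vol["VolumeId"])  # uncomment once you're sure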

Tag your EC2 stuff: instances, volumes, snapshots, AMIs and whatnot

Tag 'em

It's very easy to forget what state the instance was in at the time that snapshot was made; or the purpose of that running/stopped instance which nobody seems to take ownership or responsibility of.

Naming and tagging can help avoid unpleasant surprises ("Why on earth did you delete that last month's prod snapshot?!"); and also help you quickly decide what to toss ("We already have an 11-05 master snapshot, so just delete everything older than that").

You stop using, and we start billing!

Sometimes, the AWS Lords work in mysterious ways.

For example, Elastic IP Addresses (EIPs) are free as long as they are attached to a running instance. But they start getting charged by the hour, as soon as the instance is stopped; or if they get into a "detached" state (not attached to a running instance) in some way.
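Detached EIPs are easy to spot from code as well; a quick boto3 sketch, purely for illustration:

import boto3

ec2 = boto3.client("ec2")

# an Elastic IP without an AssociationId is not attached to anything - and is being billed
for addr in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in addr:
        print("idle EIP:", addr["PublicIp"], addr.get("AllocationId"))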

Some prior knowledge about the service you're about to sign up for can prevent nasty surprises of this fashion; a quick look at the pricing page, or a quick Google search, can make all the difference.

Pay-per-use vs pay-per-allocation

Many AWS services follow one or both of the above patterns. The former is trivial (you simply pay for the time/resources you actually use, and enjoy a zero bill for the rest of the time) and hard to miss; but the latter can be a bit obscure and quite easily go unnoticed.

Consider EC2: you mainly pay for instance runtime but you also pay for the storage (volumes, snapshots, AMIs) and network allocations (like inactive Elastic IPs) even if your instance has been stopped for months.

There are many more examples, especially in the serverless domain (which we ourselves are incidentally more familiar with):

Each block adds a bit more to your cost.

Meanwhile, some services secretly set up their own monitoring, backup and other "utility" entities. These, although (probably!) meant to do good, can secretly seep into your bill:

These are the main culprits that often appear in our AWS bills; certainly there are better examples, but you get the point.

CloudWatch (yeah, again)

Many services already—or can be configured to—report usage metrics to CloudWatch. Hence, with some domain knowledge of which metric maps into which billing component (e.g. S3 storage cost is represented by the summation of the BucketSizeBytes metric across all entries of the AWS/S3 namespace), you can build a complete billing and monitoring solution around CloudWatch Metrics (or delegate the job to a third-party service like DataDog).

CloudWatch

CloudWatch in itself is mostly free, and its metrics have automatic summarization mechanisms so you don't have to worry about overwhelming it with age-old garbage—or getting overwhelmed with off-the-limit capacity bills.

The Billing API

Although AWS does have a dedicated Billing Dashboard, logging in and checking it every single day is not something you would add to your agenda (at least not if you have an API/CLI mindset like you and me).

Luckily, AWS offers a billing API whereby you can obtain a fairly granular view of your current outstanding bill, over any preferred time period - broken down by services or actual API operations.

The catch is, this API is not free: each invocation costs you $0.01. Of course this is negligible - considering the risk of having to pay several dozen, or even hundreds or thousands of dollars in some cases - it is worth having a $0.30/month billing monitor to track down any anomalies before it's too late.
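The API in question is presumably the Cost Explorer API (that's where the $0.01-per-request pricing comes from); a minimal boto3 sketch of a daily, per-service breakdown, with placeholder dates:

import boto3

ce = boto3.client("ce")  # Cost Explorer; note that each request is billed at $0.01

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2018-11-01", "End": "2018-11-28"},  # placeholder date range
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(day["TimePeriod"]["Start"], group["Keys"][0], cost)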

Food for thought: with support for headless Chrome offered for Google Cloud Functions, one might be able to set up a serverless workflow that logs into the AWS dashboard and checks the bill for you. Something to try out during free time (if some ingenious folk hasn't hacked it together already).

Billing alerts

Strangely (or perhaps not ;)) AWS doesn't offer a way to put up a hard limit for billing; despite the numerous user requests and disturbing incident reports all over the web. Instead, they offer alerts for various billing "levels"; you can subscribe for notifications like "bill at x% of the limit" and "limit exceeded", via email or SNS (handy for automation via Lambda!).

My advice: this is a must-have for every AWS account. If we had had one in place, it could already have saved us thousands of dollars to date.
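Under the hood, these alerts are CloudWatch alarms on the AWS/Billing EstimatedCharges metric (published in us-east-1, once billing alerts are enabled on the account). A sketch of setting one up via boto3; the threshold and SNS topic ARN are hypothetical:

import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")  # billing metrics live in us-east-1

cw.put_metric_alarm(
    AlarmName="monthly-bill-over-100-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,          # the metric is only published a few times a day
    EvaluationPeriods=1,
    Threshold=100.0,             # hypothetical monthly limit, in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical SNS topic
)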

Credit cards

Organizational accounts

If you want to delegate AWS access to third parties (testing teams, contract-basis devs, demo users etc.), it might be a good idea to create a sub-account by converting your root account into an AWS organization with consolidated billing enabled.

(While it is possible to do almost the same using an IAM user, it will not provide resource isolation; everything would be stuffed in the same account, and painstakingly complex IAM policies may be required to isolate entities across users.)

Our CEO and colleague Asankha has written about this quite comprehensively so I'm gonna stop at that.

And finally: Monitor. Monitor. Monitor.

No need to emphasize this - my endless ramblings should already have conveyed its importance.

So, good luck with that!

A Few Additions to Your Bag of Maven-Fu

Apache Maven

Apache Maven is simple, yet quite powerful; with a few tricks here and there, you can greatly simplify and optimize your dev experience.

Working on multiple, non-colocated modules

Say you have two utility modules foo and bar from one master project A, and another project B which pulls in foo and bar.

While working on B, you realize that you need some occasional tweaks done on foo and bar as well; however, since they are in a different project, you would usually need to

  • switch to A
  • make the change
  • mvn install
  • switch back to B
  • and "refresh" dependencies (if you're on an IDE).

Every time there's a need to make an upstream change.

With Maven, instead you can temporarily "merge" the three pieces with a mock master POM that defines foo, bar and B as child modules:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <!-- blah blah blah (groupId, artifactId, version) -->
    <packaging>pom</packaging>

    <modules>
        <module>foo</module>
        <module>bar</module>
        <module>B</module>
    </modules>
</project>

What's in it for me?

IDEs (like IntelliJ IDEA) will identify the root as a multi-module project; which means you'll be able to:

  • browse seamlessly from one module to another. No more decompilation, bytecode incompatibilities or source maps; handy when you're searching for usages of some class or method - or refactoring one - across your "composite" project's scope.
  • modify sources and resources of each module on demand, within the same project window. The IDE will automatically recompile the changes and add everything to the runtime classpath; handy for in-IDE testing and debugging with hot reload.
  • version seamlessly. If the three modules are under different VCS roots, IDEs like IDEA will track each repo individually; if you commit one set of changes, each repo will have a new commit reflecting its own part of the change; all with the same message!

Meanwhile, plain Maven will build foo/bar (as required) and B in proper sequence, if the root module is built - exactly what we would have wanted.

Relative paths, FTW!

Even if the modules are scattered all over the filesystem, Maven can resolve them easily via relative paths:

    <modules>
        <module>../foo</module>
        <module>grandpa/papa/bar</module>
        <module>../../../../../media/disk2/repos/top-secret/B</module>
    </modules>

Drop that clean

Perhaps the most used (hence the most misused) Maven command is:

mvn clean install

The de-facto command that gets run right after you make some change to your project.

And, for most scenarios, it's gross overkill.

From scratch?

The command combines two lifecycle phases - "stages" of a Maven build process. Phases have a definite sequence; so if you request some phase to run, all previous phases in its lifecycle will run before it. But Maven plugins are smart enough to skip their work if they detect that they don't have anything to do; e.g. no compilation will happen when compiled classes are up-to-date.

Now, clean is not part of the default lifecycle; rather it is used to start from scratch by removing the entire target directory. On the other hand, install is almost the end of the line (just before deploy in the default lifecycle).

mvn clean install will run both these phases; and, thanks to clean, everything in between as well.

It's handy when you want to clean up everything, and end up with the latest artifacts installed into your local Maven repo. But in most cases, you don't need all of that.

Besides, install will eventually clutter your local Maven cache; especially if you do frequent snapshots/releases with MB- or GB-sized bundles.

Be lazy; do only what's necessary!

Yawn!

If you updated one source file, and want to propagate it to the target/classes dir:

mvn compile

where Maven will auto-detect any changes - and skip compilation entirely if there are none.

If the change was in a test class or resource:

mvn test-compile

will get it into target/test-classes.

Just to run the tests (which will automatically compile any dirty source/test classes):

mvn test

To get a copy of the final bundle in target:

mvn package

As you might often want to start with a clean slate before doing the packaging:

mvn clean package

Likewise, just specify the end phase you need - and prepend clean only when you really want to start from scratch. You will save a whole lot of time, processing power, and temper.

Meanwhile in production...

If your current build would go into production, just forget most of the above ;)

mvn clean package

While any of the "sub-commands" should theoretically do the same thing, you don't want to take chances ;)

While I use package above, install could arguably be even better; because then you'll have a copy of the production artifact in your .m2/repository - which could be a lifesaver if you lose the delivered/deployed copy.

More skips...

--no-snapshot-updates

If you have watched a build that involves snapshot dependencies closely, you'd have noticed it taking several seconds to search for Maven metadata files for the snapshots (and failing in the end with warnings, unless you have a habit of publishing snapshot artifacts to a remote repository).

This is usually pointless if you're also building the snapshot dependencies locally, so you can disable the metadata check (and snapshot sync attempt) via the --no-snapshot-updates or -nsu parameter.

Of course -o would prevent all remote syncs; but you can't use it if you actually want to pull some of the dependencies, in which case -nsu would help.

You can skip compile!

Just like the (in)famous -Dmaven.test.skip (or -DskipTests), you can skip the compilation step (even if there are code changes) via -Dmaven.main.skip. Handy when you just want to run tests without going through the compilation overhead - if you know the stuff is already compiled, of course. Just like -DskipTests, but the other way around!

(Kudos to this SO post)

Skip, skip, skip.

Continuation: -rf

You might already know that, if a module fails in the middle of a build, you can resume the build from that point via the -rf :module-name parameter.

This parameter is not limited to failure scenarios, though; it works just as well for ordinary builds. If you have 30 modules but you just want to build the last 5, just run with -rf :name-of-26th-module.

Tasty testing tricks

Inheriting tests

Generally Maven artifacts don't include test classes/resources. But there are cases where you want to inherit some base test classes into child modules.

With the test-jar specifier, you can inherit an artifact that only contains test classes and resources:

        <dependency>
            <groupId>com.acme</groupId>
            <artifactId>foo</artifactId>
            <version>3.2.1</version>
            <type>test-jar</type>
            <scope>test</scope>
        </dependency>

The corresponding build configuration on the "depended-upon" module (the one providing the test classes) would look like:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>test-jar</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

One caveat is that transitive test dependencies are not inherited in this process, and have to be manually specified again at each usage of the test JAR. (At least I haven't come across a better alternative.)

If you're working on one test case, don't run the whole bunch.

-Dtest=com.acme.my.pkg.Test can single-out your WIP test, so you can save plenty of time.

Depending on your test runner, -Dtest may support wildcard selectors as well.

Of course you can temporarily modify the <includes> or <excludes> arrays of your test plugin config (e.g. Surefire) to limit the set of runnable tests.

Debuggin' it

Beautiful, but still a bug!

Debug a Maven test?

If your test runner (e.g. SureFire) allows you to customize the command line or JVM args used for the test, you can easily configure the forked JVM to wait for a debugger before the test starts executing:

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <!-- ... -->
                    <configuration>
                        <argLine>-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,address=8000</argLine>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>

Debug Maven itself?!

If you're writing or troubleshooting a Maven plugin or extension it would be handy to run Maven itself in debug mode.

Maven is ultimately Java, so you can simply grab the final command that gets run when you invoke mvn, and re-run it with the -Xdebug... params.

But Maven already has a way cooler mvnDebug command that does this automatically for you. It has the same syntax as mvn so is pretty easy to get used to.

Once invoked, it will by default listen on port 8000 for a debugger, and start executing as soon as one gets attached; and stop at breakpoints, reveal internal states, allow expression evaluations, etc.

Look at the logs!!!

This deserves its own section, because we are very good at ignoring things - right in front of our eyes.

Right at the start

I bet there's a 95% chance that Maven will be spewing out at least one [WARNING] at the start of your build. While we almost always ignore or "postpone" them, they will bite back at some point in the future.

Right before the end

If there's a compile, test or other failure, Maven will try to help by dumping the context (stacktraces, [ERROR]s etc.). Sometimes you'd need to scroll back a page or two to find the actual culprit, so don't give up and smack your computer in the face at the first attempt.

Recently I spent almost an hour trying to figure out why a -rf :'d build was failing; while the same thing was succeeding when starting from scratch. In the end it boiled down to two little [WARNING] lines about a systemPath dependency resolution error. Right in front of my eyes, yet so invisible.

Stupid me.

Desperate times, -X measures

In some cases, where the standard Maven output is incapable of pinpointing the issue, running Maven in debug mode (-X) is the best course of action. While its output can be daunting, it includes everything Maven (and you) need to know during the build: plugin versions, compile/test dependency trees, external command invocations (e.g. tests), and so on; so you can dig deep and find the culprit.

As always, patience is a virtue.

Final words

As with anything else, when using Maven,

  • know what you're doing.
  • know what you really want to do.
  • believe Maven can do it, until proven otherwise ;)

Monday, November 12, 2018

Browse and Fetch Stuff from Remote Zip Files - Without Downloading!

Before you begin: aatishnn has a way better implementation here, for HTTP-hosted zipfiles; continue reading if you are interested in knowing the internals, or check out my version if you are interested in checking out files on non-HTTP sources (AWS S3, local/network filesystems, etc.).

Okay, now don't say I'm a zip-worshipper or something :)

But I have seriously been bothered by my lack of control over remote-hosted archive files (zip, tar, jar, whatever); I simply have to download the whole thing, even if I'm interested in just one tiny file—or worse, even when I just want to verify that the file is actually in there, and nothing more.

Zip it down

Google: "view zip content online" (?)

While there are plenty of services that provide ways to check remote zipfile content, it's not as convenient as running a command on my local terminal - plus, more often than not, you don't want your content exposed to a third party, especially in the case of protected storage buckets.

Appears that I'm not alone.

The Odyssey begins

Recently I thought what the heck, and started looking into possible hacks - to view content or partially extract a zipfile without downloading the whole thing.

This SO post lit a spark of hope, and the wiki page set it ablaze.

The central directory

The zip standard defines a central directory (call it CD) that contains an index (listing) of all files in the archive. Each entry contains a pointer (offset) to the (usually compressed) content of the actual file so you can retrieve only the required file(s) without scanning the whole archive.

Problem is, this (obviously) requires random access on the archive file.

And... the central directory is located at the end of the archive, rather than the beginning.

Grabbin' by its tail

Lucky for us, the HTTP Range header comes to the rescue. With it, one can fetch a specific range of bytes (Range: bytes={start}-{end}) of an HTTP entity - just the thing we need to fetch the last few bytes of the archive.

The header (or rather the concept, when you are at higher levels of abstraction) is supported by other storage providers as well, such as AWS S3. Some web servers may not support it, but most of the popular/standard ones do; so we're mostly covered on that front as well.

But there's another problem. How can we know how big the central directory is; so we can retrieve exactly the right chunk of bytes, rather than blindly scanning for the beginning-of- and end-of-directory records?

And the beauty is: Phil Katz had thought all this through.

Zip file anatomy

CD, EOCD, WTF?

At the very end of the archive, there's a 22-byte chunk called the end of central directory record (I call it EOCD). (Theoretically this could be (22 + n) bytes long, where n is the length of a trailing comment field; but come on, who would bother to add a comment to that obscure, unheard-of thing?) The last few bytes of this chunk contain all you need to know about the central directory!

ls -lR foo.zip/

Our path is now clearly laid out:

  1. get the total length of the archive via a HEAD request (say L);
  2. fetch the last 22 bytes of the file, say, via a ranged GET (Range: bytes=(L-22)-(L-1));
  3. read bytes 13-16 of the EOCD (in little endian encoding) to find the CD length (say C);
  4. read bytes 17-20 of the EOCD to find the CD start offset inside the archive (say O);
  5. do another ranged GET (Range: bytes=O-(O+C-1)) to fetch the CD;
  6. append the EOCD to the CD, and pray that the resulting blob will be readable as a zip archive!

Guess what, it works!

Zip, you beauty.

The zip format is designed to be highly resilient:

  1. you can stuff anything at the beginning or middle (before the CD) of the file (which is how self-extracting archives (SFX) work; they have an extractor program (executable chunk) right before the zipfile content);
  2. you can corrupt the actual content of any file inside the archive, and continue to work with the other uncorrupted entries as if nothing happened (as long as the corruption doesn't affect the size of the content);
  3. and now, apparently, you can even get rid of the content section entirely!

Everything is fine, as long as the CD and EOCD records are intact.

In fact, even if either (or both) of them is corrupt, the parser program would still be able to read and recover at least part of the content by scanning for the record start signatures.

So, in our case, the EOCD + CD chunk is all we need in order to read the whole zipfile index!

Enough talking, here's the code:

import io
import zipfile

import boto
from boto.s3.connection import OrdinaryCallingFormat

# lazily-initialized S3 handles (the two "hacky" globals mentioned below)
_bucket = None
_key = None

def fetch(file, start, len):
 global _key
 (bucket, key) = resolve(file)
 end = start + len - 1

 init(bucket, key)
 return _key.get_contents_as_string(headers={"Range": "bytes=%d-%d" % (start, end)})

def head(file):
 global _key
 (bucket, key) = resolve(file)
 init(bucket, key)
 return _key.size

def resolve(file):
 if file.find("s3://") < 0:
  raise ValueError("Provided URL does not point to S3")
 return file[5:].split("/", 1)

def init(bucket, key):
 global _bucket, _key
 if not _bucket:
  # OrdinaryCallingFormat prevents certificate errors on bucket names with dots
  # https://stackoverflow.com/questions/51604689/read-zip-files-from-amazon-s3-using-boto3-and-python#51605244
  _bucket = boto.connect_s3(calling_format=OrdinaryCallingFormat()).get_bucket(bucket)
 if not _key:
  _key = _bucket.get_key(key)

def parse_int(bytes):
 return ord(bytes[0]) + (ord(bytes[1]) << 8) + (ord(bytes[2]) << 16) + (ord(bytes[3]) << 24)



# 'file' is an s3:// URL pointing at the target archive (hypothetical example)
file = "s3://some-bucket/path/to/archive.zip"

size = head(file)
eocd = fetch(file, size - 22, 22)

cd_start = parse_int(eocd[16:20])
cd_size = parse_int(eocd[12:16])

cd = fetch(file, cd_start, cd_size)
zip = zipfile.ZipFile(io.BytesIO(cd + eocd))

(Extract of the full source, with all bells and whistles)

It's in my favorite Python dialect (2.7; though 3.x shouldn't be far-fetched), topped with boto for S3 access and io for wrapping byte chunks for streaming.

What? S3?? I need HTTP!

If you want to operate on a HTTP-served file instead, you can simply replace the first four methods (plus the two "hacky" global variables) with the following:

import requests

def fetch(file, start, len):
 return requests.get(file, headers={"Range": "bytes=%d-%d" % (start, start + len - 1)}).content

def head(file):
 return int(requests.head(file).headers["Content-Length"])

A word of gratitude

Lucky for us, Python's zipfile module is as resilient as the standard zip spec, meaning that it handles our fabricated zipfile perfectly, no questions asked.

Mission I: accomplished

Mission::Accomplished

Mission II

Okay, now that we know what's in the zipfile, we need to see how we can grab the stuff we desire.

The ZipInfo object, which we used earlier for listing the file details, already knows the offset of each file in the archive (as it is - more or less - a parsed version of the central directory entry) in the form of the header_offset attribute.

Negative offset?

However, if you check the raw header_offset value for one of the entries you just got, I bet you'll be confused, at least a lil' bit; because the value is negative!

That's because header_offset is relative to the CD's start position (O); so if you add O to header_offset you simply get the absolute file offset, right away!

Offset of what, sire?

Remember I said, that the above logic gives you the file offset?

Well, I lied.

As you might have guessed, it's the offset of the beginning of another metadata record: the local file header.

Now, now, don't make that long face (my, what a long face!) for the local file header is just a few dozen bytes long, and is immediately followed by the actual file content!

Almost there...

The local file header is a simpler version of a CD file entry, and all we need to know is that it could be (30 + n + m) bytes long; where n is the file name length (which we already know, thanks to the CD entry) and m is the length of an "extra field" (which is usually empty).

The header is structured such that

  • name-length (2 bytes),
  • extra-field-length (2 bytes),
  • name (n bytes), and
  • extra-field (m bytes)

appear in sequence, starting at offset 26 on the header. So if we go to the header offset, and step forward by (26 + 2 + 2 + n + m) we'll land right at the beginning of the compressed data!

And... the CD entry (ZipInfo thingy) already gave us the length of the compressed chunk; compress_size!

The Grand Finale

From there onwards, it's just a matter of doing another ranged GET to fetch the compressed content of our file entry of interest, and deflate it if the compress_type flag indicates that it is compressed:

import zlib

# pick the entry we're after, from the index ("zip") we fabricated earlier
zi = zip.getinfo("path/inside/the/archive.txt")  # hypothetical entry name

file_head = fetch(file, cd_start + zi.header_offset + 26, 4)
name_len = ord(file_head[0]) + (ord(file_head[1]) << 8)
extra_len = ord(file_head[2]) + (ord(file_head[3]) << 8)

content = fetch(file, cd_start + zi.header_offset + 30 + name_len + extra_len, zi.compress_size)

if zi.compress_type == zipfile.ZIP_DEFLATED:
 print zlib.decompressobj(-15).decompress(content)
else:
 print content

Party time!

Auditing bandwidth usage

So, in effect, we had to fetch:

  • a header-only response for the total length of the file,
  • 22 bytes of EOCD,
  • the CD, which would usually be well below 1MB even for an archive with thousands of entries,
  • 4 bytes of the local file header (just the name and extra-field lengths; the rest of the header, the name and the extra field get skipped over), and
  • the actual (usually compressed) file entry

instead of one big fat zipfile, most of which we would have discarded afterwards if it was just one file that we were after :)

Even if you consider the HTTP overhead (usually < 1KB per request), we only spent less than 6KB beyond the size of the central directory and the actual compressed payload combined.

Which can, in 90-95% of the cases, be a significant improvement over fetching the whole thing.

In conclusion

Of course, everything I laboriously explained above might mean nothing to you if you actually need the whole zipfile - or a major part of it.

But when you want to

  • check if that big fat customer artifact (which you uploaded yesterday, but the local copy was overwritten by a later build; damn it!) has all the necessary files;
  • list out the content of a suspicious file that you encountered during a cyberspace hitchhike;
  • ensure that you're getting what you're looking for, right before downloading a third-party software package archive (ZIP, JAR, XPI, WAR, SFX, whatever);
  • grab that teeny weeny 1KB default configs file from that 4GB game installer (you already have it installed but the configs are somehow fudged up);

you would have a one-liner at hand that would just do it, cheap and fast...

...besides, every geek dev loves a little hackery, don't you?

Geek, ain't you?

Serverless Security: Putting it on Autopilot

Ack: This article is a remix of stuff learned from personal experience as well as from multiple other sources on serverless security. I cannot list down or acknowledge all of them here; nevertheless, special thanks should go to The Register, Hacker Noon, PureSec, and the Serverless Status and Serverless (Cron)icle newsletters.

We all love to imagine that our systems are secure. And then...

BREACH!!!

A very common nightmare shared by every developer, sysadmin and, ultimately, CISO.

You'd better inform the boss...

Inevitable?

One basic principle of computer security states that no system can attain absolute security. Just like people: nobody is perfect. Not unless it is fully isolated from the outside; which, by today's standards, is next to impossible - besides, what's the point of having a system that cannot take inputs and provide outputs?

Whatever advanced security precaution you take, attackers will eventually find a way around. Even if you use the most stringent encryption algorithm with the longest possible key size, attackers will eventually brute-force their way through; although it could be time-wise infeasible at present, who can guarantee that a bizarre technical leap would render it possible tomorrow, or the next day?

But it's not the brute-force that you should really be worried about: human errors are way more common, and can have devastating effects on systems security; much more so than a brute-forced passkey. Just have a peek at this story where some guys just walked into the U.S. IRS building and siphoned out millions of dollars, without using a single so-called "hacking" technique.

As long as systems are made and operated by people—who are error-prone by nature—they will never be truly secure.

Remember those old slides from college days?

So, are we doomed?

No.

Ever seen the insides of a ship?

How its hull is divided into compartments—so that one leaking compartment does not cause the whole ship to sink?

People often apply a similar concept in designing software: multiple modules so that one compromised module doesn't bring the whole system down.

A ship's watertight hull compartments

Combined with the principle of least privilege, this means that a compromised component can endanger only the smallest possible slice of the overall system's security - ideally the attacker will only be able to wreak havoc within the bounds of that module's security scope, never beyond.

Reducing the blast radius of the component, and consequently the attack surface that it exposes for the overall system.

A security sandbox, you could say.

And a pretty good one at that.

PoLP: The Principle of Least Privilege

Never give someone - or something - more freedom than they need.

More formally,

Every module must be able to access only the information and resources that are necessary for its legitimate purpose. - Wikipedia

This way, if the module misbehaves (or is forced to misbehave, by an entity with malicious intent—a hacker, in English), the potential harm it can cause is minimized; without any preventive "action" being taken, and even before the "breach" is identified!

It never gets old

While the principle was initially brought up in the context of legacy systems, it is all the more applicable to "modern" architectures: SOA (well, maybe not so "modern"), microservices, and FaaS (serverless functions - hence serverless security).

The concept is pretty simple: use the underlying access control mechanisms to restrict the permissions available for your "unit of execution"; may it be a simple HTTP server/proxy, web service backend, microservice, container, or serverless function.

Meanwhile, in the land of no servers...

With increased worldwide adoption of serverless technologies, the significance of serverless security, and the value of our PoLP, is becoming more obvious than ever.

Server-less = effort-less

Not having to provision and manage the server (environment) means that serverless devops can proceed at an insanely rapid pace. With CI/CD in place, it's just a matter of code, commit and push; everything would be up and running within minutes, if not seconds. No SSH logins, file uploads, config syncs, service restarts, routing shifts, or any of the other pesky devops chores associated with a traditional deployment.

"Let's fix the permissions later."

Alas, that's a common thing to hear among those "ops-free" devs (like myself). You're in a hurry to push the latest updates to staging, and the "easy path" to avoid a plethora of "permission denied" errors is to relax the permissions on your FaaS entity (AWS Lambda, Azure Function, whatever).

Staging will soon migrate to prod. And so will your "over-permissioned" function.

And it will stay there. Far longer than you think. You will eventually shift your traffic to updated versions, leaving behind the old one untouched; in fear of breaking some other dependent component in case you step on it.

And then come the sands of time, covering the old function from everybody's memories.

An obsolete function with unpatched dependencies and possibly flawed logic, having full access to your cloud resources.

A serverless time bomb, if there ever was one.

Waiting for the perfect time... to explode

Yes, blast radius; again!

If we adhere to the least privilege principle, right from the staging deployment, it would greatly reduce the blast radius: by limiting what the function is allowed to do, we automatically limit the "extent of exploitation" upon the rest of the system if its control ever falls into the wrong hands.

Nailing serverless security: on public cloud platforms

These things are easier said than done.

At the moment, among the leaders of public-cloud FaaS technology, only AWS has a sufficiently flexible serverless security model. GCP automatically assigns a default project-level Cloud Platform service account to all its functions in a given project, meaning that all your functions will be in one basket in terms of security and access control. Azure's IAM model looks more promising, but it still lacks the cool stuff like automatic role-based runtime credential assignments available in both AWS and GCP.

AWS has applied its own IAM role-based permissions model for its Lambda functions, granting users the flexibility to define a custom IAM role—with fully customizable permissions—for every single Lambda function if so desired. It has an impressive array of predefined roles that you can extend upon, and has well-defined strategies for scoping permission to resource or principal categories, merging rules that refer to the same set of resources or operations, and so forth.

This whole hierarchy finally boils down to a set of permissions, each of which takes a rather straightforward format:

{
    "Effect": "Allow|Deny",
    "Action": "API operation matcher (pattern), or array of them",
    "Resource": "entity matcher (pattern), or array of them"
}

In English, this simply means:

Allow (or deny) an entity (user, EC2 instance, lambda; whatever) that possesses this permission, to perform the matching API operation(s) against the matching resource(s).

(There are non-mandatory fields Principal and Condition as well, but we'll skip them here for the sake of brevity.)

Okay, okay! Time for some examples.

{
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-awesome-bucket/*"
}

This allows the assignee to put an object (s3:PutObject) into the bucket named my-awesome-bucket.

{
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-awesome-*"
}

This is similar, but allows the put to be performed on any bucket whose name begins with my-awesome-.

{
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": "*"
}

This allows the assignee to do any S3 operation (get/put object, delete object, or even delete bucket) against any bucket in its owning AWS account.

And now the silver bullet:

{
    "Effect": "Allow",
    "Action": "*",
    "Resource": "*"
}

Yup, that one allows oneself to do anything on anything in the AWS account.

The silver bullet

Kind of like the AdministratorAccess managed policy.

And if your principal (say, lambda) gets compromised, the attacker effectively has admin access to your AWS account!

A serverless security nightmare. Needless to say.

To be avoided at all cost.

Period.

In that sense, the best option would be a series of permissions of the first kind; ones that are least permissive (most restrictive) and cover a narrow, well-defined scope.

How hard can that be?

The caveat is that you have to do this for every single operation within that computation unit—say lambda. Every single one.

And it gets worse when you need to configure event sources for triggering those units.

Say, for an API Gateway-triggered lambda, where the API Gateway service must be granted permission to invoke your lambda in the scope of a specific APIG endpoint (in CloudFormation syntax):

{
  "Type": "AWS::Lambda::Permission",
  "Properties": {
    "Action": "lambda:InvokeFunction",
    "FunctionName": {
      "Ref": "LambdaFunction"
    },
    "SourceArn": {
      "Fn::Sub": [
        "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${__ApiId__}/*/${__Method__}${__Path__}",
        {
          "__Method__": "POST",
          "__Path__": "/API/resource/path",
          "__ApiId__": {
            "Ref": "RestApi"
          }
        }
      ]
    },
    "Principal": "apigateway.amazonaws.com"
  }
}

Or for a Kinesis stream-powered lambda, in which case things get more complicated: the Lambda function requires access to watch and pull from the stream, while the Kinesis service also needs permission to trigger the lambda:

  "LambdaFunctionExecutionRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
      "ManagedPolicyArns": [
        "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
      ],
      "AssumeRolePolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Action": [
              "sts:AssumeRole"
            ],
            "Effect": "Allow",
            "Principal": {
              "Service": [
                "lambda.amazonaws.com"
              ]
            }
          }
        ]
      },
      "Policies": [
        {
          "PolicyName": "LambdaPolicy",
          "PolicyDocument": {
            "Statement": [
              {
                "Effect": "Allow",
                "Action": [
                  "kinesis:GetRecords",
                  "kinesis:GetShardIterator",
                  "kinesis:DescribeStream",
                  "kinesis:ListStreams"
                ],
                "Resource": {
                  "Fn::GetAtt": [
                    "KinesisStream",
                    "Arn"
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "LambdaFunctionKinesisTrigger": {
    "Type": "AWS::Lambda::EventSourceMapping",
    "Properties": {
      "BatchSize": 100,
      "EventSourceArn": {
        "Fn::GetAtt": [
          "KinesisStream",
          "Arn"
        ]
      },
      "StartingPosition": "TRIM_HORIZON",
      "FunctionName": {
        "Ref": "LambdaFunction"
      }
    }
  },
  "KinesisStreamPermission": {
    "Type": "AWS::Lambda::Permission",
    "Properties": {
      "Action": "lambda:InvokeFunction",
      "FunctionName": {
        "Ref": "LambdaFunction"
      },
      "SourceArn": {
        "Fn::GetAtt": [
          "KinesisStream",
          "Arn"
        ]
      },
      "Principal": "kinesis.amazonaws.com"
    }
  }

So you see, with this granularity, comes great power as well as great responsibility. One missing permission—heck, one mistyped letter—and it's 403 AccessDeniedException.

There's no easy way; you just have to track down every AWS resource that triggers or is accessed by your function, look up the docs, pull out your hair, and come up with the necessary permissions.

But... but... that's too much work!

Yup, it is. If you're doing it manually.

But who drives manual these days? :)

Fortunately there are quite a few options, if you're already into automating stuff:

serverless-puresec-cli: thanks PureSec!

If you're using the famous Serverless Framework - which means you're already covered on the trigger permissions front - there's the serverless-puresec-cli plugin from Puresec.

Puresec

The plugin can statically analyze your lambda code and generate a least-privilege role. Looks really cool, but the caveat is that you have to run the serverless puresec gen-roles command before every deployment with code changes; I couldn't yet find a way to run it automatically - during serverless deploy, for example. Worse, it just prints the generated roles to stdout, so you have to manually copy-paste them into serverless.yml, or use some other voodoo to actually inject them into the deployment configuration (hopefully things will improve in the future :))

AWS Chalice: from the Gods

If you're a Python fan, Chalice is capable of auto-generating permissions for you, natively. Chalice is awesome in many aspects; super-fast deployments, annotation-driven triggers, little or no configurations to take care of, and so forth.

AWS Chalice

However, despite being a direct hand-down from the AWS gods, it seems to have missed the word "minimal" when it comes to permissions; if you have the code to list the contents of some bucket foo, it will generate permissions for listing the content of all buckets in the AWS account ("Resource": "*" instead of "Resource": "arn:aws:s3:::foo"), not just the bucket you are interested in. Not cool!

No CLI? go for SLAppForge Sigma

If you're a beginner, or not that fond of CLI tooling, there's Sigma from SLAppForge.

SLAppForge Sigma

Being a fully-fledged browser IDE, Sigma will automatically analyze your code as you compose (type or drag-n-drop) it, and derive the necessary permissions—for the Lambda runtime as well as for the triggers—so you are fully covered. The recently introduced Permission Manager also allows you to modify these auto-generated permissions if you desire; for example, if you are integrating a new AWS service/operation that Sigma doesn't yet know about.

Plus, with Sigma, you never have to worry about any other configurations; resource configs, trigger mappings, entity interrelations and so forth—the IDE takes care of it all.

The caveat is that Sigma only supports NodeJS at the moment; but Python, Java and other cool languages are on their way!

(Feel free to comment below, if you have other cool serverless security policy generation tools in mind! And no, AWS Policy Generator doesn't count.)

In closing

The least privilege principle is crucial for serverless security, and for software design in general; sooner or later, it will save your day.

Lambda's highly granular IAM permission model is ideal for the PoLP.

Tools like the Puresec CLI plugin, all-in-one Sigma IDE and AWS Chalice can automate security policy generation; making your life easier, and still keeping the PoLP promise.

Wednesday, July 18, 2018

My bots are now placeless. Homeless. Serverless.

I usually keep an eye on various websites - for latest publications, hot new offers, limited-time games and contests, and the like.

Most of these do not offer a "clean" notification system, such as an RSS feed. So I often have to scrape their HTML to get to what I need.

Which means I often need to run some custom string manipulation magic to get to what I need.

And I need it to be periodic (who knows when the next hot update would surface?).

And automatic (I have more important things to do during my day).

And remotely hosted (I don't want to keep my laptop running 24×7, with an uninterrupted internet connection).

So far I have been relying on Google Apps Script (and more recently, Google App Engine) for driving these sorts of home-made integration "snippets"; however, with the whole world immersing itself in serverless, why shouldn't I?

So I set out to migrate one of my scripts, written for monitoring a Chinese retail website. The site occasionally publishes various discounted offers and seasonal games where I can earn nice coupons and credits via daily plays. But for some reason the site does not send out promotional emails to my email address, which means I have to keep checking the site every once in a while just to make sure that I won't miss something cool.

And you know the drill.

I forget things easily. Sometimes, when I'm away from my computer, I miss the reminder as well. Sometimes I'm just too lazy to look things up, because I end up with nothing new, 75-80% of the time. So many excuses...

Besides, who in their right developer mind wants to do something as boring as that, when you can just set up a bot, sit back, and relax?!

I started off with AWS Lambda, the obvious choice for free serverless computing. Its non-expiring free tier gives me an unbelievable 3.2M (yes, million) seconds of runtime per month - I can virtually keep one lambda running forever, and a little bit more! - across 1M (million again!) invocations. Previously on Apps Script or App Engine I had just 90 minutes per day - a little over 160K seconds per month - meaning that I had to use my quotas very sparingly; but now I can let go of my fears and fully enjoy my freedom of development. Not to mention the fully-fledged container environment in contrast to the framework confinements of Apps Script or App Engine.

Enough talk. Let's code!

Rather than taking the standard path, I picked Sigma from SLAppForge as my development framework; primarily because it had some reputation for supporting external dependencies, and taking care of packaging and deploying stuff on my behalf - including all the external services (APIs, tables, crons and whatnot).

First I had to sign up for Sigma. Although I could have gone ahead with their demo feature (the big yellow button), I already had an AWS account and a GitHub account (not to mention an email address); so why not give it a shot?

The sign-up form

When I had completed the registration and logged in, I was greeted with a project selection pane, where I opted for a new project with the name site-monitor:

Creating 'site-monitor'

The app was blazingly fast, and the editor popped up as soon as I hit Create Project:

The Sigma editor

Without further ado, I grabbed the content of my former Apps Script function and dropped it into Sigma!

let AWS = require('aws-sdk');

exports.handler = function(event, context, callback) {

    // Here Goes Nothing

    PROPS = PropertiesService.getScriptProperties();
    page = UrlFetchApp.fetch("http://www.my-favorite-site.com").getResponseText();
    url = page.match(/(lp|\?r=specialTopic)\/[^"]*/)[0];
    if (url != PROPS.getProperty("latest")) {
        GmailApp.sendEmail("janakaud@gmail.com", "MyFavSite Update!", url);
        PROPS.setProperty("latest", url);
    }

    // end of Here Goes Nothing

    callback(null,'Successfully executed');
}

(I know, I know, that didn't work. Bear with me :))

I spent the next several minutes transforming my Apps Script code into NodeJS. It was not that hard (both are JS, after all!) once I had the request module added to my project:

'Add Dependency' button

Searching for 'request' dependency

But I must say I did miss the familiar, synchronous syntax of the UrlFetchApp module.
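For comparison, here's a minimal sketch of the two styles (same placeholder URL as before; in the request version, the body argument of the callback is what getResponseText() used to hand me):

// Apps Script: synchronous
// let page = UrlFetchApp.fetch("http://www.my-favorite-site.com").getResponseText();

// NodeJS with the request module: callback-based
const request = require("request");
request.get("http://www.my-favorite-site.com", (error, response, body) => {
    // 'body' plays the role of getResponseText()
});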

Under Apps Script I had the wonderfully simple PropertiesService to serve as the "memory" of my bot. Under Sigma (AWS), things were not that simple; after some looking around I decided to go with DynamoDB (although I still felt it was way overkill).

Once I had extracted the URL from the page, I needed to check whether I had already notified myself about it; the equivalent of querying my table (formerly the PropertiesService) for an existing entry. In DynamoDB-land this was apparently a Get Document operation, so I tried dragging DynamoDB into my code:

DynamoDB incoming!

Once dropped, the DynamoDB entry transformed into a pop-up where I could define my table and provide the code-level parameters as well. Hopefully Sigma would remember the table configuration, so I wouldn't have to enter it again and again all over my code.

Configuring a new DynamoDB table, and a 'Get Document' operation

Since DynamoDB isn't a simple key-value thingy, I spent a few minutes scratching my head over how to store my "value" in there; eventually I decided on a "document" structure of the form

{
    "domain": "my-favorite-site.com",
    "url": "{the stored URL value}"
}

where I could query the table using a specific domain value for each bot, and hence reuse the table for different bots.
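In code, reading that document back is just a DocumentClient get keyed on the domain; a rough sketch, using the site-data table name that shows up in my final code further below:

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// look up the last URL that was emailed out for this domain
ddb.get({
    TableName: 'site-data',
    Key: { 'domain': 'my-favorite-site.com' }
}, (err, data) => {
    // if data.Item exists, data.Item.url is the previously stored URL
});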

In my old code I had used a GmailApp.sendEmail() call to send myself a notification when I got something new. In Sigma I tried to do the same by dragging and dropping a Simple Email Service (SES) entry:

SES: verifying a new email

Here there was a small hiccup, as it appeared that I would need to verify an email address before I could send anything out. I wasn't sure how bumpy the ride would be; anyway, I entered my email address and clicked Send verification email.

SES

Sure enough, I received a verification link via email which, when clicked, redirected me to a "Verification successful" page.

And guess what: when I switched back to Sigma, the popup had updated itself, stating that the email was verified, and guiding me through the next steps!

Email verified; SES popup automagically updated!

I filled in the details right away (To myself, no CC's or BCC's, Subject MyFavSite Update! and Text Body @{url} (their own variable syntax; I wish it were ${} though)):

SES configuration

In the callback of the SES email sender, I had to update the DynamoDB table to reflect the new entry that was emailed out (so I wouldn't email it again) - just like the PROPS.setProperty("latest", url) call in my original bot.

That was easy, with the same drag-n-drop thingy: choosing the previously created table under Existing Tables, and selecting a Put Document operation with domain set to my-favorite-site.com (my "search query"; the equivalent of "latest" in the old bot) and a url entry set to the emailed URL:

DynamoDB Put Document configuration

Eventually I ended up with a fairly good piece of code (although it was way longer than my dear old Apps Script bot):

let AWS = require('aws-sdk');
const ses = new AWS.SES();
const ddb = new AWS.DynamoDB.DocumentClient();
const request = require("request");

exports.handler = function (event, context, callback) {
    request.get("http://www.my-favorite-site.com",
        (error, response, body) => {
            if (!body) {
                throw new Error("Failed to fetch homepage!");
            }

            let urls = page.match(/(lp|\?r=specialTopic)\/[^"]*/);
            if (!urls) { // nothing found; no point in proceeding
                return;
            }
            let url = urls[0];

            ddb.get({
                TableName: 'site-data',
                Key: { 'domain': 'my-favorite-site.com' }
            }, function (err, data) {
                if (err) {
                    throw err;
                } else {
                    if (!data.Item || data.Item.url != url) {
                        ses.sendEmail({
                            Destination: {
                                ToAddresses: ['janakaud@gmail.com'],
                                CcAddresses: [],
                                BccAddresses: []
                            },
                            Message: {
                                Body: {
                                    Text: {
                                        Data: url
                                    }
                                },
                                Subject: {
                                    Data: 'MyFavSite Update!'
                                }
                            },
                            Source: 'janakaud@gmail.com',
                        }, function (err, data) {
                            if (err) {
                                throw err;
                            }
                            ddb.put({
                                TableName: 'site-data',
                                Item: { 'domain': 'my-favorite-site.com', 'url': url }
                            }, function (err, data) {
                                if (err) {
                                    throw err;
                                } else {
                                    console.log("New URL saved successfully!");
                                }
                            });
                        });
                    } else {
                        console.log("URL already sent out; ignoring");
                    }
                }
            });
        });

    callback(null, 'Successfully executed');
}

Sigma was trying to help me all the way: providing handy editing assistance (code completion, syntax coloring, variable suggestions...), and even highlighting the DynamoDB and SES operations and displaying tiny icons in front of them, which, when clicked, brought up (re)configuration pop-ups similar to what I got when I first drag-dropped them.

DynamoDB operation highlighted, with indicator icon in front

DynamoDB operation edit pop-up

Due to the async, callback-based syntax, I had to move around bits 'n' pieces of my code several times. Sigma handled it pretty well, re-doing the highlighting stuff a second or two after I pasted the code in its new location.

Just for fun, I tried editing the code manually (without using the pop-up) and, sure enough, the pop-up understood the change and updated itself the next time I checked. Pretty neat for a newbie who wants to get stuff done without diving into the docs.

Now, how can I run my bot periodically?

Sigma shows a red lightning sign near the function header, and highlights the event parameter in it - possibly indicating that this is the point of invocation, or triggering, of the lambda.

highlighted 'event' variable on function header, with red lightning icon in front

Yup. Their docs say the same.

The AWS docs, and Sigma's own, pointed me to CloudWatch scheduled event triggers that can fire a lambda on a predefined schedule - like Apps Script triggers, but much more powerful; more like App Engine cron expressions.
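(If you ever need to write one of these by hand: a daily 1 AM run like mine would presumably translate to a schedule expression along the lines of cron(0 1 * * ? *) - minutes, hours, day-of-month, month, day-of-week, year, with times in UTC - or you could use the simpler rate() syntax for fixed intervals. The Sigma pop-up spares you from memorizing either.)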

As mentioned in their docs, I dragged a CloudWatch entry onto the event variable and configured it like so:

CloudWatch trigger configuration

And the whole event thing changed from red to green, possibly indicating that my trigger was set up successfully.

Right. Time to test it out.

The toolbar has a Test (play) button, with a drop-down to select your test case. Like Apps Script, but much better in the sense that you can define the input payload for the invocation (whereas Apps Script just runs the function without any input arguments):

Test button on Sigma toolbar

Test case configuration dialog
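(My handler never reads the event parameter anyway, so presumably even an empty JSON object - {} - would do as the payload.)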

As soon as I configured a test case and hit the run button, the status bar started showing the progress of the run:

Test ongoing!

A few seconds later, a SigmaTrail log output window automagically popped up and started showing some logs:

SigmaTrail automagically pops up when you run a test!

errorMessage:"RequestId: 87c59aba-8822-11e8-b912-0f46b6510aa8 Process exited before completing request"
[7/15/2018][5:00:52 PM] Updating dependencies. This might make runtime longer than usual.
[7/15/2018][5:00:55 PM] Dependencies updated.
[7/15/2018][5:00:57 PM] ReferenceError: page is not defined
at Request.request.get [as _callback] (/tmp/site-monitor/lambda.js:13:24)
at Request.self.callback (/tmp/site-monitor/node_modules/request/request.js:185:22)

Oops, looks like I got a variable name wrong.

A simple edit, and another test.

[7/15/2018][5:04:50 PM] ResourceNotFoundException: Requested resource not found
at Request.extractError (/tmp/site-monitor/node_modules/aws-sdk/lib/protocol/json.js:48:27)
at Request.callListeners (/tmp/site-monitor/node_modules/aws-sdk/lib/sequential_executor.js:105:20)

Hmm, what does that mean?

Looks like this one's coming from the AWS SDK itself.

Maybe the AWS "resources" I had dragged-n-dropped into my app were not yet available on the AWS side; besides, many of the Sigma tutorials mention a "deployment" step before they go into testing.

Oh well, let's try deploying this thing.

Deploy button on Sigma toolbar

I was hoping for a seamless "one-click deploy", but when I clicked the Deploy button I just got a pop-up saying I needed to authenticate with GitHub. Sigma was probably saving my stuff in a GitHub repo and then using it for the rest of the deployment.

'GitHub Authentication' pop-up

Seeing no evil, I clicked the sign-in button and authorized their app on the pop-up window that followed. Within a few seconds, I got another pop-up asking me to pick a repo name and a commit message.

GitHub commit dialog

I didn't have a site-monitor repo in my account, so I was curious to see what Sigma would do. Just as I suspected, a few seconds after clicking Commit, another dialog popped up asking whether I would like Sigma to create a new repo on my behalf.

GitHub repo creation confirmation dialog

Sigma was so kind that it even offered to create a private repository; but alas, I didn't have the luxury, so I just clicked Create Repo and Commit.

From there onwards, things were fairly automated: after the "Successfully committed" notification, there was a lightning-fast "build" step (accompanied by a progress bar in the bottom status pane).

Next I got another pop-up, this time a Changes Summary, which, after a few more seconds, populated itself with a kind of "deployment summary":

Deployment summary

I wasn't much interested in the low-level details (though I did recognize cweOneAM as my cron trigger and siteMonitorLambda as my bot), so I just hit Execute; and this time there was a fairly long wait (accompanied by another progress bar, this time within the pop-up itself).

Deployment progress

Once it hit the 100% mark, Sigma stated that my deployment had completed with a CREATE_COMPLETE state (sounds good!).

Now let's try that testing thing, again.

"Successfully executed"
[7/15/2018][5:39:34 PM] New URL saved successfully!

SigmaTrail: success!

Yay!

Wait, will it resend if I run it again?

"Successfully executed"
[7/15/2018][5:39:41 PM] URL already sent out; ignoring

All good; no duplicates!

Now to check my inbox, to see if Sigma is telling the truth.

Initially I was a bit confused because I didn't actually receive an email; but eventually I found it sitting in my Spam folder (probably because it was sent by a third party (AWS)?), and unmarking it as spam did the trick.

Email received! (confused? I use Gmail Mobile: https://mail.google.com/mail/u/0/x/a)

Hopefully my CloudWatch trigger would fire tomorrow at 1 AM, bringing me the good news, if there is any!

All in all, the graphical IDE is quite slick, and something I could recommend to my colleagues. Except for the deployment time (which I guess is characteristic of serverless apps, or Lambda, or perhaps AWS), I felt almost at home - and even more so with all the nifty features: autocompletion, drag-n-drop, GUI configs, testing, logs, and so forth.

Time for a cuppa coffee, and then to start migrating my other bots to Sigma... um... AWS.