Wednesday, June 24, 2020

File Transfer, MFT and Beyond: The Story of AS2

Originally written for The AS2 Gateway Blog; Jun 17, 2020

This is a work of fiction. The incidents portrayed herein may have taken place under different settings, in a different order, or perhaps not at all; none of that, however, changes the origin and evolution of Applicability Statement 2, more commonly known as AS2, the most popular secure B2B document/EDI exchange protocol in the known universe.

The Beginning: File Transfer, the Primordial Soup

The earliest files were paper-based, dispatched via postal and courier services. They took days - or hours at best - to reach their destinations. Needless to say, this was a major bottleneck in rapidly growing B2B trade ecosystems.

Digital File Transfer

Then came computers and the internet, and things improved dramatically. (There were probably many more intermediates; fax, email, etc.; but we'll skip them for brevity.)

Now the transfers took just minutes - sometimes even seconds - instead of hours or days. Accuracy also improved greatly - no more deliveries to the wrong doorstep. Protocols like FTP and email soon became the de facto standards for digital data exchange and file transfer.

But still, not everything was solid and foolproof:

Acknowledgement and Integrity

There was no receipt or acknowledgement/confirmation; you could send a PO, wait several days for the invoice, and finally phone them to find out that they had somehow missed the file - still lying in a corner on their FTP server.

Also, the network did not guarantee integrity (delivery "as a whole"); you could have sent a 1000-row file, and the last 10 rows - or even 10 random ones from all over the place - could have gotten "lost". Even worse, a random flip of a few bytes could convert some five-digit price tag into a nine-digit one - I won't even start on the consequences.

Receipts and Hashes

So people and systems started to acknowledge these transfers with digital receipts; if you sent a file and received an acknowledgement (possibly another file) in return, you had a guarantee (non-repudiation) that the trading partner (or his system) had seen the file and taken it up for processing.

Now that there was two-way communication, there was also a way to verify that the receiver got exactly what the sender sent out (integrity); the receiver would calculate a hash (a finite-length summary; think of finding the "lucky number" from your birthday) of the received file, and send it back in the receipt. The sender could then do the same on the original file she sent, and verify that the two match - which would mean the content on both sides is practically identical (down to just a subatomic margin of error).
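As a rough illustration (not tied to any particular protocol spec; the file name is hypothetical), here is how either side could compute such a hash in NodeJS:

const crypto = require("crypto");
const fs = require("fs");

// hypothetical file; both sender and receiver digest the exact same content
const content = fs.readFileSync("orders.csv");
const hash = crypto.createHash("sha256").update(content).digest("hex");
console.log(hash); // the receiver sends this back in the receipt; the sender compares it with her own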

Two problems solved; but still, there were more serious ones lurking around.

Eavesdropping, Tampering and Spoofing

File content was flowing through the (inherently insecure) internet, in cleartext. This meant that any of your adversaries could eavesdrop on the files you were sending - and worse, even modify (tamper with) the data being transmitted, to sabotage or manipulate your transactions.

True, you are already exchanging and validating the hash for the sent/received content; but a decent man-in-the-middle could easily pause the original data flow, make his modification, and generate a new hash that would match the modified data.

(Or, if he had complete control over the data flow, he could simply intercept the partner's receipt as well, replace the hash with the old one, and relay it back to you, the unsuspecting victim - who would naturally assume everything went along just fine.)

On the other hand, when it came to receiving files or receipts back from your partner, you had to somehow make sure that it was actually your partner sending that data in the first place - not something spoofed by a malicious third party.

Digital Signature: for Tamper-proof Integrity

What we needed here was simple: a way for you (sender) to place a "seal" on the file (very much like the wax seals from the old snail-mail days) and for your partner (recipient) to verify that the seal was not broken while things were in transit.

So the digital signature was born. Basically you take the hash of the file (as before), encrypt it with your private key, and send it alongside; and publish the corresponding public key (certificate) so that other people - usually your partners - can decrypt (verify) that blurb of data upon receipt.
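As a minimal sketch (using NodeJS crypto and hypothetical key/file names; real AS2 wraps all of this inside S/MIME structures), signing and verification look roughly like this:

const crypto = require("crypto");
const fs = require("fs");

// sender: hash the file and "seal" the hash with her private key
const myPrivateKey = fs.readFileSync("my-private-key.pem", "utf8");
const file = fs.readFileSync("orders.csv");
const signature = crypto.createSign("RSA-SHA256").update(file).sign(myPrivateKey);

// receiver: check the seal using the sender's published public key
const senderPublicKey = fs.readFileSync("sender-public-key.pem", "utf8");
const intact = crypto.createVerify("RSA-SHA256").update(file).verify(senderPublicKey, signature);
console.log(intact ? "seal intact" : "tampered or forged");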

This was a rather smart move! It achieved many goals:

  • Since only you had the private key, the partner now had a strong guarantee that the file came from you - and you couldn't refute that either, even if you wanted to; authentication and non-repudiation.
  • The "safely transmitted" hash guaranteed the file's intactness; integrity.
  • Even if there was no anti-tamper encryption (next section) in place, if anyone managed to modify the file and generate a new matching hash, they would not be able to generate a verifiable signature out of it - without breaking into your private space and stealing your very own private key.

Good; now if we could just make the file unreadable to prying eyes, in transit...

Encryption: for Invisibility (Confidentiality)

As the title suggests, encrypting the file content makes it invisible (more correctly, incomprehensible) while it flows through the internet. Which means there is no room for the adversary to read or modify the file content.

Of course there had already been 'secure' protocols like SCP, SFTP and FTP/S that would transfer encrypted content (ciphertext) instead of cleartext; however, you still had to make sure that you were connecting and sending to the correct remote partner/system; jumping over traps like DNS spoofing.

But with AS2 encryption, even if you connected to a wrong server, your content was encrypted using a certificate (public key) that was pre-shared with you by the real partner - and not the certificate of the currently connected server. So, even if you did accidentally send the file to a hacker, they could not possibly decrypt it without access to the partner's own secret private key - which never comes out in the open, throughout the whole process.
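Conceptually (and very loosely - the real thing happens inside S/MIME enveloping; key and file names here are hypothetical), the sending side does something along these lines:

const crypto = require("crypto");
const fs = require("fs");

const partnerPublicKey = fs.readFileSync("partner-public-key.pem", "utf8"); // pre-shared by the real partner
const payload = fs.readFileSync("orders.csv");

// encrypt the payload with a one-time symmetric key (RSA alone cannot handle large content),
// then protect that key with the partner's RSA public key
const sessionKey = crypto.randomBytes(32);
const iv = crypto.randomBytes(16);
const cipher = crypto.createCipheriv("aes-256-cbc", sessionKey, iv);
const encryptedPayload = Buffer.concat([cipher.update(payload), cipher.final()]);
const encryptedKey = crypto.publicEncrypt(partnerPublicKey, sessionKey);
// only the holder of the matching private key can recover sessionKey - and hence the payload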

AS2: there you are!

What we (or rather, many ingenious people over several decades) have invented so far adds up to the AS2 protocol - which happens to be the most popular secure file transfer protocol today; especially after Walmart the Giant mandated it for all its trading partners back in 2002.

You can learn more about the protocol from its official RFC 4130 document, which is very informative and fairly simple to understand.

Having said that, where are we now?

All set; file transfer is secure! But...

If you scroll up towards the "Digital file transfer" section you would realize that, while we have achieved a remarkable level of security, AS2 has also added a lot of overhead and complexity to our business flow:

AS2: secure, but complex!

Let's see what it takes:

To send one batch of files

  1. share the connectivity data and encryption (and signature) certificates with your trading partner (thankfully, this is a one-time task - for each trading partner, that is)
  2. combine the file(s) you wish to transfer, into one big MIME message
  3. compress the MIME (to reduce its size and save transfer time/bandwidth; could be significant depending on the file data)
  4. compute the hash of the data blurb at hand
  5. sign the blurb with your private key (and keep it secure)
  6. encrypt the whole thing with your partner's public key
  7. transfer the encrypted data over to your partner's system
  8. wait for the receipt (MDN) from the partner
  9. check the signature of the MDN against the partner's public key (to make sure it wasn't forged by someone else - and hasn't been tampered with)
  10. process the MDN to see if everything went fine - or whether there were any errors on the partner's end (opening up a whole new pathway for troubleshooting)
  11. check the hash (message integrity code, MIC) sent in the MDN against what you calculated in the beginning, to make sure that the partner received the same data that you sent out - untampered
  12. if the MDN was asynchronous, acknowledge it properly to close out the connection (and with it, the transaction)

And when receiving a batch of files:

The situation is just as complicated:

  1. share the connectivity data and certificates (one-time per partner)
  2. keep on listening or checking for incoming messages from your partner
  3. when one is received, accept and load the incoming blurb of data
  4. decrypt the data using your own private key
  5. extract out and verify the signature part using your partner's public key - to ensure integrity
  6. decompress the data part to get to the MIME of files
  7. decompose the MIME into original files sent by your partner
  8. compute the MIC over the actual data
  9. compose an MDN with the MIC, and any errors you encountered during the process so far
  10. sign the MDN with your private key - for authenticity
  11. send the MDN back to your partner, and ensure they received it

Obviously, too much of a technical burden for an already-complicated business flow, and a team that is already overwhelmed with a thousand and one other business-critical tasks and needs.

There are tools like the openssl suite that can take good care of individual steps; but chaining them up to fulfill the overall flow is far from trivial, even for a tech-savvy team; and most business teams cannot afford such a dedicated subdivision of techies in the first place.

Managing the transfers: more overhead

While it solves the security aspect and provides an end-to-end guarantee of the file transfer, AS2 by itself does not fully address the practical caveats of file transfer, especially over the unreliable internet:

  • If the transfer fails mid-way - or the partner system reports an error, confirming that it couldn't get hold of the files - you need to replay the transfer.
  • If the transfer itself is successful but you don't receive a receipt (MDN) after a considerable wait, a replay may be required as well.

Furthermore:

  • As your business grows, you should be able to automate the AS2 transfer aspect as well - rather than manually submitting each file or inspecting every MDN receipt.
  • Going with the ever-growing trend of digital transformation, file transfer also needs to become a well-streamlined piece of the overall business data flow in your organization.
  • Setting up and juggling with multiple trading partners, with dozens of certificates, endpoints and pieces of configuration, can soon become daunting and error-prone.
  • Auditing and keeping track of transfer history and failures could be important for QoS/reliability assessments as well as business-level audits.

And that is why many new as well as well-established businesses are turning towards:

Managed AS2 and MFT Solutions

Managed File Transfer collectively refers to addressing the pain points we discussed so far:

  • All the security and end-to-end delivery guarantees of AS2 (or another MFT protocol)
  • Easy and simplified configuration management for multiple trading partners and local identities (stations)
  • Enhanced delivery and visibility, such as automatic retries and notifications in case of permanent failures
  • Enterprise-ready: ability to easily integrate with your existing business systems via popular mechanisms like APIs, queues, webhooks, secondary file transfer protocols, cloud services etc.

Such MFT systems can range from shared (multi-tenant) SaaS platforms to dedicated deployments on cloud or on-premise infrastructure.

The choice depends mainly on:

  • your expected file exchange load a.k.a. volume (distribution and numbers per minute/hour/day, average size of one file, etc.)
  • the quality of service (QoS) that you expect (for example, will a couple minutes' downtime - fairly common in shared, multi-tenant SaaS platforms - critically affect your business? If so, a dedicated deployment with a fail-over/high-availability (HA) server is the way to go.)
  • how deeply you wish to integrate with or customize the platform; a regular SaaS would no longer make sense beyond a certain point - in contrast, a dedicated deployment would always offer the highest and fastest level of adaptability
  • your organization policies, and security/compliance requirements (e.g. PCI, SOC2, GDPR)
  • and, of course, your budget

Where to Go from Here

AS2 Gateway is a simple yet mature AS2-based MFT platform, nurtured and trusted over a decade of production use by customers all over the globe.

In addition to an intuitive, email-like interface for exchanging files with your trading partners, it offers automation mechanisms like SFTP and a REST API that will closely integrate with your own order processing system and business workflow. More integration methods like webhooks and cloud storage like AWS S3 are already on the way.

Getting started with the AS2 Gateway SaaS (cloud-hosted) platform is pretty easy; there is comprehensive documentation, demo videos and troubleshooting references to guide you every step of the way. Besides, the friendly AS2 Gateway team is ready to help you at any hour of the day.

If your use case calls for a dedicated installation, with higher messaging capacity, cloud or on-premise hardware, and custom integrations with your own systems, it is just a quick call or form-fill away.

Get Notified on Slack for your CloudWatch/Lambda Error Logs: The Sigma Way

Originally written for The SLAppForge Blog; Jun 22, 2020

Slack is already the de-facto "channel" for team collaboration as well as low-to-moderate scale event monitoring and notifications. When we started working on the world's first serverless MFT platform, we decided we should go down the Slack path for first-level monitoring; similar to how we already do it efficiently with our conventional AS2 Gateway.

Not really real-time - and no need to be

When you mention Slack, real-time usually comes to mind; however, when it comes to monitoring, real-time often becomes a PITA. Especially when a new user is trying out our platform and making tons of mistakes; or when an AS2 partner endpoint goes down, and our outgoing messages start failing by the dozen every second (all of them are accounted for, so there's really nothing to do on our end).

Of course the correct way to handle these critical ("database is down!") vs non-critical (temporary endpoint failure, like above) errors, is via proper prioritization on the reporting end. However, until we have such a mechanism in place, we wanted to receive batched summaries of errors; at a high-enough frequency so that we can still act on them in time (before the user abandons us and moves on!).

Serverless logging: a natural candidate for batched reporting

In AS2 Gateway we didn't have much luck with batching (yet), because the Slack publisher was a standard Log4J2 appender. If we wanted to batch things up, we would have to do it ourselves within the appender - bringing in queueing, and dragging in a host of other issues along with it.

But in MFT Gateway everything is distributed and abstracted out - Lambda, API Gateways, Step Functions, and so on. There is no badass "Slack appender"; all logs go into one sink, CloudWatch, neatly arranged across log groups and streams.

So we needed to come up with a relatively passive system (unlike before, where the appender - part of the application itself - did the alerting for us as soon as the events occurred). Either CloudWatch had to push events to us, or we had to pull events from it.

Some native, push-based alternatives - that (alas) didn't work

In fact, we initially researched possible ways to get CloudWatch - or one of its supported sinks a.k.a. downstreams - to push alerts to us; but that didn't turn out as well as we had hoped:

  • CloudWatch Alarms are state-transition-based; once the alarm "rings" (goes red) and fires an alert, it won't ring again until the system returns to an "okay" (green) state. The obvious caveat is that, if multiple errors occur over a short period of time, we would get notified only for the first one. Consequently, if we did not want to risk missing any errors, we would have to keep on monitoring the logs after each alarm - until the alarm itself goes back green. (We could set up multiple alarms for different kinds of errors; but that's not very scalable with application evolution - and not budget-friendly at all, at $0.30 per high-resolution alarm).
  • Triggering a Lambda Function via CloudWatch Logs was also an option. But it didn't make much sense (neither from a scalability nor a financial perspective) because it didn't provide a sufficient batching scope (let's say it was "too real-time"); if our original (application) Lambda produced 100 error logs within a minute, we could potentially end up with 100 monitoring Lambda invocations! (And I'm pretty sure it wasn't offering filtering either, the last time we checked 🤔 which would have been a disaster, needless to say.)
  • Draining Logs to Kinesis was another option; however this involves the hourly charges of a Kinesis stream, in addition to the need for an actual processing component (perhaps a Lambda, again) to get alerts into Slack.
  • Aggregating Logs into Elasticsearch was the final one; obviously we didn't want to take that path because it would shamefully add a server component to our serverless platform (true, with AWS it's a managed service; but having to pay an hourly fee took that feeling away from us 😎); besides, running Elasticsearch just to get error alerts sounded a bit overkill as well.

Lambda pull: final call

So we ended up with an old-school, pull-based approach - with the Lambda running once every few minutes, polling a set of log groups for WARN and ERROR logs over the last few minutes (incidentally, since the previous run of the same Lambda), and sending these (if any) to our Slack channel in one big batched-up "report" message.

Good news: you can deploy it on your own AWS account - right now!

We used our own dog food, the Sigma IDE - the one and only serverless IDE, purpose-built for serverless developers. We kept all necessary configuration parameters in environment variables, so that you can simply fill them up with values matching your own infra - and deploy it to your own AWS account in one click.

But the best part is...

...you don't need anything else to do it!

I bet you're reading this article on a web browser;

  1. fire up Sigma in the next tab,
  2. log in and provide your AWS credentials, if not already done,
  3. open the ready-made project by simply entering its GitHub URL https://github.com/janakaud/cloudwatch-slack-alerts,
  4. fill in the environment variables, and
  5. click Deploy Project!

Zero tools to download, packages/dependencies to install, commands to run, config files to edit, bundles to zip-up and upload, ...

But for the inquisitive reader, doesn't that kill the fun part?

Even so, some of you might be curious as to how it all works. Curiosity kills cats, but after all, that's what makes us builders and developers - wouldn't you say? 😎

The Making-of: a ground-up overview of the Slack alert reporter

First, a Word of Warning

The permission customization feature used here may not be available for free accounts. However as mentioned before, you can still open the ready-made project from GitHub and deploy it right away - and tweak _any other_ aspects (code, libraries, trigger frequency, function configs etc.) - using a free Sigma account. We do have plans to open up the permission manager and template editor to the free tier; so, depending on the time you are reading this, you may be lucky!

If you need to "just get the damn thing deployed", skip back to the Good news: section - right away!

Preparation

  1. Integrate the Incoming Webhooks app with your Slack workspace.
  2. When integration is complete, grab the webhook integration URL from the final landing page.
  3. Create a suitable Slack channel to receive your alerts.
  4. Make a list of log group names that you need to check for alert-worthy events.
  5. Decide on a pattern to check (filter) the logs for alerts - you can use any of the standard syntaxes supported by CloudWatch Logs API.
  6. Decide how frequently you want to poll the log groups - this will go into the Lambda trigger as well as a code-level "look-behind" parameter which we'll use when calling the API.

The plan

  1. Set up a CloudWatch Events scheduled trigger to run the Lambda at a desired time period
  2. Parallelly poll each log group for new logs matching our pattern, within the look-behind window - which should ideally be the same as the trigger period (e.g. if your Lambda runs once every 5 minutes, checking the last 5 minutes of logs in each cycle should suffice); see the quick sketch after this list.
  3. Filter out empty results and build a string "report" with events categorized by log group name
  4. Post the report to the Slack channel via the webhook.
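The look-behind window itself boils down to something as simple as this (a sketch; the variable names match the snippets further below, but the GitHub project remains the reference):

// POLL_PERIOD_MS should carry the same value as the trigger rate (e.g. 300000 for 5 minutes)
const pollPeriod = parseInt(process.env.POLL_PERIOD_MS, 10);
const startTime = Date.now() - pollPeriod;  // passed to the CloudWatch Logs query
const timeStr = new Date().toISOString();   // used to prefix the final Slack report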

Before you start coding

  1. Sign in to Sigma IDE and create a new NodeJS AWS project. (We could have used Python as well, but would have had to handle _parallelism_ ourselves.)
  2. If you don't like the default file name (which gets opened as soon as you land in the editor), change it - and its Function Name - to something else.
  3. Unless you like the native http module, add a dependency to make the webhook call; we picked axios.

Environment variables

Add a few of 'em, as follows:

  • POLL_PERIOD_MS: the "look-behind" period for the log query (in millis) - for the current implementation, it should be the same as what you set for the period of the timer trigger (below).
  • LOG_GROUPS: a space-separated list of log groups that you need to check; if a group is not prefixed with a namespace (e.g. /aws/apigateway/), it will default to the /aws/lambda/ prefix
  • LOG_PATTERN: the search pattern to "filter in" the alert-worthy logs; ?ERROR ?WARN ("at least one of ERROR and WARN") could be good enough to capture all errors and warnings (depends on your internal logging formats, of course)
  • SLACK_WEBHOOK_URL: speaks for itself; the webhook URL you grabbed during Preparation
  • SLACK_CHANNEL: again, trivial; the "hashname" (maybe #bada_boom) of the channel you created earlier
  • SLACK_USER: the name of the user (bot) that the alerts would appear to be coming from

There are other cool features supported by the Incoming Webhooks integration; a few small tweaks in the webhook-invocation part, and you could be swimming in 'em right away.

Except for the first one, you may want to prevent the values of these variables from being persisted into version control when you save the project; in Sigma, you can make them non-persistent with a simple flip of a switch.

When you reopen the project after a decade, Sigma will automagically pull them in from the deployed Lambdas, and populate the values for you - so you won't need to rack your notes (or brains) to recall the values either!

The timer (CloudWatch scheduled) trigger

  1. Drag-n-drop a CloudWatch Events trigger from the Resources pane on the left, on to the event variable of the function header.
  2. Under the Schedule tab of the New Rule, enter the desired run frequency as a rate expression (e.g. rate(5 minutes)). You can also use a cron expression if desired, but it may be a bit trickier to compute the look-behind window in that case.
  3. Click Inject.

And now, the cool stuff - the code!

Let's quickly go through the salient bits and pieces:

Poll the log groups asynchronously

This transforms any returned log events (matching our time range and filter pattern) into a formatted "report string":

logGroupName

some log line matching ERROR or WARN (in the group's own logging format)

another line

another one

...

and returns it (or null if nothing was found):

// if not namespaced, add default Lambda prefix
let logGroupName = g.indexOf("/") < 0 ? `/aws/lambda/${g}` : g;

let msg, partial = true;
try {
	let data = await logs.filterLogEvents({
		logGroupName,
		filterPattern,
		startTime,
		limit: 100
	}).promise();
	msg = data.events.map(e => e.message.substring(0, 1000).trim()).join("\n\n");
	partial = !!data.nextToken;
} catch (e) {
	msg = `Failed to poll ${g}; ${e.message}`;
}
return msg.length > 0 ? `\`${g}${partial ? " [partial]" : ""}\`

\`\`\`
${msg}
\`\`\`` : null;

Fan-out and aggregate across log groups

We can poll each log group independently to save time, and merge everything into a final, time-prefixed report - ready to post to Slack:

let checks = groups.map(async g => {
	// the code we saw above
});

return await Promise.all(checks)
	.then(msgs => {
		let valid = msgs.filter(m => !!m);
		if (valid.length > 0) {
			postMsg(`*${timeStr}*

${valid.join("\n\n")}`);
		}
	})
	.catch(e => postMsg(`*FAULT* ${timeStr}

\`\`\`
${e.message}
\`\`\``));
};

Post it to the Slack webhook

const postMsg = text => {
	return axios.post(hook, {
		channel, username,
		text,
		icon_emoji: ":ghost:"
	});
};

Putting it all together

Throw in the imports, client declarations, a bit of sanity (esp. around environment variable loading), and some glue - and you have your full-blown Lambda!
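As a rough idea of what that glue looks like (a sketch only; variable names match the snippets above, but the GitHub project is the authoritative version):

const AWS = require("aws-sdk");
const axios = require("axios");

const logs = new AWS.CloudWatchLogs();

// configuration pulled from the environment variables described earlier
const groups = (process.env.LOG_GROUPS || "").split(/\s+/).filter(g => g.length > 0);
const filterPattern = process.env.LOG_PATTERN || "?ERROR ?WARN";
const hook = process.env.SLACK_WEBHOOK_URL;
const channel = process.env.SLACK_CHANNEL;
const username = process.env.SLACK_USER;
if (!hook || groups.length === 0) {
	throw new Error("SLACK_WEBHOOK_URL and LOG_GROUPS are mandatory");
}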

See it in GitHub - in all its raw glamour.

CloudWatch Logs IAM permissions

Since our Lambda is going to access CloudWatch Logs API (for which Sigma does not auto-generate permissions yet), we need to add a permission entry to the Lambda's IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": [
        {
          "Fn::Sub": "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*"
        }
      ],
      "Action": [
        "logs:FilterLogEvents"
      ]
    }
  ]
}

In this case we've granted access to all log groups; depending on your monitoring scope you may be able to further narrow it down using the ARN pattern on the Resource entry.
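For example, if (hypothetically) you only monitor Lambda log groups, the Resource entry could be narrowed down to something like:

      "Resource": [
        {
          "Fn::Sub": "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/*"
        }
      ]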

If you're following along on the free tier, the Custom Permissions tab may not be editable for you - depending on when you're reading this. Regardless, as mentioned before, that won't prevent you from opening, modifying and deploying the already-tuned-up project from GitHub!

Deploy it!

That one's easy - just click the Deploy Project button on the IDE toolbar!

Within a few seconds, Sigma will show you a deployment progress dialog, and give the green light ("successfully deployed") shortly afterwards - usually in under a minute.

Okay, the Slack reporter is live. And now...

All that's left to do is to sit back and relax - while the Lambda churns your CloudWatch logs and alerts you of any fishy stuff that it finds!

Bonus: some troubleshooting tips, just in case

If you don't receive anything on Slack (even when clearly there are error logs within the last few minutes), or notice some other weird behavior (like duplicate/repeated alerts), it may be time to check the logs of the log-polling Lambda itself 😎

You can just pop open the in-IDE SigmaTrail log viewer and view the latest execution logs ([PROD]) of the Lambda, or use the CLI (aws logs filter-log-events --log-group-name /aws/lambda/YourLambdaNameHere) or the web console to view 'em officially from the AWS side.

The default code will log a line during each run, indicating the starting timestamp (from which point in time it is checking for new logs, within the current cycle); this could be handy to determine when the function has run, and what "log ground" it has covered.

Stay in touch!

Lambda is cool, and serverless is super-cool - given all the things you can accomplish, even without using Lambda!

So stay tuned for more serverless goodies - from the pioneers in serverless dev-tools 😎

One Bite of Real-world Serverless: Controlling an EC2 with Lambda, API Gateway and Sigma

Originally written for The SLAppForge Blog; Jun 19, 2020

I have been developing and blogging about Sigma, the world's first serverless IDE for serverless developers - but haven't really been using it for my non-serverless work. That was why, when a (somewhat) peculiar situation came up recently, I decided to give Sigma a full-scale spin.

The Situation: a third party needs to control one of our EC2 instances

Our parent company AdroitLogic sells an enterprise B2B messaging platform called AS2 Gateway - which comes as a simple SaaS subscription as well as an on-premise or cloud-installable dedicated deployment. (Meanwhile, part of our own team is also working on making it a completely serverless solution - we'll probably be bothering you with a whole lotta blog posts on that too, pretty soon!)

One of our potential clients needed a customized copy of the platform, first as a staging instance in our own AWS account; they would configure and test their integrations against it, before deciding on a production deployment - under a different cloud platform of their choice, in their own realm.

Their work time zone is several hours ahead of ours; keeping aside the clock skew on emails and Zoom calls, the staging instance had to be made available during their working hours, not ours.

Managing the EC2 across time zones: the Options

Obviously, we did have a few choices:

  • keep the instance running 24/7, so our client can access it anytime they want - obviously the simplest but also the costliest choice. True, one hour of EC2 time is pretty cheap - less than half a dollar - but it tends to add up pretty fast; while we continue to waste precious resources on a mostly-idling EC2 VM instance.
  • get up at 3 AM (figure of speech) every morning and launch the instance; and shut it down when we sign off - won't work if our client wishes to work late nights; besides they don't get the chance to do the testing every day, so there's still room for significant waste
  • fix up some automated schedule to start and stop the instance - pretty much the same caveats as before (minus the "getting up at 3 AM" part)
  • delegate control of the instance to our client, so they can start and stop it at their convenience

Evidently, the last option was the most economical for us (remember, the client is still in evaluation stage - and may decide not to go with us, after all), and also fairly convenient for them (just two extra steps, before and after work, plus a few seconds' startup delay).

Client-controlled EC2: how to KISS it, the right way

But on the other hand, we didn't want to overcomplicate the process either:

  • Giving them access to our AWS console was out of the question - even with highly constrained access.
  • A key pair with just ec2:StartInstances and ec2:StopInstances IAM permissions on the respective instance ID, would have been ideal; but it would still mean they would have to either install the AWS CLI, or write (or run) some custom code snippets every time they wanted to control the instance.
  • AWS isn't, and wasn't going to be, their favorite cloud platform anyway; so any AWS-specific steps would have been an unnecessary overhead for them.

KISS, FTW!

Serverless to the rescue!

Most probably, you are already screaming out the solution: a pair of custom HTTP (API Gateway) endpoints backed by dedicated Lambdas (we're thinking serverless, after all!) that would do that very specific job - and have just that permission, nothing else, keeping with the preached-by-everybody, least privilege principle.

Our client would just have to invoke the start/stop URL (with a simple, random auth token that you choose - for extra safety), and EC2 will obey promptly.

  • No more AWS or EC2 semantics for them,
  • our budget runs smooth,
  • they have full control over the testing cycles, and
  • I get to have a good night's sleep!

ec2-control: writing it with Sigma

There were a few points in this project that required some advanced voodoo on Sigma's side:

  • Sigma does not natively support EC2 APIs (why should it; it's supposed to be for serverless computing 😎) so, in addition to writing the EC2 SDK calls, we would need to add a custom permission for each function policy; to compensate for the automatic policy generation aspect.
  • The custom policy would need to be as narrow as possible: just ec2:StartInstances and ec2:StopInstances actions, on just our client's staging instance. (If the URL somehow gets out and some remote hacker out there gains control of our function, we don't want them to be able to start and stop random - or perhaps not-so-random - instances in our AWS account!)
  • Both the IAM role and the function itself, would need access to the instance ID (for policy minimization and the actual API call, respectively).
  • For reusability (we devs really love that, don't we? 😎) it should be possible to specify the instance ID (and the auth token) on a per-deployment basis - without embedding the values in the code or configurations, which would get checked into version control.

Template Editor FTW

Since Sigma uses CloudFormation under the hood, the solution is pretty obvious: define two template parameters for the instance ID and token, and refer them in the functions' environment variables and the IAM roles' policy statements.

Sigma does not natively support CloudFormation parameters (our team recently started working on it, so perhaps it may actually be supported at the time you read this!) but it surely allows you to specify them in your custom deployment template - which would get nicely merged into the final deployment template that Sigma would run.

Some premium bad news, and then some free good news

At the time of this writing, both the template editor and the permission manager were premium features of Sigma IDE. So if you start writing this on your own, you would either need to pay a few bucks and upgrade your account, or mess around with Sigma's configuration files to hack those pieces in (which I won't say is impossible 😎).

(After writing this project, I managed to convince our team to enable the permission manager and template editor for the free tier as well 🤗 so, by the time you read this, things may have taken a better light!)

But, as part of the way that Sigma actually works, not having a premium account does not mean that you cannot deploy an already template- or permission-customized project written by someone else; and my project is already in GitHub so you can simply open it in your Sigma IDE and deploy it, straightaway.

"But how do I provide my own instance ID and token when deploying?"

Patience. Read on.

"Old, but not obsolete" (a.k.a. more limitations, but not impossible)

As I said before, Sigma didn't natively support CloudFormation parameters; so even if you add them to the custom template, Sigma would just blindly merge and deploy the whole thing - without asking for actual values of the parameters!

While this could have been a cause for deployment failures in some cases, lucky for us, here it doesn't cause any trouble. But still, we need to provide correct, custom values for that instance ID and protection token!

Amazingly, CloudFormation allows you to just update the input parameters of an already completed deployment - without having to touch or even re-submit the deployment template:

aws cloudformation update-stack --stack-name Whatever-Stack \
  --use-previous-template --capabilities CAPABILITY_IAM \
  --parameters \
  ParameterKey=SomeKey,ParameterValue=SomeValue ...

(That command is already there, in my project's README.)

So our plan is simple:

  1. Deploy the project via Sigma, as usual.
  2. Run an update from CloudFormation side, providing just the correct instance ID and your own secret token value.

Enough talk, let's code!

Warning: You may not actually be able to write the complete project on your own, unless we have enabled custom template editing for free accounts - or you already have a premium account.

If you are just looking to deploy a copy on your own, simply open my already existing public project from https://github.com/janakaud/ec2-control - and skip over to the Ready to Deploy section.

1. ec2-start.js, a NodeJS Lambda

Note: If you use a different name for the file, your custom template would need to be adjusted - don't forget to check the details when you get to that point.

const {ec2} = require("./util");
exports.handler = async (event) => ec2(event, "startInstances", "StartingInstances");

API Gateway trigger

After writing the code,

  1. drag-n-drop an API Gateway entry from the left-side Resources pane, on to the event variable of the function,
  2. enter a few details -
    1. an API name (say EC2Control),
    2. path (say /start, or /ec2/start),
    3. HTTP method (GET would be easiest for the user - they can just paste a link into a browser!)
    4. and a stage name (say prod)
  3. under Show Advanced, turn on Enable Lambda Proxy Integration so that we will receive the query parameters (including the auth token) in the request
  4. and click Inject.

Custom permissions tab

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": {
                "Fn::Sub": "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2ID}"
            },
            "Action": [
                "ec2:StartInstances"
            ]
        }
    ]
}

2. ec2-stop.js, a NodeJS Lambda

Note: As before, if your filename is different, update the key in your custom template accordingly - details later.

const {ec2} = require("./util");
exports.handler = async (event) => ec2(event, "stopInstances", "StoppingInstances");

API Gateway trigger

Just like before, drag-n-drop and configure an APIG trigger.

  1. But this time, make sure that you select the API name and deployment stage via the Existing tabs - instead of typing in new values.
  2. Resource path would still be a new one; pick a suitable pathname as before, like /ec2/stop (consistent with the previous).
  3. Method is also your choice; natural is to stick to the previously used one.
  4. Don't forget to Enable Lambda Proxy Integration too.

Custom permissions tab

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": {
                "Fn::Sub": "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2ID}"
            },
            "Action": [
                "ec2:StopInstances"
            ]
        }
    ]
}

3. util.js, just a NodeJS file

const ec2 = new (require("aws-sdk")).EC2();

const EC2_ID = process.env.EC2_ID;
if (!EC2_ID) {
    throw new Error("EC2_ID unavailable");
}
const TOKEN = process.env.TOKEN;
if (!TOKEN) {
    throw new Error("TOKEN unavailable");
}

exports.ec2 = async (event, method, resultKey) => {
    let tok = (event.queryStringParameters || {}).token;
    if (tok !== TOKEN) {
        return {statusCode: 401};
    }
    let data = await ec2[method]({InstanceIds: [EC2_ID]}).promise();
    return {
        headers: {"Content-Type": "text/plain"},
        body: data[resultKey].map(si => `${si.PreviousState.Name} -> ${si.CurrentState.Name}`).join("\n")
    };
};

The code is pretty simple - we aren't doing much: just validating the incoming token, calling the EC2 API, and returning the state transition result (e.g. running -> stopping) back to the caller as confirmation; for instance, it will appear in our client's browser window.

(If you were wondering why we didn't add aws-sdk as a dependency despite require()ing it; that's because aws-sdk is already available in the standard NodeJS Lambda environment. No need to bloat up our deployment package with a redundant copy - unless you wish to use some cutting-edge feature or SDK component that was released just last week.)

The better part of the coordinating fat and glue is in the custom permissions and the template:

4. Custom template

{
  "Parameters": {
    "EC2ID": {
      "Type": "String",
      "Default": ""
    },
    "TOKEN": {
      "Type": "String",
      "Default": ""
    }
  },
  "Resources": {
    "ec2Start": {
      "Properties": {
        "Environment": {
          "Variables": {
            "EC2_ID": {
              "Ref": "EC2ID"
            },
            "TOKEN": {
              "Ref": "TOKEN"
            }
          }
        }
      }
    },
    "ec2Stop": {
      "Properties": {
        "Environment": {
          "Variables": {
            "EC2_ID": {
              "Ref": "EC2ID"
            },
            "TOKEN": {
              "Ref": "TOKEN"
            }
          }
        }
      }
    }
  }
}

Note: If you used some other/custom names for the Lambda code files, two object keys (ec2Start, ec2Stop) under Resources would be different - it's always better to double-check with the auto-generated template and ensure that the merged template also displays the properly-merged final version.

Deriving that one on your own isn't total voodoo magic either; after writing the rest of the project, just have a look at the auto-generated template tab, and write up a custom JSON - whose pieces would merge themselves into the right places, yielding the expected final template.

We accept the EC2ID and TOKEN as parameters, and merge them into the Environment.Variables property of the Lambda definitions. (The customized IAM policies are already referencing the parameters via Fn::Sub so we don't need to do anything for them here.)

Once we have the template editor in the free tier, you would certainly have much more cool concepts to play around with - and probably also figure out so many bugs (full disclaimer: I was the one that initially wrote that feature!) which you would promptly report to us! 🤗

Ready to Deploy

When all is ready, click Deploy Project on the toolbar (or Project menu).

(If you came here on the fast-track (by directly opening my project from GitHub), Sigma may prompt you to enter values for the EC2_ID and TOKEN environment variables - just enter some dummy values; we are changing them later anyways.)

If all goes well, Sigma will build the project and deploy it, and you would end up with a Changes Summary popup with an outputs section at the bottom containing the URLs of your API Gateway endpoints.

If you accidentally closed the popup, you can get the outputs back via the Deployment tab of the Project Info window.

Copy both URLs - you would be sending these to your client.

Sigma's work is done - but we're not done yet!

Update the parameters to real values

Grab the EC2-generated identifier of your instance, and find a suitable value for the auth token (perhaps a uuid -v4?).

Via AWS CLI

If you have AWS CLI - which is really awesome, by the way - the next step is just one command; as mentioned in the README as well:

aws cloudformation update-stack --stack-name ec2-control-Stack \
  --use-previous-template --capabilities CAPABILITY_IAM --parameters \
  ParameterKey=EC2ID,ParameterValue=i-0123456789abcdef \
  ParameterKey=TOKEN,ParameterValue=your-token-goes-here

(If you copy-paste, remember to change the parameter values!)

We tell CloudFormation "hey, I don't need to change my deployment definitions but want to change the input parameters; so go and do it for me".

The update usually takes just a few seconds; if needed, you can confirm its success by calling aws cloudformation describe-stacks --stack-name ec2-control-Stack and checking the Stacks.0.StackStatus field.

Via the AWS Console

If you don't have the CLI, you can still do the update via the AWS Console; while it is a bit overkill, the console provides more intuitive (and colorful) feedback regarding the progress and success of the stack update.

Complete the URLs - plus one round of testing

Add the token (?token=the-token-you-picked) to the two URLs you copied from Sigma's deployment outputs. Now they are ready to be shared with your client.

1. Test: starting up

Finally, just to make sure everything works (and avoid any unpleasant or awkward moments), open the starter-up URL in your browser.

Assuming your instance was already stopped, you would get a plaintext response:

stopped -> pending

Within a few seconds, the instance will enter running status and become ready (obviously, this transition won't be visible to the user; but that shouldn't really matter).

2. Test: stopping

Now open the stopper URL:

running -> stopping

As before, the stopped status will be reached in the background within a few seconds.

0. Test: does it work without the token - hopefully not?

The "unauthorized" response doesn't have a payload, so you may want to use curl or wget to verify this one:

janaka@DESKTOP-M314LAB:~ curl -v https://foobarbaz0.execute-api.us-east-1.amazonaws.com/ec2/stop
*   Trying 13.225.2.77...
* ...
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* ...
* ALPN, server accepted to use http/1.1

> GET /ec2/stop HTTP/1.1
> Host: foobarbaz0.execute-api.us-east-1.amazonaws.com
> User-Agent: curl/7.47.0
> Accept: */*
>

< HTTP/1.1 401 Unauthorized
< Content-Type: application/json
< Content-Length: 0
< Connection: keep-alive
< Date: Thu, 18 Jun 2020 06:14:58 GMT
< x-amzn-RequestId: ...

All good!

Now go ahead - share just those two token-included URLs with your client - or whichever third party you wish to delegate the EC2 control to; and ask them to use 'em wisely and keep 'em safe.

If the third party loses the URL(s), and the bad guy who got them starts playing with them unnecessarily (stopping and starting things rapidly - or at random hours - for example): just run an aws cloudformation update-stack with a new TOKEN - to cut off the old access! Then share the new token with your partner, obviously warning them to be a lot more careful.

You can also tear down the whole thing in seconds - without a trace of existence (except for the CloudWatch logs from previous runs) - via:

  • Sigma's Undeploy Project toolbar button or Project menu item,
  • aws cloudformation delete-stack on the CLI, or
  • the AWS console.

Lastly, don't forget to stay tuned for more serverless bites, snacks and full-course meals from our team!

Why you NEED MFT - "Managed" File Transfer - in this 2020 Decade

(Or, what the "Managed" part of MFT is all about)

Originally written for The MFT Gateway Blog; Apr 24, 2020

Not to sound opportunistic, but the whole COVID-19 situation is leading online businesses to boom like never before (source: BBC). With this, inevitably, the value of computerized trading and automated, secure trade document exchange comes boldly into the picture; and this is also the part where we say, "you should have adopted an MFT solution last year... but it's not yet too late!"

But I already have "File Transfer" - why "Managed"?

Let's take the best path to learning - by example.

Assume you need to send a long list of order items to your vendor. You and your vendor have decided to use the world-famous File Transfer Protocol (FTP) to exchange these docs. It's all very simple:

ftp
open your.partners.host port
cd incoming-orders/your-orders-folder
put your-orders-file-yyyy-MM-dd-HH-mm-ss.csv
exit

Done. The file is transferred.

And now, it's time for the questions.

Burning Questions: for Your Brow

Ten files a day? Fine. Hundred? Thousand?

COVID is snarling, and ironically your business is literally exploding. Now, each day, you have to send dozens of these files, to different FTP servers across your vendors - yes, it recently became plural when you decided to expand your online storefront.

Running a hundred ftp command scripts each day by hand is not exactly a pleasant experience. Plus, you learned the lesson when you mistakenly uploaded file order-A to partner B's server. So you set up a cool program, and script it to automagically send each file to the correct partner - as soon as you drop it into a monitored folder.

Did the whole file get through - unchanged?

Now the script is doing the job - like a boss. But is it doing the job right? Are you sure?

Say the file was in a CSV (spreadsheet-like) format. Do all the rows get transferred? What if the last hundred rows get dropped during the transfer (say, a network glitch) and nobody notices - not your script, not the ftp command, not your vendor?

Worse... what if a few bits got flipped, and a 10000-crate order got changed into a 90000 one?

Yes, we need to get a file "summary" from the vendor (trading partner) - and ensure it matches with the "summary" on our end. We need to compare the digest (hash; like MD5, SHA-1, SHA-256 etc.) of the final file on both ends, to make sure nothing got corrupted or tampered.

Did somebody else (secretly) tap the transferred file?

Phone taps are still fairly common, and computer network taps are no exception - especially when unprotected, plain-text protocols like FTP are in play.

You could encrypt the traffic by switching to FTP/S or SFTP so that the thief trying to tap your file would get just the encrypted (gibberish) blurb.

But... read along.

Who is at the other end? Your partner, or an imposter?

There are easy techniques (like DNS spoofing) to trick your system into connecting directly to an attacker's system - especially if you're not careful and attentive enough. You could be using end-to-end encryption and all the rest, yet still be sending the data right into the fox's den.

So we need a mechanism to ensure that only our vendor would be able to make sense of (decrypt) the data we send. Perhaps by encrypting it with a public key whose private key part is owned by him and him alone.

The most fundamental: Did your partner see the file?

Your script happily uploads hundreds of files each day; but does it guarantee that those vendors are even looking at those files? If you don't receive back a confirmation, how can you know if the file got to the wrong system - or got lost in the crowd - or some other nasty thing happened?

Yes, we need an acknowledgement (receipt) from the vendor that he got our file.

Burning Questions, Part 2: for Your Partner's Brow

Sadly, that's not the end of the story; your vendor-partner is even more skeptical - almost paranoid, to say the least.

Was it really that guy (or gal) who sent this file?

The imposter thing could work the other way too. Somebody impersonating you could send a funny order to your vendor; and the next thing you know, you're opening your door to a thousand crates of icky goo.

So your partner needs your unique signature at the bottom of that file - to ensure that it came from you.

Were those numbers in the file, corrupted or tampered while in transit?

Already covered this from your own perspective - digest. Moving on.

Did somebody else tap the file?

Also covered - public-key encryption.

How the heck am I supposed to do anything useful in my day - when I have to keep on decrypting, digesting, verifying and acknowledging dozens of files from that guy (or gal)?

Yes. Your partner needs to seriously think of automating the whole thing. Like you did.

Enter: "Managed" File Transfer!

Say you hired a whole development team, worked day and night, and implemented a solution that covers all of the above concerns:

  • adds a signature alongside the file content - so your vendor can verify that it indeed came from you
  • encrypts the file using your vendor's public key - so only your vendor can decrypt it, only using his private key
  • sends a receipt back when a file is received, so you know for sure the file actually reached your vendor's system
  • calculates a digital digest (hash) of the received file and sends it back with the receipt, so you can compare it with your own hash - and make sure the file is intact in your vendor's side
  • and most importantly, automates all of these steps and actions and raises red flags if anything fishy goes on

Congratulations!

You just wrote your own "managed" file transfer solution!

But why did you?! MFT is already out there!

If you check out the features of AS2, for example, you would be surprised (or maybe not?) - it already covers all of those concerns you painfully scratched your brow over!

And it already implements everything that your hired dev team had to go through, day and night!

If you check any other protocol (AS4, OFTP, ??, and the like), the observation will be more or less the same. But AS2 is the most sought-after MFT protocol in the business world (heck, Wal-Mart enforces it!), and going with the trend always has its benefits.

So, that's the beauty of Managed File Transfer, MFT.

It covers the end-to-end delivery and acknowledgement/reporting of files; so once you hand over a file to your MFT system, rest assured that:

  • either the file will be successfully delivered and acknowledged by your partner's system,
  • or (if something goes wrong) your team will be duly notified.

No more staring at ftp folder listings or long dashboards; the whole flow is fully automated and streamlined.

Most MFT platforms also offer value added services (VAS) on top of the basic MFT protocol; such as document (usu. EDI) translation, validation, auto-generation of response documents, and various integrations with third-party systems like CRMs, warehousing and BPMSs.

Meanwhile others give further flexibility by offering intuitive low-level interfacing/integration options; like SFTP- or AWS S3-based triggering of file send-outs, REST APIs for fully automating the management and monitoring of the send/receive process, email-based instant notifications, and so forth.

Okay, back to work - your orders are waiting!

Now, before you go back to uploading those files via FTP, you have a choice to make:

  • continue doing business as usual - upload, wait, call the vendor if nothing happens, rinse, and repeat for the next file?
  • hire an exorbitantly priced dev team and build your own "managed" file transfer solution?
  • install, or more easily, sign up for one of those popular MFT services out there - and dedicate your valuable time to developing your booming business - rather than bothering with file exchanges?

I guess now you know the answer.

Good luck! And we're also waiting eagerly for your queries!

Serverless Monitoring: What do we Monitor when the Server Goes Away?

Originally written for The SLAppForge Blog; Feb 17, 2020

Monitoring your serverless application is crucial - especially while it is handling your production load. This brings us to today's topic: how to effectively monitor a serverless application.

Serverless = Ephemeral

Serverless environments are inherently ephemeral; once the execution completes, you don't have much left behind to investigate.

There's so much exciting talk about container reuse and keep-warm, but theoretically every single one of your function invocations could be a cold start. Even with reuse, you don't get access to the environment to analyze the previous invocation until the next one comes in.

So, to effectively monitor a serverless system, we have to gather as much data as possible - while it is actually handling a request.

Monitoring the Ephemeral. The Dynamic. The Inaccessible.

Serverless offers the benefit of less management, through higher levels of abstraction; for obvious reasons, that comes hand in hand with the caveat of less visibility.

In short, you pretty much have to either:

  • depend on what the serverless platform provider discloses to you, or
  • write your own monitoring utilities (telemetry agents, instrumentations etc.) to squeeze out more metrics from the runtime

Log Analysis

Logs are the most common means of monitoring, auditing and troubleshooting traditional applications; not surprisingly, it holds true in serverless as well.

What's in a Log?

All major serverless platforms offer comprehensive logging for their FaaS elements: CloudWatch Logs for AWS Lambda, StackDriver Logging for Google Cloud Functions, Azure Monitor Logs for Azure Functions, CloudMonitor for Alibaba Cloud Functions, and so forth.

The logs usually contain:

  • execution start marker, with a unique ID for each request/invocation
  • all application logs generated during execution, up until the point of completion of the invocation
  • execution summary: duration, resources used, billed quota etc.

Serverless is not just FaaS; and logging is not just about function executions. Other services like storage, databases and networking also provide their share of logging. While these are mostly associated with access and security auditing (e.g. CloudTrail auditing), they can still be merged with FaaS logs to enable more verbose, detailed monitoring.

Distributed Logging: A Nightmare?

One problem with the inherently distributed nature of serverless systems, is that these logs are usually scattered all over the place; unlike in a traditional monolith where all logs would be neatly arranged in one or a few well-known files.

Imagine your backend consists of five Lambdas, which the web client invokes in a particular sequence to get a job done - or are coordinated through Step Functions or Destinations; the logs for a single "call" could span five log streams in five log groups.

It may sound simple enough during development; but when your production app goes viral and starts receiving hundreds of concurrent calls, tracking a single client journey through the logs could become harder than finding a needle in a haystack.

Log Aggregation

This is where log aggregation services come into play. (In fact, serverless was lucky - because log management had already received a boost, thanks to microservice architecture.) Services like Coralogix and Dashbird will ingest your logs (via push or pull) and allow you to perform filters, aggregations, summarizations etc. as if they were from one or a few sources.

With visibility into long-term data, aggregation services can - and do - provide more intelligent outputs; such as real-time alerts on predefined error levels/codes, and stability- or security-oriented anomaly detection through pattern recognition, machine learning etc.

Resource Usage

Even with your application logic running like clockwork, the system could start failing if it is under-provisioned and runs out of resources; or become a budget-killer if over-provisioned.

Additionally, unusual patterns in resource usage may also indicate anomalies in your applications - such as attacks, misuse and other exploits.

What can we Measure?

  • memory usage
  • execution time or latency: given that FaaS invocations are under strict timeout constraints, time becomes a critical resource. You do not want your function to time out before completing its job; but you also do not want it to remain hung indefinitely over a bad database connection that is going to take two minutes to time out on its own.
  • compute power used: in many platforms, allocated compute power grows in proportion to memory, so the product of allocated memory and execution time is a good relative measure of the total compute power consumed by the request. In fact, most platforms actually bill you by GB-seconds, where GB refers to memory allocation (see the sketch right after this list).
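
As a rough, back-of-the-envelope illustration of that GB-seconds idea, here is a minimal Node.js sketch that estimates the cost of a single invocation from its memory allocation and billed duration. The price constant is only an assumed, illustrative figure - substitute your platform's actual rate.

    // Rough GB-seconds cost estimate for a single invocation.
    // NOTE: the price constant is illustrative; check your platform's current pricing.
    const PRICE_PER_GB_SECOND = 0.0000166667; // assumed USD rate

    function invocationCost(memoryMb, billedDurationMs) {
      const gbSeconds = (memoryMb / 1024) * (billedDurationMs / 1000);
      return gbSeconds * PRICE_PER_GB_SECOND;
    }

    // e.g. a 512 MB function billed for 1200 ms:
    console.log(invocationCost(512, 1200).toFixed(8)); // ≈ 0.00001000 USD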

Resource Isolation

Serverless invocations are isolated, which means one failure cannot affect another request. Sadly, it also means that each runtime instance should be able to handle the largest possible input/request on its own, as there is virtually no resource sharing across instances; one request handler cannot "borrow" memory from another, as is the case in monolithic apps with multiple services on the same runtime/VM.

This is bad in the sense that it denies you the luxury of maintaining shared resources (e.g. connection pools, memory-mapped file caches, etc.) pooled across requests. But at the same time, it means that managing and monitoring per-request resources becomes easier; each time you allocate or measure it, you are doing it for a single request.

From Logs

As mentioned before, serverless logs usually contain execution summaries stating the allocated memory vs. the maximum memory used, and the exact vs. billed execution time.

From Runtime Itself

Your application runs as a Linux process, so it is always possible to grab resource usage data from the runtime itself - either via standard language/runtime calls (e.g. NodeJS process.memoryUsage()) or by directly introspecting OS-level semantics (e.g. /proc/self/stat). Some aspects like execution time are also provided by the serverless runtime layer - such as remaining time via context.getRemainingTimeInMillis() on Lambda.
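
For instance, here is a minimal Node.js Lambda handler sketch that reports its own resource usage just before returning, using the process.memoryUsage() and context.getRemainingTimeInMillis() calls mentioned above (the JSON field names are just illustrative):

    // A minimal Node.js Lambda handler that reports its own resource usage
    // before finishing - assuming the standard async handler signature.
    exports.handler = async (event, context) => {
      // ... actual business logic goes here ...

      const mem = process.memoryUsage();                     // heap/RSS figures from the Node runtime
      const remaining = context.getRemainingTimeInMillis();  // time budget left, from Lambda

      console.log(JSON.stringify({
        requestId: context.awsRequestId,
        rssBytes: mem.rss,
        heapUsedBytes: mem.heapUsed,
        remainingTimeMs: remaining,
      }));

      return { status: 'done' };
    };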

From Serverless Platform Metrics

Most platforms keep track of resource metrics themselves. AWS CloudWatch Metrics is perhaps the best example.

While CloudWatch does not yet offer memory or compute graphs, third-party tooling like SigmaDash can compute them on your behalf.

Platforms usually provide better metrics for non-FaaS systems than they do for FaaS; probably because of the same challenges they face in accurately monitoring the highly dynamic environment. But they are constantly upgrading their techniques, so we can always hope for better.

For non-compute services like storage (e.g. AWS S3), persistence (e.g. AWS DynamoDB) and API hosting (e.g. AWS API Gateway), platforms offer detailed metrics. The upside of these built-in metrics is that they offer different granularity levels, plus basic filtering and summarization, out of the box. The downside, often, is that they are not quite real-time (generally several seconds behind the actual request).

Also, for obvious reasons, longer analysis periods mean lower granularity. Although you can request higher-precision data, or report custom metrics at desired precisions, it will cost you more - and, at serverless scales, such costs can add up fairly quickly.
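
If the built-in granularity is not enough, you can publish your own metrics. The following is a minimal sketch using the Node.js AWS SDK's CloudWatch putMetricData call; the 'MyApp' namespace, 'QueueDepth' metric name and the 1-second resolution are illustrative choices, and high-resolution metrics do incur the extra cost noted above.

    // A sketch of reporting a custom, high-resolution metric to CloudWatch
    // from inside a function, using the Node.js AWS SDK (v2).
    const AWS = require('aws-sdk');
    const cloudwatch = new AWS.CloudWatch();

    async function reportQueueDepth(depth) {
      // 'MyApp' and 'QueueDepth' are illustrative names.
      await cloudwatch.putMetricData({
        Namespace: 'MyApp',
        MetricData: [{
          MetricName: 'QueueDepth',
          Value: depth,
          Unit: 'Count',
          StorageResolution: 1, // 1-second resolution; costs more than the standard 60
        }],
      }).promise();
    }

Published this way, the metric appears under the chosen namespace in CloudWatch, where you can graph it or attach alarms like any built-in metric.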

Invocations, Errors and Throttles

While logs and resource usage can reveal anomalies during execution, invocation, error and throttle counts are more straightforward monitoring metrics, tied to the actual end results of application invocations. In that sense, resource monitoring can be treated as oriented more towards performance, whereas error/throttle monitoring leans more towards correctness, stability and scalability.

You can usually grab these numbers from the built-in metrics of the platform itself.

Invocations-to-error Ratio

Similar to signal-to-noise ratio (SNR), this is a measure of how "efficient" or "robust" your application is; the higher it is, the more requests your application can successfully serve without running into an error.

Of course, the validity of this serverless monitoring metric depends on the actual nature and intent of your code, as well as the nature of its inputs; if you receive lots of erratic or failure-prone inputs, and you are actually supposed to fail the invocation in such cases, the error count would naturally be high - and the ratio correspondingly low - without indicating anything wrong with your application.

Throttles

Presence of throttles could indicate one of a few things:

  • You have (perhaps mistakenly) throttled your function (or the API/interface) below its regular concurrency limit.
  • Your application is under a denial-of-service (DoS) attack.
  • Your application's scalability goes beyond what the serverless platform can accommodate (maybe it has gone viral); you need to re-architect it to batch up invocations, introduce back-offs to safely deal with rejections, etc.

Note that throttling issues are not always limited to front-facing elements; if you are asynchronously triggering a function from an internal data stream (e.g. a Lambda with a Kinesis trigger) you could run into throttling, based on the dynamics of the rest of the application.

Also, throttling is not always in front of FaaS. If a third-party API that your application invokes (e.g. Amazon's own seller APIs) gets throttled, you could run into issues even without an apparent increase in the fronting traffic, events or invocation counts. Worse, it could lead to chain reactions or cascading failures - where throttling of one service can flow down the execution path to others, and eventually to the user-facing components.

Once you are at the upper end of scalability, only a good combination of API knowledge, fail-safe measures and serverless monitoring techniques - logs, throttles and runtime latency measurements - can save you from such scenarios.
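
As one example of such a fail-safe measure, here is a minimal exponential back-off sketch in Node.js for a throttle-prone downstream call. The HTTP 429 check and callSellerApi are placeholders - adapt them to whatever API and error signalling you actually face.

    // A minimal exponential back-off sketch for a throttle-prone downstream call.
    // `call` is any async function; the 429 check stands in for whatever
    // throttling signal the actual API uses.
    async function withBackoff(call, maxRetries = 5) {
      for (let attempt = 0; attempt <= maxRetries; attempt++) {
        try {
          return await call();
        } catch (err) {
          const throttled = err.statusCode === 429; // adapt to the API's own throttle signal
          if (!throttled || attempt === maxRetries) throw err;
          const delayMs = Math.min(1000 * 2 ** attempt, 30000); // 1s, 2s, 4s, ... capped at 30s
          await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
      }
    }

    // Usage (callSellerApi being a placeholder for the real third-party call):
    // const result = await withBackoff(() => callSellerApi(order));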

Instrumenting

So far, all the serverless monitoring techniques we discussed have depended on what the runtime and platform offer by default, or off the shelf. However, people often find these inadequate for monitoring serious production applications where:

  • service health and performance are critical,
  • early detection of anomalies and threats is vital, and
  • alerts and recovery actions need to be as fine-grained and real-time as possible.

To get beyond what the platform offers, you usually need to instrument your runtime and grab additional serverless monitoring insights - either with your own in-process telemetry code, or with the agents and libraries of a third-party monitoring provider.

For FaaS, this obviously comes with implications for performance - your function runtime now has to spare some cycles to gather and report the metrics; in many cases, the tools also require you to modify your application code. Often these changes are subtle enough - such as importing a library, calling a third-party method, etc.; nevertheless, they go against developers' and dev-ops engineers' pipe dream of code-free monitoring.
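
To illustrate the kind of "subtle" change involved, the snippet below wraps a handler with a purely hypothetical monitoring library; acme-monitor and its wrap() call do not exist, they merely stand in for whichever vendor's agent you adopt.

    // Purely illustrative: 'acme-monitor' is a hypothetical package, and wrap()
    // is a made-up API, standing in for a typical vendor monitoring agent.
    const monitor = require('acme-monitor'); // hypothetical library

    const businessLogic = async (event, context) => {
      // ... your original handler code, unchanged ...
      return { status: 'ok' };
    };

    // The wrapper would time the invocation and ship its metrics to the vendor's backend.
    exports.handler = monitor.wrap(businessLogic); // hypothetical API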

Realizing the demand for advanced serverless monitoring and diagnostics, cloud platforms themselves have also come forth with instrumentation options:

  • AWS X-Ray ships with Lambda by default, and can be enabled via a simple configuration with no code change. Once enabled, it can report the latencies of different invocation phases: initialization, individual AWS API/SDK calls, etc. With a bit of custom code it can also capture calls to any downstream HTTP service - which can come in handy when monitoring and troubleshooting latency issues (a sketch follows this list).
  • Google Cloud Functions offers StackDriver Monitoring for advanced analytics via built-in timing, invocation and memory metrics; as well as custom metrics reporting and alerting.
  • Azure Functions has its own Azure Application Insights service offering performance analytics.
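
For the X-Ray case above, the "bit of custom code" is typically just a couple of capture calls from the X-Ray SDK for Node.js - roughly along these lines:

    // Roughly what the X-Ray instrumentation code change looks like in Node.js.
    const AWSXRay = require('aws-xray-sdk-core');

    // Record every AWS SDK call (DynamoDB, S3, ...) as a subsegment on the trace.
    const AWS = AWSXRay.captureAWS(require('aws-sdk'));

    // Also capture outbound calls to any downstream HTTP(S) service.
    AWSXRay.captureHTTPsGlobal(require('https'));
    const https = require('https');

    exports.handler = async (event) => {
      // ... calls made through `AWS` or `https` now show up on the trace timeline ...
    };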

However, the best work related to instrumentation has come from third-party providers, such as the ones mentioned earlier. These usually require you to subscribe to their services and configure your serverless platforms to report metrics to theirs; in return, they provide analysis results, insights and intelligence through their own dashboards.

Serverless Monitoring, up against Privacy and Security Concerns

The obvious problem with delegating serverless monitoring responsibilities to third parties is that you need to guarantee the safety of the data you are sharing; after all, it is your customers' data.

Even if you implicitly trust the third party, you need to take precautions to minimize the damage in case the monitoring data gets leaked - for the sake of your users as well as the security and integrity of your application.

  • Anonymizing and screening your logs for sensitive content,
  • using secure channels for metrics sharing, and
  • always sticking to the principle of least privilege during gathering, sharing and analysis of data

will get you started on a robust yet safe path to efficiently monitoring your serverless application.
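
For the first of those points, even a very small scrubbing helper goes a long way. The Node.js sketch below (with illustrative patterns - extend them to match your own data) masks obvious sensitive values before they ever reach the log stream.

    // A tiny log-scrubbing helper (a sketch; extend the patterns to suit your data)
    // that masks obvious sensitive values before logging.
    const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
    const CARD = /\b\d{13,16}\b/g;

    function scrub(value) {
      return JSON.stringify(value)
        .replace(EMAIL, '<email>')
        .replace(CARD, '<card>');
    }

    // Usage: log the scrubbed form instead of the raw payload.
    console.log(scrub({ customer: 'jane@example.com', card: '4111111111111111' }));
    // -> {"customer":"<email>","card":"<card>"}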