
Wednesday, June 24, 2020

One Bite of Real-world Serverless: Controlling an EC2 with Lambda, API Gateway and Sigma

Originally written for The SLAppForge Blog; Jun 19, 2020

I have been developing and blogging about Sigma, the world's first serverless IDE for serverless developers - but haven't really been using it for my non-serverless work. That was why, when a (somewhat) peculiar situation came up recently, I decided to give Sigma a full-scale spin.

The Situation: a third party needs to control one of our EC2 instances

Our parent company AdroitLogic sells an enterprise B2B messaging platform called AS2 Gateway - which comes as a simple SaaS subscription as well as an on-premise or cloud-installable dedicated deployment. (Meanwhile, part of our own team is also working on making it a completely serverless solution - we'll probably be bothering you with a whole lotta blog posts on that too, pretty soon!)

One of our potential clients needed a customized copy of the platform, first as a staging instance in our own AWS account; they would configure and test their integrations against it, before deciding on a production deployment - under a different cloud platform of their choice, in their own realm.

Their work time zone is several hours ahead of ours; keeping aside the clock skew on emails and Zoom calls, the staging instance had to be made available during their working hours, not ours.

Managing the EC2 across time zones: the Options

Obviously, we did have a few choices:

  • keep the instance running 24/7, so our client can access it anytime they want - obviously the simplest but also the costliest choice. True, one hour of EC2 time is pretty cheap - less than half a dollar - but it tends to add up pretty fast; while we continue to waste precious resources on a mostly-idling EC2 VM instance.
  • get up at 3 AM (figure of speech) every morning and launch the instance; and shut it down when we sign off - won't work if our client wishes to work late nights; besides they don't get the chance to do the testing every day, so there's still room for significant waste
  • fix up some automated schedule to start and stop the instance - pretty much the same caveats as before (minus the "getting up at 3 AM" part)
  • delegate control of the instance to our client, so they can start and stop it at their convenience

Evidently, the last option was the most economical for us (remember, the client is still in the evaluation stage - and may decide not to go with us, after all), and also fairly convenient for them (just two extra steps, before and after work, plus a few seconds' startup delay).

Client-controlled EC2: how to KISS it, the right way

But on the other hand, we didn't want to overcomplicate the process either:

  • Giving them access to our AWS console was out of the question - even with highly constrained access.
  • A key pair with just ec2:StartInstances and ec2:StopInstances IAM permissions on the respective instance ID would have been ideal; but it would still mean they would have to either install the AWS CLI, or write (or run) some custom code snippets, every time they wanted to control the instance.
  • AWS isn't, and wasn't going to be, their favorite cloud platform anyway; so any AWS-specific steps would have been an unnecessary overhead for them.

KISS, FTW!

Serverless to the rescue!

Most probably, you are already screaming out the solution: a pair of custom HTTP (API Gateway) endpoints backed by dedicated Lambdas (we're thinking serverless, after all!) that would do that very specific job - and have just that permission, nothing else, in keeping with the preached-by-everybody principle of least privilege.

Our client would just have to invoke the start/stop URL (with a simple, random auth token that you choose - for extra safety), and the EC2 instance will obey promptly.

  • No more AWS or EC2 semantics for them,
  • our budget runs smooth,
  • they have full control over the testing cycles, and
  • I get to have a good night's sleep!

ec2-control: writing it with Sigma

There were a few points in this project that required some advanced voodoo on the Sigma side:

  • Sigma does not natively support EC2 APIs (why should it; it's supposed to be for serverless computing 😎) so, in addition to writing the EC2 SDK calls ourselves, we would need to add a custom permission to each function's policy - to make up for what the automatic policy generation cannot cover.
  • The custom policy would need to be as narrow as possible: just ec2:StartInstances and ec2:StopInstances actions, on just our client's staging instance. (If the URL somehow gets out and some remote hacker out there gains control of our function, we don't want them to be able to start and stop random - or perhaps not-so-random - instances in our AWS account!)
  • Both the IAM role and the function itself, would need access to the instance ID (for policy minimization and the actual API call, respectively).
  • For reusability (we devs really love that, don't we? 😎) it should be possible to specify the instance ID (and the auth token) on a per-deployment basis - without embedding the values in the code or configurations, which would get checked into version control.

Template Editor FTW

Since Sigma uses CloudFormation under the hood, the solution is pretty obvious: define two template parameters for the instance ID and token, and refer them in the functions' environment variables and the IAM roles' policy statements.

Sigma does not natively support CloudFormation parameters (our team recently started working on it, so perhaps it may actually be supported at the time you read this!) but it surely allows you to specify them in your custom deployment template - which would get nicely merged into the final deployment template that Sigma would run.

Some premium bad news, and then some free good news

At the time of this writing, both the template editor and the permission manager were premium features of Sigma IDE. So if you start writing this on your own, you would either need to pay a few bucks and upgrade your account, or mess around with Sigma's configuration files to hack those pieces in (which I won't say is impossible 😎).

(After writing this project, I managed to convince our team to enable the permission manager and template editor for the free tier as well 🤗 so, by the time you read this, things may have already taken a turn for the better!)

But, thanks to the way Sigma actually works, not having a premium account does not prevent you from deploying a project whose template or permissions have already been customized by someone else; and my project is already on GitHub, so you can simply open it in your Sigma IDE and deploy it straightaway.

"But how do I provide my own instance ID and token when deploying?"

Patience. Read on.

"Old, but not obsolete" (a.k.a. more limitations, but not impossible)

As I said before, Sigma didn't natively support CloudFormation parameters; so even if you add them to the custom template, Sigma would just blindly merge and deploy the whole thing - without asking for actual values of the parameters!

While this could have been a cause for deployment failures in some cases, lucky for us, here it doesn't cause any trouble. But still, we need to provide correct, custom values for that instance ID and protection token!

Amazingly, CloudFormation allows you to just update the input parameters of an already completed deployment - without having to touch or even re-submit the deployment template:

aws cloudformation update-stack --stack-name Whatever-Stack \
  --use-previous-template --capabilities CAPABILITY_IAM \
  --parameters \
  ParameterKey=SomeKey,ParameterValue=SomeValue ...

(That command is already there, in my project's README.)

So our plan is simple:

  1. Deploy the project via Sigma, as usual.
  2. Run an update from CloudFormation side, providing just the correct instance ID and your own secret token value.

Enough talk, let's code!

Warning: You may not actually be able to write the complete project on your own, unless we have enabled custom template editing for free accounts - or you already have a premium account.

If you are just looking to deploy a copy on your own, simply open my already existing public project from https://github.com/janakaud/ec2-control - and skip over to the Ready to Deploy section.

1. ec2-start.js, a NodeJS Lambda

Note: If you use a different name for the file, your custom template would need to be adjusted - don't forget to check the details when you get to that point.

const {ec2} = require("./util");
exports.handler = async (event) => ec2(event, "startInstances", "StartingInstances");

API Gateway trigger

After writing the code,

  1. drag-n-drop an API Gateway entry from the left-side Resources pane, onto the event variable of the function,
  2. enter a few details -
    1. an API name (say EC2Control),
    2. path (say /start, or /ec2/start),
    3. HTTP method (GET would be easiest for the user - they can just paste a link into a browser!)
    4. and a stage name (say prod)
  3. under Show Advanced, turn on Enable Lambda Proxy Integration so that we will receive the query parameters (including the auth token) in the request (see the sample event after this list)
  4. and click Inject.
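With proxy integration enabled, the event object handed to the Lambda carries the query string directly; a trimmed-down sample event (the values here are purely illustrative) would look something like this - and this is exactly where util.js, shown later, picks the token from:

{
    "resource": "/ec2/start",
    "path": "/ec2/start",
    "httpMethod": "GET",
    "queryStringParameters": {
        "token": "your-token-goes-here"
    }
}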

Custom permissions tab

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": {
                "Fn::Sub": "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2ID}"
            },
            "Action": [
                "ec2:StartInstances"
            ]
        }
    ]
}

2. ec2-stop.js, a NodeJS Lambda

Note: As before, if your filename is different, update the key in your custom template accordingly - details later.

const {ec2} = require("./util");
exports.handler = async (event) => ec2(event, "stopInstances", "StoppingInstances");

API Gateway trigger

Just like before, drag-n-drop and configure an APIG trigger.

  1. But this time, make sure that you select the API name and deployment stage via the Existing tabs - instead of typing in new values.
  2. Resource path would still be a new one; pick a suitable pathname as before, like /ec2/stop (consistent with the previous).
  3. Method is also your choice; it's natural to stick to the previously used one.
  4. Don't forget to Enable Lambda Proxy Integration too.

Custom permissions tab

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": {
                "Fn::Sub": "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:instance/${EC2ID}"
            },
            "Action": [
                "ec2:StopInstances"
            ]
        }
    ]
}

3. util.js, just a NodeJS file

const ec2 = new (require("aws-sdk")).EC2();

const EC2_ID = process.env.EC2_ID;
if (!EC2_ID) {
    throw new Error("EC2_ID unavailable");
}
const TOKEN = process.env.TOKEN;
if (!TOKEN) {
    throw new Error("TOKEN unavailable");
}

exports.ec2 = async (event, method, resultKey) => {
    let tok = (event.queryStringParameters || {}).token;
    if (tok !== TOKEN) {
        return {statusCode: 401};
    }
    let data = await ec2[method]({InstanceIds: [EC2_ID]}).promise();
    return {
        statusCode: 200,    // Lambda proxy integration expects an explicit status code
        headers: {"Content-Type": "text/plain"},
        body: data[resultKey].map(si => `${si.PreviousState.Name} -> ${si.CurrentState.Name}`).join("\n")
    };
};

Code is pretty simple - we aren't doing much, just validating the incoming token, calling the EC2 API, and returning the state transition result (e.g. running -> stopping) back to the caller as confirmation; for example, it will appear right in our client's browser window.

(If you were wondering why we didn't add aws-sdk as a dependency despite require()ing it; that's because aws-sdk is already available in the standard NodeJS Lambda environment. No need to bloat up our deployment package with a redundant copy - unless you wish to use some cutting-edge feature or SDK component that was released just last week.)

The better part of the coordinating glue is in the custom permissions and the template:

4. Custom template

{
  "Parameters": {
    "EC2ID": {
      "Type": "String",
      "Default": ""
    },
    "TOKEN": {
      "Type": "String",
      "Default": ""
    }
  },
  "Resources": {
    "ec2Start": {
      "Properties": {
        "Environment": {
          "Variables": {
            "EC2_ID": {
              "Ref": "EC2ID"
            },
            "TOKEN": {
              "Ref": "TOKEN"
            }
          }
        }
      }
    },
    "ec2Stop": {
      "Properties": {
        "Environment": {
          "Variables": {
            "EC2_ID": {
              "Ref": "EC2ID"
            },
            "TOKEN": {
              "Ref": "TOKEN"
            }
          }
        }
      }
    }
  }
}

Note: If you used some other/custom names for the Lambda code files, the two object keys (ec2Start, ec2Stop) under Resources would be different - so it's always better to double-check against the auto-generated template and ensure that the merged template displays the properly-merged final version.

Deriving that one on your own isn't total voodoo magic either: after writing the rest of the project, just have a look at the auto-generated template tab, and write up a custom JSON whose pieces would merge into the right places, yielding the expected final template.

We accept the EC2ID and TOKEN as parameters, and merge them into the Environment.Variables property of the Lambda definitions. (The customized IAM policies are already referencing the parameters via Fn::Sub so we don't need to do anything for them here.)

Once we have the template editor in the free tier, you would certainly have many more cool concepts to play around with - and probably also dig up quite a few bugs (full disclaimer: I was the one that initially wrote that feature!) which you would promptly report to us! 🤗

Ready to Deploy

When all is ready, click Deploy Project on the toolbar (or Project menu).

(If you came here on the fast-track (by directly opening my project from GitHub), Sigma may prompt you to enter values for the EC2_ID and TOKEN environment variables - just enter some dummy values; we are changing them later anyway.)

If all goes well, Sigma will build the project and deploy it, and you would end up with a Changes Summary popup with an outputs section at the bottom containing the URLs of your API Gateway endpoints.

If you accidentally closed the popup, you can get the outputs back via the Deployment tab of the Project Info window.

Copy both URLs - you would be sending these to your client.

Sigma's work is done - but we're not done yet!

Update the parameters to real values

Grab the EC2-generated identifier of your instance, and find a suitable value for the auth token (perhaps a uuid -v4?).
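(If you don't have the uuid tool at hand, any decent random-string generator will do; for instance, a quick one-liner with Node's built-in crypto module - just an illustration, use whatever generator you trust:)

// gen-token.js - run with: node gen-token.js
console.log(require("crypto").randomBytes(24).toString("hex"));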

Via AWS CLI

If you have AWS CLI - which is really awesome, by the way - the next step is just one command; as mentioned in the README as well:

aws cloudformation update-stack --stack-name ec2-control-Stack \
  --use-previous-template --capabilities CAPABILITY_IAM --parameters \
  ParameterKey=EC2ID,ParameterValue=i-0123456789abcdef \
  ParameterKey=TOKEN,ParameterValue=your-token-goes-here

(If you copy-paste, remember to change the parameter values!)

We tell CloudFormation "hey, I don't need to change my deployment definitions but want to change the input parameters; so go and do it for me".

The update usually takes just a few seconds; if needed, you can confirm its success by calling aws cloudformation describe-stacks --stack-name ec2-control-Stack and checking the Stacks.0.StackStatus field.
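If you prefer doing that check from Node instead (say, inside a helper script), a rough equivalent using the AWS SDK would be something like the following - the stack name and region here are assumptions, adjust to match yours:

const AWS = require("aws-sdk");
const cfn = new AWS.CloudFormation({region: "us-east-1"});

// logs e.g. "UPDATE_COMPLETE" once the parameter update has gone through
cfn.describeStacks({StackName: "ec2-control-Stack"}).promise()
    .then(res => console.log(res.Stacks[0].StackStatus))
    .catch(err => console.error("stack lookup failed", err));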

Via the AWS Console

If you don't have the CLI, you can still do the update via the AWS Console; while it is a bit overkill, the console provides more intuitive (and colorful) feedback regarding the progress and success of the stack update.

Complete the URLs - plus one round of testing

Add the token (?token=the-token-you-picked) to the two URLs you copied from Sigma's deployment outputs. Now they are ready to be shared with your client.

1. Test: starting up

Finally, just to make sure everything works (and avoid any unpleasant or awkward moments), open the start URL in your browser.

Assuming your instance was already stopped, you would get a plaintext response:

stopped -> pending

Within a few seconds, the instance will enter running status and become ready (obviously, this transition won't be visible to the user; but that shouldn't really matter).

2. Test: stopping

Now open the stopper URL:

running -> stopping

As before, the stopped status will be reached in the background within a few seconds.

0. Test: does it work without the token - hopefully not?

The "unauthorized" response doesn't have a payload, so you may want to use curl or wget to verify this one:

janaka@DESKTOP-M314LAB:~ curl -v https://foobarbaz0.execute-api.us-east-1.amazonaws.com/ec2/stop
*   Trying 13.225.2.77...
* ...
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* ...
* ALPN, server accepted to use http/1.1

> GET /ec2/stop HTTP/1.1
> Host: foobarbaz0.execute-api.us-east-1.amazonaws.com
> User-Agent: curl/7.47.0
> Accept: */*
>

< HTTP/1.1 401 Unauthorized
< Content-Type: application/json
< Content-Length: 0
< Connection: keep-alive
< Date: Thu, 18 Jun 2020 06:14:58 GMT
< x-amzn-RequestId: ...

All good!

Now go ahead - share just those two token-included URLs with your client - or whatever third party you wish to delegate the EC2 control to; and ask them to use 'em wisely and keep 'em safe.

If the third party loses the URL(s), and the bad guy who got them starts playing with them (stopping and starting things rapidly - or at random hours - for example): just run an aws cloudformation update-stack with a new TOKEN to cut off the old access! Then share the new token with your partner, obviously warning them to be a lot more careful.

You can also tear down the whole thing in seconds - without a trace of existence (except for the CloudWatch logs from previous runs) - via:

  • Sigma's Undeploy Project toolbar button or Project menu item,
  • aws cloudformation delete-stack on the CLI, or
  • the AWS console.

Lastly, don't forget to stay tuned for more serverless bites, snacks and full-course meals from our team!

Serverless Monitoring: What do we Monitor when the Server Goes Away?

Originally written for The SLAppForge Blog; Feb 17, 2020

Monitoring your serverless application is crucial - especially while it is handling your production load. This brings us to today's topic: how to effectively monitor a serverless application.

Serverless = Ephemeral

Serverless environments are inherently ephemeral; once the execution completes, you don't have much left behind to investigate.

There's so much exciting talk about container reuse and keep-warm, but theoretically every single one of your function invocations could be a cold start. Even with reuse, you don't get access to the environment to analyze the previous invocation, until the next one comes in.

So, to effectively monitor a serverless system, we have to gather as much data as possible - while it is actually handling a request.

Monitoring the Ephemeral. The Dynamic. The Inaccessible.

Serverless offers the benefit of less management, through higher levels of abstraction; for obvious reasons, that comes hand in hand with the caveat of less visibility.

In short, you pretty much have to either:

  • depend on what the serverless platform provider discloses to you, or
  • write your own monitoring utilities (telemetry agents, instrumentations etc.) to squeeze out more metrics from the runtime

Log Analysis

Logs are the most common means of monitoring, auditing and troubleshooting traditional applications; not surprisingly, the same holds true in serverless as well.

What's in a Log?

All major serverless platforms offer comprehensive logging for their FaaS elements: CloudWatch Logs for AWS Lambda, StackDriver Logging for Google Cloud Functions, Azure Monitor Logs for Azure Functions, CloudMonitor for Alibaba Cloud Functions, and so forth.

The logs usually contain:

  • execution start marker, with a unique ID for each request/invocation
  • all application logs generated during execution, up until the point of completion of the invocation
  • execution summary: duration, resources used, billed quota etc.
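In AWS Lambda, for example, the CloudWatch log stream for a single invocation typically carries entries along these lines (the request ID and figures below are purely illustrative):

START RequestId: 1a2b3c4d-5678-90ab-cdef-1a2b3c4d5e6f Version: $LATEST
... your application's own log lines ...
END RequestId: 1a2b3c4d-5678-90ab-cdef-1a2b3c4d5e6f
REPORT RequestId: 1a2b3c4d-5678-90ab-cdef-1a2b3c4d5e6f  Duration: 102.25 ms  Billed Duration: 200 ms  Memory Size: 128 MB  Max Memory Used: 48 MB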

Serverless is not just FaaS; and logging is not just about function executions. Other services like storage, databases and networking also provide their share of logging. While these are mostly associated with access and security auditing (e.g. CloudTrail auditing), they can still be merged with FaaS logs to enable more verbose, detailed monitoring.

Distributed Logging: A Nightmare?

One problem with the inherently distributed nature of serverless systems is that these logs are usually scattered all over the place; unlike in a traditional monolith where all logs would be neatly arranged in one or a few well-known files.

Imagine your backend consists of five Lambdas, which the web client invokes in a particular sequence to get a job done - or are coordinated through Step Functions or Destinations; the logs for a single "call" could span five log streams in five log groups.

It may sound simple enough during development; but when your production app goes viral and starts receiving hundreds of concurrent calls, tracking a single client journey through the logs could become harder than finding a needle in a haystack.

Log Aggregation

This is where log aggregation services come into play. (In fact, serverless was lucky - because log management had already received a boost, thanks to microservice architecture.) Services like Coralogix and Dashbird will ingest your logs (via push or pull) and allow you to perform filters, aggregations, summarizations etc. as if they were from one or a few sources.

With visibility into long-term data, aggregation services can - and do - actually provide more intelligent outputs, such as real-time alerts on predefined error levels/codes, and stability- or security-oriented anomaly detection through pattern recognition, machine learning etc.

Resource Usage

Even with your application logic running like clockwork, the system could start failing if it is under-provisioned and runs out of resources; or become a budget-killer if over-provisioned.

Additionally, unusual patterns in resource usage may also indicate anomalies in your applications - such as attacks, misuse and other exploits.

What can we Measure?

  • memory usage
  • execution time or latency: given that FaaS invocations are under strict timeout constraints, time becomes a critical resource. You do not want your function to time-out before completing its job; but you also do not want it to remain hung indefinitely over a bad database connection that is going to take two minutes to time-out on its own.
  • compute power used: in many platforms, allocated compute power grows in proportion to memory, so the product of allocated memory and execution time is a good relative measure for the total compute power consumed by the request. In fact, most platforms actually bill you by GB-seconds, where GB refers to memory allocation; for example, a 512 MB function running for 2 seconds consumes 0.5 GB × 2 s = 1 GB-second.

Resource Isolation

Serverless invocations are isolated, which means one failure cannot affect another request. Sadly, it also means that each runtime instance should be able to handle the largest possible input/request on its own, as there is virtually no resource sharing across instances; one request handler cannot "borrow" memory from another, as is the case in monolithic apps with multiple services on the same runtime/VM.

This is bad in the sense that it denies you the luxury of maintaining shared resources (e.g. connection pools, memory-mapped file caches, etc.) pooled across requests. But at the same time, it means that managing and monitoring per-request resources becomes easier; each time you allocate or measure it, you are doing it for a single request.

From Logs

As mentioned before, serverless logs usually contain execution summaries stating the allocated vs. maximum memory usages, and exact vs. billed execution time.

From Runtime Itself

Your application runs as a Linux process, so it is always possible to grab resource usage data from the runtime itself - either via standard language/runtime calls (e.g. NodeJS process.memoryUsage()) or by directly introspecting OS-level semantics (e.g. /proc/self/stat). Some aspects like execution time are also provided by the serverless runtime layer - such as remaining time via context.getRemainingTimeInMillis() on Lambda.
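As a minimal sketch (Node.js on Lambda assumed), capturing both of these from inside the handler could look like this:

exports.handler = async (event, context) => {
    // memory footprint of the current runtime instance, in bytes
    const mem = process.memoryUsage();
    console.log(`rss=${mem.rss} heapUsed=${mem.heapUsed}`);

    // how much of the configured timeout is left for this invocation
    console.log(`remaining: ${context.getRemainingTimeInMillis()} ms`);

    // ... actual work goes here ...
    return {status: "ok"};
};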

From Serverless Platform Metrics

Most platforms keep track of resource metrics themselves. AWS CloudWatch Metrics is perhaps the best example.

While CloudWatch does not yet offer memory or compute graphs, third party tooling like SigmaDash can compute them on your behalf.

Platforms usually provide better metrics for non-FaaS systems than they do for FaaS; probably because of the same challenges they face in accurately monitoring the highly dynamic environment. But they are constantly upgrading their techniques, so we can always hope for better.

For non-compute services like storage (e.g. AWS S3), persistence (e.g. AWS DynamoDB) and API hosting (e.g. AWS API Gateway), platforms offer detailed metrics. The upside with these built-in metrics is that they offer different granularity levels, and basic filtering and summarization, out of the box. The downside, often, is that they are fairly non-real-time (generally several seconds behind the actual request).

Also, for obvious reasons, longer analysis periods mean lesser granularity. Although you can request higher-precision data, or report custom metrics at desired precisions, it will cost you more - and, at serverless scales, such costs can add up fairly quickly.

Invocations, Errors and Throttles

While logs and resource usages can reveal anomalies during execution, invocation, error and throttle counts are more straightforward serverless monitoring metrics, tied to the actual end results of application invocations. In that sense, resource monitoring can be treated as oriented more towards performance, whereas error/throttle monitoring leans more towards correctness, stability and scalability.

You can usually grab these numbers from the built-in metrics of the platform itself.
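On AWS, for instance, a rough sketch of pulling the last hour's invocation and error counts via the CloudWatch SDK could look like the following (the function name and region here are assumptions):

const AWS = require("aws-sdk");
const cw = new AWS.CloudWatch({region: "us-east-1"});

// sums up a Lambda metric ("Invocations", "Errors", "Throttles") over the last hour
const lastHourSum = (metric) => cw.getMetricStatistics({
    Namespace: "AWS/Lambda",
    MetricName: metric,
    Dimensions: [{Name: "FunctionName", Value: "my-function"}],
    StartTime: new Date(Date.now() - 3600 * 1000),
    EndTime: new Date(),
    Period: 3600,
    Statistics: ["Sum"]
}).promise().then(res => (res.Datapoints[0] || {}).Sum || 0);

Promise.all([lastHourSum("Invocations"), lastHourSum("Errors")])
    .then(([inv, err]) => console.log(`invocations: ${inv}, errors: ${err}`));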

Invocations-to-error Ratio

Similar to signal-to-noise ratio (SNR), this is a measure of how "efficient" or "robust" your application is; the higher it is, the more requests your application can successfully serve without running into an error.

Of course, the validity of this serverless monitoring metric depends on the actual nature and intent of your code, as well as the nature of inputs; if you receive lots of erratic or failure-prone inputs, and you are actually supposed to fail the invocation in such cases, this value would be naturally high.

Throttles

Presence of throttles could indicate one of a few things:

  • You have (perhaps mistakenly) throttled your function (or the API/interface) below its regular concurrency limit.
  • Your application is under a denial-of-service (DoS) attack.
  • Your application's scalability needs go beyond what the serverless platform can offer (maybe it has gone viral); you need to re-architect it to batch up invocations, introduce back-offs to safely deal with rejections, etc.

Note that throttling issues are not always limited to front-facing elements; if you are asynchronously triggering a function from an internal data stream (e.g. a Lambda with a Kinesis trigger) you could run into throttling, based on the dynamics of the rest of the application.

Also, throttling is not always in front of FaaS. If a third-party API that your application invokes (e.g. Amazon's own seller APIs) gets throttled, you could run into issues even without an apparent increase in the fronting traffic, events or invocation counts. Worse, it could lead to chain reactions or cascading failures - where throttling of one service can flow down the execution path to others, and eventually to the user-facing components.

Once you are on the upper ends of scalability, only a good combination of API knowledge, fail-safe measures and serverless monitoring techniques - logs, throttles and runtime latency measurements - can save you from such scenarios.

Instrumenting

So far, all the serverless monitoring techniques we discussed have been dependent on what the runtime and platform offer by default, or off-the-shelf. However, people find these inadequate for monitoring serious production applications where:

  • service health and performance are critical,
  • early detection of anomalies and threats is vital, and
  • alerts and recovery actions need to be as fine-grained and real-time as possible.

To get beyond what the platform offers, you usually need to instrument your runtime and grab additional serverless monitoring insights - typically via telemetry agents, code-level instrumentation, or wrapper libraries from third-party monitoring providers.

For FaaS, this obviously comes with implications on performance - your function runtime now has to spare some cycles to gather and report the metrics; in many cases, the tools also require you to modify your application code. Often these changes are subtle enough - such as importing a library, calling a third-party method, etc.; nevertheless, it goes against the developers' and dev-ops' pipe dream of code-free monitoring.

Realizing the demand for advanced serverless monitoring and diagnostics, cloud platforms themselves have also come forth with instrumentation options:

  • AWS X-Ray ships with Lambda by default, and can be enabled via a simple configuration with no code change. Once enabled, it can report the latencies of different invocation phases: initialization, individual AWS API/SDK calls, etc. With a bit of custom code it can also capture calls to any downstream HTTP service - which can come in handy when monitoring and troubleshooting latency issues.
  • Google Cloud Functions offers StackDriver Monitoring for advanced analytics via built-in timing, invocation and memory metrics; as well as custom metrics reporting and alerting.
  • Azure Functions has its own Azure Application Insights service offering performance analytics.

However, the best work related to instrumentation has come from third-party providers, such as the ones mentioned earlier. These usually require you to subscribe to their services, configure your serverless platforms to report metrics to theirs, and view the analysis results, insights and intelligence through their own dashboards.

Serverless Monitoring, up against Privacy and Security Concerns

The obvious problem with delegating serverless monitoring responsibilities to third parties is that you need to guarantee the safety of the data you are sharing; after all, it is your customers' data.

Even if you implicitly trust the third party, you need to take precautions to minimize the damage in case the monitoring data gets leaked - for the sake of your users as well as the security and integrity of your application.

  • Anonymizing and screening your logs for sensitive content,
  • using secure channels for metrics sharing, and
  • always sticking to the principle of least privilege during gathering, sharing and analysis of data

will get you started on a robust yet safe path to efficiently monitoring your serverless application.

Friday, November 29, 2019

Google Cloud has the fastest build now - a.k.a. hoo hoo, AWS!

SLAppForge Sigma cloud IDE - for an ultra-fast serverless ride! (courtesy of Pexels)

Building a deployable serverless code artifact is a key functionality of our SLAppForge Sigma cloud IDE - the first-ever, completely browser-based solution for serverless development. The faster the serverless build, the sooner you can get along with the deployment - and the sooner you get to see your serverless application up and running.

AWS was "slow" too - but then came Quick Build.

During the early days, back when we supported only AWS, our mainstream build was driven by CodeBuild. This had several drawbacks: it usually took 10-20 seconds for the build to complete, and it was rather repetitive - cloning the repo and downloading dependencies each time. Plus, you only get 100 free build minutes per month, so it added a bit of a cost - albeit small - to ourselves, as well as to our users.

Then we noticed that we only need to modify the previous build artifact in order to get the new code rolling. So I wrote a "quick build"; basically a Lambda that downloads the last build artifact, updates the zipfile with the changed code files, and re-uploads it as the current artifact. This was accompanied by a "quick deploy" that directly updates the code of affected functions, thereby avoiding the overhead of a complete CloudFormation deployment.

Then our ex-Adroit wizard Chathura built a test environment, and things changed drastically. The test environment (basically a warm Lambda, replicating the content of the user's project) already had everything; all code files and dependencies, pre-loaded. Now "quick build" was just a matter of zipping everything up from within the test environment itself, and uploading it to S3; just one network call instead of two.

GCP build - still in stone age?

When we introduced GCP support, the build was again based on their Cloud Build, a.k.a. Container Builder service. Although GCP did offer 3600(!) free build minutes per month (120 each day; see what I'm talking about, AWS?), theirs was generally slower than CodeBuild. So, for several months, Sigma's GCP support had the bad reputation of having the slowest build-deployment cycle.

But now, it is no longer the case.

Wait, what? It only needs code - no dependencies?

There's a very interesting characteristic of Cloud Functions:

When you deploy your function, Cloud Functions installs dependencies declared in the package.json file using the npm install command.

-Google Cloud Functions: Specifying dependencies in Node.js

This means, for deploying, you just have to upload a zipfile containing the sources and a dependencies file (package.json, requirements.txt and the like). No more npm install, or megabyte-sized bundle uploads.

But, the coolest part is...

... you can do it completely within the browser!

jszip FTW!

That awesome jszip package does it all for us, in just a couple lines:

let zip = new JSZip();

files.forEach(file => zip.file(file.name, file.code));

/*
a bit more complex, actually - e.g. for a nested file 'folder/file'
zip.folder(folder.name).file(file.name, file.code)
*/

let data = await zip.generateAsync({
 type: "string",
 streamFiles: true,
 compression: "DEFLATE"
});

We just zip up all code files in our project, plus the Node/npm package.json and/or Python/pip requirements.txt...

...and upload them to a Cloud Storage bucket:

let bucket = "your-bucket-name";
let key = "path/to/upload";

gapi.client.request({
 path: `/upload/storage/v1/b/${bucket}/o`,
 method: "POST",
 params: {
  uploadType: "media",
  name: key
 },
 headers: {
  "Content-Type": "application/zip",
  "Content-Encoding": "base64"
 },
 body: btoa(data)
}).then(res => {
 console.debug("GS upload successful", res);

 return {
  Bucket: res.result.bucket,
  Key: res.result.name
 };
});

Now we can add the Cloud Storage object path into our Deployment Manager template right away!

...
{
 "name": "goofunc",
 "type": "cloudfunctions.v1beta2.function",
 "properties": {
  "function": "goofunc",
  "sourceArchiveUrl": "gs://your-bucket-name/path/to/upload",
  "entryPoint": ...
 }
}

So, how fast is it - for real?

  1. jszip runs in-memory and takes just a few millis - as expected.
  2. If it's the first time after the IDE is loaded, the Google APIs JS client library takes a few seconds to load.
  3. After that, it's a single Cloud Storage API call - to upload our teeny tiny zipfile into our designated Cloud Storage bucket sigma-slappforge-{your Cloud Platform project name}-build-artifacts!
  4. If the bucket is not yet available, and the upload fails as a result, we have two more steps - create the bucket and then re-run the upload. This happens only once in a lifetime.

So for a routine serverless developer, skipping steps 2 and 4, the whole process takes around just one second - the faster your network, the faster it all is!

In comparison to AWS builds - where we have to first run a dependency sync and then a build (each of which is preceded by HTTP OPTIONS requests, thanks to CORS restrictions) - this is lightning fast!

(And yeah, this is one of those places where the googleapis client library shines; high above aws-sdk.)

Enough reading - let's roll!

I am a Google Cloud fan by nature - perhaps because my "online" life started with Gmail, and my "cloud dev" life started with Google Apps Script and App Engine. So I'm certainly at bias here.

Still, when you really think about it, Google Cloud is way simpler and far more organized than AWS. While this could be a disadvantage when it comes to advanced serverless apps - say, "how do I trigger my Cloud Function periodically?" - GCF is pretty simple, easy and fast. Very much so, when all you need is a serverless HTTP endpoint (webhook) or a bucket/queue consumer up and running in a few minutes.

And, when you do that with Sigma IDE, that few minutes could even drop down to a matter of seconds - thanks to the brand new quick build!

So, why waste time reading this - when you can just go and do it right away?!

Friday, April 20, 2018

Sigma the Serverless IDE: resources, triggers, and heck, operations

With serverless, you stopped caring about the server.

With Sigma, you stopped (or will stop, if not already) caring about the platform.

Now all you care about is your code - the bliss of every programmer.

Or is it?

I hold her (the code) in my arms.

If you have done time with serverless frameworks, you would already know how they take away your platform-phobia, abstracting out the platform-specific bits of your serverless app.

And if you have already tried out Sigma, you would have noticed how it takes things further, relieving you of the burden of the configuration and deployment aspects as well.

Sigma for a healthier dev life!

Leaving behind just the code.

Just the beautiful, raw code.

So, what's the catch?

Okay. Now for the untold, unspoken, not-so-popular part.

You see, my friend, every good thing comes at a price.

Lucky for you, with Sigma, the price is very affordable.

Just a matter of sticking to a few ground rules while you develop your app. That's all.

Resources, resources.

All of Sigma's voodoo depends on one key thing: resources.

Resources, resources!

The concept is quite simple: every piece of your serverless app - be it a DynamoDB table, S3 bucket or SNS topic - is a resource from Sigma's point of view.

If you remember the Sigma UI, the Resources pane on the left contains different resource types that you can have in your serverless app. (True, it's pretty short; but we're working on it :))

Resources pane in Sigma UI

Behind the scenes

When you drag a resource from this pane into your code, Sigma secretly creates a resource (which it would later deploy into your serverless provider) to track the configurations of the actual service entity (say, the S3 bucket that should exist in AWS by the time your function is running) and all its usages within your code. The tracking is fully automated; frankly, you don't even need to know about it.

Sigma tracks resources in your app, and deploys them into the underlying platform!

"New" or "existing"?

On almost all of Sigma's resource configuration pop-ups, you may have noticed two options: "new" vs "existing". "New" resources are the ones that would be (or have already been) created as a result of your project, whereas "existing" ones are those which have been created outside of your project.

Now that's a tad bit strange because we would usually use "existing" to denote things that "exist", regardless of their origin - even if they came from Mars.

Better brace yourself, because this also gives rise to a weirder notion: once you have deployed your project, the created resources (which now "exist" in your account) are still treated by Sigma as "new" resources!

And, as if that wasn't enough, this makes the resources lists in Sigma behave in totally unexpected ways; after you define a "new" resource, whenever you want to reuse that resource somewhere else, you would have to look for it under the "existing" tab of the resource pop-up; but it will be marked with a " (new)" prefix because, although it is already defined, it remains "new" from Sigma's point of view.

Now, how sick is that?!

Bang head here.

Perhaps we should have called them "Sigma" resources; or perhaps even better, "project" resources; while we scratch our heads, feel free to chip in and help us with a better name!

Rule o' thumb

Until this awkwardness is settled, the easiest way to get through this mess is to stick to this rule of thumb:


If you added a resource to your current Sigma project, Sigma would treat it as a "new" resource till the end of eternity.


Bottom line: no worries!

Being able to use existing resources is sometimes cool, but it means that your project would be much less portable. Sigma will always assume that the resources referenced by your project are already in existence, regardless of which AWS account you attempt to deploy it into. At least until (if) we (ever) come up with a different resource management mechanism.


If you want portability, always stick to new resources. That way, even if a complete stranger gets hold of your project and deploys it in his own, alien, unheard-of AWS account, the project would still work.

If you are integrating with an already existing set of resources (e.g. the S3 buckets in your already-running dev/test/prod environment), using existing resources is the obvious (and the most convenient) choice.


Anyways, back to our discussion:

Where were we?

Ah, yes. Resources.

The secret life of resources

In a serverless app, you basically use resources for two things:

  • for triggering the app (as an event source, a.k.a. trigger)
  • for performing work inside the app, such as invoking external services

triggers and operations

Resources? Triggers?? Operations???

Sigma also associates its resources with your serverless app in a similar fashion:

In Sigma, a function can have several triggers (as long as the function code itself is written to handle the different trigger event types!), and can contain several operations (obviously).

Yet, they're different.

It is noteworthy that a resource itself is not a trigger or an operation; triggers and operations are associated with resources (they kind of "bridge" functions and resources) but a resource has its own independent life. As a result, a resource can power many triggers (to be precise, zero or more) and get involved in many operations, across many (again, zero or more) functions.

A good example is S3. If you want to write an image resizer function that would pick and process images dropped into an S3 bucket, you would configure an S3 trigger to invoke the function upon the file drop, and an S3 GetObject operation to retrieve and process the file; however, both will point to the same S3 resource, namely the bucket where images are being dropped into and fetched from.
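To make that concrete, here's a bare-bones sketch of such a handler (plain aws-sdk code, leaving out the actual resizing and any Sigma specifics):

const AWS = require("aws-sdk");
const s3 = new AWS.S3();

exports.handler = async (event) => {
    // the S3 trigger tells us which object was just dropped into the bucket
    const rec = event.Records[0].s3;
    const bucket = rec.bucket.name;
    const key = decodeURIComponent(rec.object.key.replace(/\+/g, " "));

    // the S3 GetObject "operation" fetches that same object for processing
    const image = await s3.getObject({Bucket: bucket, Key: key}).promise();

    // ... resize image.Body and write the result wherever it needs to go ...
};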

Launch time!

At deployment, Sigma will take care of putting the pieces together - trigger configs, runtime permissions and whatnot - based on which function is associated with which resources, and in which ways (trigger-mode vs operation-mode). You can simply drag, drop and configure your buckets, queues and stuff, write your code, and totally forget about the rest!

That's the beauty of Sigma.

When a resource is "abandoned" (meaning that it is not used in any trigger or operation), it shows up in the "unused resources" list (remember the dustbin button on the toolbar?) and can be removed from the project; remember that if you do this, provided that the resource is a "new" one (rule of thumb: one created in Sigma), it will be automatically removed from your serverless provider account (for example, AWS) during your next deployment!

So there!

If Sigma's resource model (the whole purpose of this article) looks like a total mess-up to you, feel free to raise your voice on StackOverflow - or better still, our GitHub space, FB page or Twitter feed; we would appreciate it very much!

Of course, Sigma has nothing to hide; if you check your AWS account after a few Sigma deployments, you would realize the things we have been doing under the hood.

All of it, to make your serverless journey as smooth as possible.

And easy.

And fun. :)

Welcome to the world of Sigma!

Thursday, April 19, 2018

Sigma QuickBuild: Towards a Faster Serverless IDE

TL;DR

The QuickBuild/QuickDeploy feature described here is pretty much obsoleted by the test framework (ingeniously hacked together by @CWidanage), which gives you a much more streamlined dev-test experience with much better response time!


In case you hadn't noticed, we have recently been chanting about a new Serverless IDE, the mighty SLAppForge Sigma.

With Sigma, developing a serverless app becomes as easy as drag-drop, code, and one-click-Deploy; no getting lost among overcomplicated dashboards, no eternal struggles with service entities and their permissions, no sailing through oceans of docs and tutorials - above all that, nothing to install (just a web browser - which you already have!).

So, how does Sigma do it all?

In case you already tried Sigma and dug a bit deeper than just deploying an app, you may have noticed that it uses AWS CodeBuild under the hood for the build phase. While CodeBuild gives us a fairly simple and convenient way of configuring and running builds, it has its own set of quirks:

  • CodeBuild takes a significant time to complete (sometimes close to a minute). This may not be a problem if you just deploy a few sample apps, but it can severely impair your productivity - especially when you begin developing your own solution, and need to reflect your code updates every time you make a change.
  • The AWS Free Tier only includes 100 minutes of CodeBuild time per month. While this sounds like a generous amount, it can expire much faster than you think - especially when developing your own app, in your usual trial-and-error cycles ;) True, CodeBuild doesn't cost much either ($0.005 per minute of build.general1.small), but why not go free while you can? :)

Options, people?

Lambda, on the other hand, has a rather impressive free quota of 1 million executions and 3.2 million seconds of execution time per month. Moreover, traffic between S3 and Lambda is free as far as we are concerned!

Oh, and S3 has a free quota of 20000 reads and 2000 writes per month - which, with some optimizations on the reads, is quite sufficient for what we are about to do.

2 + 2 = ...

So, guess what we are about to do?

Yup, we're going to update our Lambda source artifacts in S3, via Lambda itself, instead of CodeBuild!

Of course, replicating the full CodeBuild functionality via a lambda would need a fair deal of effort, but we can get away with a much simpler subset; read on!

The Big Picture

First, let's see what Sigma does when it builds a project:

  • prepare the infra for the build, such as a role and an S3 bucket, skipping any that already exist
  • create a CodeBuild project (or, if one already exists, update it to match the latest Sigma project spec)
  • invoke the project, which will:
    • download the Sigma project source from your GitHub repo,
    • run an npm install to populate its dependencies,
    • package everything into a zip file, and
    • upload the zip artifact to the S3 bucket created above
  • monitor the project progress, and retrieve the URL of the uploaded S3 file when done.

And usually every build has to be followed by a deployment, to update the lambdas of the project to point to the newly generated source archive; and that means a whole load of additional steps!

  • create a CloudFormation stack (if one does not exist)
  • create a changeset that contains the latest updates to be published
  • execute the changeset, which will, at the least, have to:
    • update each of the lambdas in the project to point to the new source zip file generated by the build, and
    • in some cases, update the triggers associated with the modified lambdas as well
  • monitor the stack progress until it gets through with the update.

All in all, well over 60-90 seconds of your precious time - all to accommodate perhaps just one line (or how about one word, or one letter?) of change!

Can we do better?

At first glance, we see quite a few redundancies and possible improvements:

  • Cloning the whole project source from scratch is overkill, especially when only a few lines/files have changed.
  • Every build will download and populate the NPM dependencies from scratch, consuming bandwidth, CPU cycles and build time.
  • The whole zip file is now being prepared from scratch after each build.
  • Since we're still in dev, running a costly CF update for every single code change doesn't make much sense.

But since CodeBuild invocations are stateless and CloudFormation's resource update logic is mostly out of our hands, we don't have the freedom to meddle with many of the above; other than simple improvements like enabling dependency caching.

Trimming down the fat

However, if we have a lambda, we have full control over how we can simplify the build!

If we think about 80% - or maybe even 90% - of the cases for running a build, we see that they merely involve changes to application logic (code); you don't add new dependencies, move your files around or change your repo URL all the time, but you sure as heck would go through an awful lot of code edits until your code starts behaving as you expect it to!

And what does this mean for our build?

80% - or even 90% - of the time, we can get away by updating just the modified files in the lambda source zip, and updating the lambda functions themselves to point to the updated file!

Behold, here comes QuickDeploy!

And that's exactly what we do, with the QuickBuild/QuickDeploy feature!

Lambda to the rescue!

QuickBuild uses a lambda (deployed in your own account, to eliminate the need for cross-account resource access) to:

  • fetch the latest CodeBuild zip artifact from S3,
  • patch the zip file to accommodate the latest code-level changes, and
  • upload the updated file back to S3, overriding the original zip artifact

Once this is done, we can run a QuickDeploy which simply sends an UpdateFunctionCode Lambda API call to each of the affected lambda functions in your project, so that they can scoop up the latest and greatest of your serverless code!

And the whole thing does not take more than 15 seconds (give or take the network delays): a raw 4x improvement in your serverless dev workflow!

A sneak peek

First of all, we need a lambda that can modify an S3-hosted zip file based on a given set of input files. While it's easy to make with NodeJS, it's even easier with Python, and requires zero external dependencies as well:

Here we go... Pythonic!

import boto3

from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED

s3_client = boto3.client('s3')

def handler(event, context):
  src = event["src"]
  if src.find("s3://") > -1:
    src = src[5:]
  
  bucket, key = src.split("/", 1)
  src_name = "/tmp/" + key[(key.rfind("/") + 1):]
  dst_name = src_name + "_modified"
  
  s3_client.download_file(bucket, key, src_name)
  zin = ZipFile(src_name, 'r')
  
  diff = event["changes"]
  zout = ZipFile(dst_name, 'w', ZIP_DEFLATED)
  
  added = 0
  modified = 0
  
  # files that already exist in the archive
  for info in zin.infolist():
    name = info.filename
    if (name in diff):
      modified += 1
      zout.writestr(info, diff.pop(name))
    else:
      zout.writestr(info, zin.read(info))
  
  # files in the diff, that are not on the archive
  # (i.e. newly added files)
  for name in diff:
    info = ZipInfo(name)
    info.external_attr = 0755 << 16L  # rwxr-xr-x permissions on the new entry (Python 2 octal/long syntax)
    added += 1
    zout.writestr(info, diff[name])
  
  zout.close()
  zin.close()
  
  s3_client.upload_file(dst_name, bucket, key)
  return {
    'added': added,
    'modified': modified
  }

We can directly invoke the lambda using the Invoke API, hence we don't need to define a trigger for the function; just a role with S3 full access permissions would do. (We use full access here because we would be reading from/writing to different buckets at different times.)

CloudFormation, you beauty.

From what I see, the coolest thing about this contraption is that you can stuff it all into a single CloudFormation template (remember the lambda command shell?) that can be deployed (and undeployed) in one go:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  zipedit:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: zipedit
      Handler: index.handler
      Runtime: python2.7
      Code:
        ZipFile: >
          import boto3
          
          from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED
          
          s3_client = boto3.client('s3')
          
          def handler(event, context):
            src = event["src"]
            if src.find("s3://") > -1:
              src = src[5:]
            
            bucket, key = src.split("/", 1)
            src_name = "/tmp/" + key[(key.rfind("/") + 1):]
            dst_name = src_name + "_modified"
            
            s3_client.download_file(bucket, key, src_name)
            zin = ZipFile(src_name, 'r')
            
            diff = event["changes"]
            zout = ZipFile(dst_name, 'w', ZIP_DEFLATED)
            
            added = 0
            modified = 0
            
            # files that already exist in the archive
            for info in zin.infolist():
              name = info.filename
              if (name in diff):
                modified += 1
                zout.writestr(info, diff.pop(name))
              else:
                zout.writestr(info, zin.read(info))
            
            # files in the diff, that are not on the archive
            # (i.e. newly added files)
            for name in diff:
              info = ZipInfo(name)
              info.external_attr = 0755 << 16L
              added += 1
              zout.writestr(info, diff[name])
            
            zout.close()
            zin.close()
            
            s3_client.upload_file(dst_name, bucket, key)
            return {
                'added': added,
                'modified': modified
            }
      Timeout: 60
      MemorySize: 256
      Role:
        Fn::GetAtt:
        - role
        - Arn
  role:
    Type: AWS::IAM::Role
    Properties:
      ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      - arn:aws:iam::aws:policy/AmazonS3FullAccess
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
        - Action: sts:AssumeRole
          Effect: Allow
          Principal:
            Service: lambda.amazonaws.com

Moment of truth

Once the stack is ready, we can start submitting our QuickBuild requests to the lambda!

// assuming auth stuff is already done
let lambda = new AWS.Lambda({region: "us-east-1"});

// ...

lambda.invoke({
  FunctionName: "zipedit",
  Payload: JSON.stringify({
    src: "s3://bucket/path/to/archive.zip",
    changes: {
      "path/to/file1/inside/archive": "new content of file1",
      "path/to/file2/inside/archive": "new content of file2",
      // ...
    }
  })
}, (err, data) => {
  let result = JSON.parse(data.Payload);
  let totalChanges = result.added + result.modified;
  if (totalChanges === expected_no_of_files_from_changes_list) {
    // all izz well!
  } else {
    // too bad, we missed a spot :(
  }
});

Once QuickBuild has completed updating the artifact, it's simply a matter of calling UpdateFunctionCode on the affected lambdas, with the S3 URL of the artifact:

lambda.updateFunctionCode({
  FunctionName: "original_function_name",
  S3Bucket: "bucket",
  S3Key: "path/to/archive.zip"
})
.promise()
.then(() => { /* done! */ })
.catch(err => { /* something went wrong :( */ });

(In our case the S3 URL remains unchanged (because our lambda simply overwrites the original file), but it still works because the Lambda service makes a copy of the code artifact when updating the target lambda.)

To speed up the QuickDeploy for multiple lambdas, we can even parallelize the UpdateFunctionCode calls:

Promise.all(
  lambdaNames.map(name =>
    lambda.updateFunctionCode({ /* params */ })
    .promise()
    .then(() => { /* done! */ })))
.then(() => { /* all good! */ })
.catch(err => { /* failures; handle them! */ });

And that's how we gained an initial 4x improvement in our lambda deployment cycle, sometimes even faster than the native AWS Lambda console!