Wednesday, June 24, 2020

Get Notified on Slack for your CloudWatch/Lambda Error Logs: The Sigma Way

Originally written for The SLAppForge Blog; Jun 22, 2020

Slack is already the de-facto "channel" for team collaboration as well as low-to-moderate scale event monitoring and notifications. When we started working on the world's first serverless MFT platform, we decided we should go down the Slack path for first-level monitoring; similar to how we already do it efficiently with our conventional AS2 Gateway.

Not really real-time - and no need to be

When you mention Slack, real-time usually comes to mind; however, when it comes to monitoring, real-time often becomes a PITA - especially when a new user is trying out our platform and making tons of mistakes, or when an AS2 partner endpoint goes down and our outgoing messages start failing by the dozen every second (all of them are accounted for, so there's really no need - and nothing - to do on our end).

Of course, the correct way to handle these critical ("database is down!") vs. non-critical (temporary endpoint failure, like above) errors is via proper prioritization on the reporting end. However, until we have such a mechanism in place, we wanted to receive batched summaries of errors, at a high-enough frequency that we could still act on them in time (before the user abandons us and moves on!).

Serverless logging: a natural candidate for batched reporting

In AS2 Gateway we haven't had much luck with batching (yet), because the Slack publisher is a standard Log4J2 appender. If we wanted to batch things up, we would have to do it ourselves within the appender - bringing in queueing, and dragging in a host of other issues along with it.

But in MFT Gateway everything is distributed and abstracted out - Lambda, API Gateways, Step Functions, and so on. There is no badass "Slack appender"; all logs go into one sink, CloudWatch, neatly arranged across log groups and streams.

So we needed to come up with a relatively passive system (unlike before, where the appender - part of the application itself - did the alerting for us as soon as the events occurred). Either CloudWatch has to push events to us, or we have to pull events from it.

Some native, push-based alternatives - that (alas) didn't work

In fact, we did initially research possible ways to get CloudWatch - or one of its supported sinks, a.k.a. downstreams - to push alerts to us; but that didn't turn out as well as we had hoped:

  • CloudWatch Alarms are state-transition-based; once the alarm "rings" (goes red) and fires an alert, it won't ring again until the system returns to an "okay" (green) state. The obvious caveat is that, if multiple errors occur over a short period of time, we would get notified only for the first one. Consequently, if we did not want to risk missing any errors, we would have to keep on monitoring the logs after each alarm - until the alarm itself goes back to green. (We could set up multiple alarms for different kinds of errors; but that's not very scalable as the application evolves - and not budget-friendly at all, at $0.30 per high-resolution alarm.)
  • Triggering a Lambda Function via CloudWatch Logs was also an option. But it didn't make much sense (from either a scalability or a financial perspective) because it didn't provide a sufficient batching scope (let's say it was "too real-time"); if our original (application) Lambda produced 100 error logs within a minute, we could potentially end up with 100 monitoring Lambda invocations! (And I'm pretty sure it wasn't offering filtering either, the last time we checked 🤔 - which would have been a disaster, needless to say.)
  • Draining Logs to Kinesis was another option; however this involves the hourly charges of a Kinesis stream, in addition to the need for an actual processing component (perhaps a Lambda, again) to get alerts into Slack.
  • Aggregating Logs into Elasticsearch was the final one; obviously we didn't want to take that path, because it would shamefully add a server component to our serverless platform (true, with AWS it's a managed service; but having to pay an hourly fee took that feeling away from us 😎); besides, running Elasticsearch just to get error alerts sounded a bit overkill as well.

Lambda pull: final call

So we ended up with an old-school, pull-based approach - with the Lambda running once every few minutes, polling a set of log groups for WARN and ERROR logs over the last few minutes (incidentally, since the previous run of the same Lambda), and sending these (if any) to our Slack channel in one big batched-up "report" message.

Good news: you can deploy it on your own AWS account - right now!

We ate our own dog food: the Sigma IDE - the one and only serverless IDE, purpose-built for serverless developers. We kept all the necessary configuration parameters in environment variables, so that you can simply fill them in with values matching your own infra - and deploy it to your own AWS account in one click.

But the best part is...

...you don't need anything else to do it!

I bet you're reading this article in a web browser; so:

  1. fire up Sigma in the next tab,
  2. log in and provide your AWS credentials, if not already done,
  3. open the ready-made project by simply entering its GitHub URL https://github.com/janakaud/cloudwatch-slack-alerts,
  4. fill in the environment variables, and
  5. click Deploy Project!

Zero tools to download, packages/dependencies to install, commands to run, config files to edit, bundles to zip-up and upload, ...

But for the inquisitive reader, doesn't that kill the fun part?

Even so, some of you might be curious as to how it all works. Curiosity may have killed the cat, but after all, that's what makes us builders and developers; wouldn't you say? 😎

The Making-of: a ground-up overview of the Slack alert reporter

First, a Word of Warning

The permission customization feature used here may not be available for free accounts. However, as mentioned before, you can still open the ready-made project from GitHub and deploy it right away - and tweak _any other_ aspect (code, libraries, trigger frequency, function configs, etc.) - using a free Sigma account. We do have plans to open up the permission manager and template editor to the free tier; so, depending on when you are reading this, you may be lucky!

If you need to "just get the damn thing deployed", skip back to the Good news: section - right away!

Preparation

  1. Integrate the Incoming Webhooks app with your Slack workspace.
  2. When integration is complete, grab the webhook integration URL from the final landing page.
  3. Create a suitable Slack channel to receive your alerts.
  4. Make a list of log group names that you need to check for alert-worthy events.
  5. Decide on a pattern to check (filter) the logs for alerts - you can use any of the standard syntaxes supported by the CloudWatch Logs API (see the quick sanity check right after this list).
  6. Decide how frequently you want to poll the log groups - this will go into the Lambda trigger as well as a code-level "look-behind" parameter which we'll use when calling the API.
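If you want to try out a candidate pattern (step 5 above) before wiring anything up, a quick standalone script like the one below would do. This is just a sketch using the aws-sdk v2 CloudWatchLogs client (the same API the Lambda will use later); the log group name is a placeholder, and you'll need AWS credentials and a region configured in your environment.

// quick, standalone check of a filter pattern against one log group (aws-sdk v2)
const AWS = require("aws-sdk");
const logs = new AWS.CloudWatchLogs();

logs.filterLogEvents({
	logGroupName: "/aws/lambda/my-app-function",   // placeholder - use one of your own groups
	filterPattern: "?ERROR ?WARN",                 // "at least one of ERROR and WARN"
	startTime: Date.now() - 60 * 60 * 1000,        // look back 1 hour
	limit: 10
}).promise()
	.then(data => data.events.forEach(e => console.log(e.message)))
	.catch(console.error);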

The plan

  1. Set up a CloudWatch Events scheduled trigger to run the Lambda at the desired frequency.
  2. Poll each log group in parallel for new logs matching our pattern, within the look-behind window - which should ideally be the same as the trigger period (e.g. if your Lambda runs once every 5 minutes, checking the last 5 minutes of logs in each cycle should suffice).
  3. Filter out empty results and build a string "report" with events categorized by log group name.
  4. Post the report to the Slack channel via the webhook.

Before you start coding

  1. Sign in to Sigma IDE and create a new NodeJS AWS project. (We could have used Python as well, but would have had to handle _parallelism_ ourselves.)
  2. If you don't like the default file name (which gets opened as soon as you land in the editor), change it - and its Function Name - to something else.
  3. Unless you like the native http module, add a dependency to make the webhook call; we picked axios.

Environment variables

Add a few of 'em, as follows:

  • POLL_PERIOD_MS: the "look-behind" period for the log query (in millis) - for the current implementation, it should be the same as what you set for the period of the timer trigger (below).
  • LOG_GROUPS: a space-separated list of log groups that you need to check; if a group name is not prefixed with a namespace (e.g. /aws/apigateway/), it will default to the /aws/lambda/ namespace
  • LOG_PATTERN: the search pattern to "filter in" the alert-worthy logs; ?ERROR ?WARN ("at least one of ERROR and WARN") could be good enough to capture all errors and warnings (depends on your internal logging formats, of course)
  • SLACK_WEBHOOK_URL: speaks for itself; the webhook URL you grabbed during Preparation
  • SLACK_CHANNEL: again, trivial; the "hashname" (maybe #bada_boom) of the channel you created earlier
  • SLACK_USER: the name of the user (bot) that the alerts would appear to be coming from
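Just to make the shape concrete, a hypothetical set of values (everything below is made up) could look like:

POLL_PERIOD_MS     = 300000        (5 minutes, matching a rate(5 minutes) trigger)
LOG_GROUPS         = my-app-func /aws/apigateway/my-api
LOG_PATTERN        = ?ERROR ?WARN
SLACK_WEBHOOK_URL  = https://hooks.slack.com/services/XXXXXXXXX/YYYYYYYYY/ZZZZZZZZZZZZ
SLACK_CHANNEL      = #bada_boom
SLACK_USER         = cloudwatch-reporter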

There are other cool features supported by the Incoming Webhooks integration; a few small tweaks in the webhook-invocation part, and you could be swimming in 'em right away.
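For example, swapping the plain text for a colour-coded attachment only needs a slightly different payload to be POSTed to the webhook URL - something like this (a sketch; attachments are a standard Incoming Webhooks feature, not something the project uses as-is, and the channel/username values here are made up):

{
  "channel": "#bada_boom",
  "username": "cloudwatch-reporter",
  "icon_emoji": ":ghost:",
  "attachments": [
    { "color": "danger", "text": "the batched-up error report goes here" }
  ]
}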

Except for the first one, you may want to prevent the values of these variables from being persisted into version control when you save the project; in Sigma, you can make their values non-persistent with a simple flip of a switch.

When you reopen the project after a decade, Sigma will automagically pull them in from the deployed Lambdas and populate the values for you - so you won't need to dig through your notes (or rack your brains) to recall them either!

The timer (CloudWatch scheduled) trigger

  1. Drag-n-drop a CloudWatch Events trigger from the Resources pane on the left, on to the event variable of the function header.
  2. Under the Schedule tab of the New Rule configuration, enter the desired run frequency as a rate expression (e.g. rate(5 minutes); see the quick mapping right after these steps). You can also use a cron expression if desired, but it may be a bit trickier to compute the look-behind window in that case.
  3. Click Inject.
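One thing to keep consistent: the trigger period you enter here and the POLL_PERIOD_MS environment variable should describe the same window. For example:

rate(5 minutes)   <->  POLL_PERIOD_MS = 5 * 60 * 1000  = 300000
rate(15 minutes)  <->  POLL_PERIOD_MS = 15 * 60 * 1000 = 900000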

And now, the cool stuff - the code!

Let's quickly go through the salient bits and pieces:

Poll the log groups asynchronously

This transforms any returned log events (matching our time range and filter pattern) into a formatted "report string":

logGroupName

some log line matching ERROR or WARN (in the group's own logging format)

another line

another one

...

and returns it (or null if nothing was found):

// if not namespaced, add the default Lambda prefix
let logGroupName = g.indexOf("/") < 0 ? `/aws/lambda/${g}` : g;

let msg, partial = true;
try {
	// fetch up to 100 events matching the pattern, since startTime (the look-behind point)
	let data = await logs.filterLogEvents({
		logGroupName,
		filterPattern,
		startTime,
		limit: 100
	}).promise();
	// trim each event to 1000 chars and separate events with blank lines
	msg = data.events.map(e => e.message.substring(0, 1000).trim()).join("\n\n");
	// a nextToken means there were more matches than we fetched; flag the report as partial
	partial = !!data.nextToken;
} catch (e) {
	msg = `Failed to poll ${g}; ${e.message}`;
}
return msg.length > 0 ? `\`${g}${partial ? " [partial]" : ""}\`

\`\`\`
${msg}
\`\`\`` : null;
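Note that the snippet above settles for the first 100 matches and just flags the report as "[partial]" when a nextToken comes back. If you ever need every matching event, you could page through instead - a rough sketch (not part of the original project):

// paginate through all matching events instead of stopping at the first page
let events = [], nextToken;
do {
	let page = await logs.filterLogEvents({
		logGroupName, filterPattern, startTime, limit: 100, nextToken
	}).promise();
	events = events.concat(page.events);
	nextToken = page.nextToken;
} while (nextToken);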

Fan-out and aggregate across log groups

We can poll each log group independently to save time, and merge everything into a final, time-prefixed report - ready to post to Slack:

let checks = groups.map(async g => {
	// the per-group polling code we saw above
});

return await Promise.all(checks)
	.then(msgs => {
		// drop groups that had nothing to report
		let valid = msgs.filter(m => !!m);
		if (valid.length > 0) {
			// return the promise, so the Lambda waits for the Slack post to finish
			return postMsg(`*${timeStr}*

${valid.join("\n\n")}`);
		}
	})
	.catch(e => postMsg(`*FAULT* ${timeStr}

\`\`\`
${e.message}
\`\`\``));
}; // end of the async handler

Post it to the Slack webhook

const postMsg = text => {
	// POST the report to the Incoming Webhook; channel/username/icon override the integration defaults
	return axios.post(hook, {
		channel, username,
		text,
		icon_emoji: ":ghost:"
	});
};
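As a quick smoke test (assuming the hook, channel and username values are already in place), you could call it directly and watch for Slack's "ok" response:

// one-off test of the webhook integration
postMsg("Hello from the CloudWatch log reporter :wave:")
	.then(res => console.log(res.status, res.data))          // Slack replies 200 "ok" on success
	.catch(e => console.error("webhook call failed:", e.message));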

Putting it all together

Throw in the imports, client declarations, a bit of sanity (esp. around environment variable loading), and some glue - and you have your full-blown Lambda!
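Purely for illustration, here's roughly what the assembled function looks like, condensed from the snippets above. The repo version has a bit more sanity checking and differs in details, so treat this as an approximation rather than the real thing:

// a condensed approximation of the whole function (aws-sdk v2 + axios)
const AWS = require("aws-sdk");
const axios = require("axios");

const logs = new AWS.CloudWatchLogs();

// environment variable loading - the "bit of sanity" mentioned above
const hook = process.env.SLACK_WEBHOOK_URL;
const channel = process.env.SLACK_CHANNEL;
const username = process.env.SLACK_USER;
const filterPattern = process.env.LOG_PATTERN;
const groups = (process.env.LOG_GROUPS || "").split(" ").filter(g => g.length > 0);
const pollPeriodMs = parseInt(process.env.POLL_PERIOD_MS, 10);

const postMsg = text => axios.post(hook, {
	channel, username,
	text,
	icon_emoji: ":ghost:"
});

exports.handler = async (event) => {
	const startTime = Date.now() - pollPeriodMs;
	const timeStr = new Date().toISOString();
	console.log(`Checking for logs since ${new Date(startTime).toISOString()}`);

	let checks = groups.map(async g => {
		// if not namespaced, add the default Lambda prefix
		let logGroupName = g.indexOf("/") < 0 ? `/aws/lambda/${g}` : g;
		let msg, partial = true;
		try {
			let data = await logs.filterLogEvents({
				logGroupName, filterPattern, startTime, limit: 100
			}).promise();
			msg = data.events.map(e => e.message.substring(0, 1000).trim()).join("\n\n");
			partial = !!data.nextToken;
		} catch (e) {
			msg = `Failed to poll ${g}; ${e.message}`;
		}
		return msg.length > 0 ? `\`${g}${partial ? " [partial]" : ""}\`\n\n\`\`\`\n${msg}\n\`\`\`` : null;
	});

	return await Promise.all(checks)
		.then(msgs => {
			let valid = msgs.filter(m => !!m);
			if (valid.length > 0) {
				return postMsg(`*${timeStr}*\n\n${valid.join("\n\n")}`);
			}
		})
		.catch(e => postMsg(`*FAULT* ${timeStr}\n\n\`\`\`\n${e.message}\n\`\`\``));
};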

See it on GitHub - in all its raw glamour.

CloudWatch Logs IAM permissions

Since our Lambda is going to access the CloudWatch Logs API (for which Sigma does not auto-generate permissions yet), we need to add a permission entry to the Lambda's IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": [
        {
          "Fn::Sub": "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*"
        }
      ],
      "Action": [
        "logs:FilterLogEvents"
      ]
    }
  ]
}

In this case we've granted access to all log groups; depending on your monitoring scope you may be able to further narrow it down using the ARN pattern on the Resource entry.
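For instance, if all the monitored groups happen to be Lambda log groups sharing a common prefix, the Resource could be tightened to something like the following (the prefix here is hypothetical):

{
  "Fn::Sub": "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/my-app-*"
}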

If you're following along on the free tier, the Custom Permissions tab may not be editable for you - depending on when you're reading this. Regardless, as mentioned before, that won't prevent you from opening, modifying and deploying the already-tuned-up project from GitHub!

Deploy it!

That one's easy - just click the Deploy Project button on the IDE toolbar!

Within a few seconds, Sigma will show you a deployment progress dialog, and give the green light ("successfully deployed") shortly afterwards - usually in under a minute.

Okay, the Slack reporter is live. And now...

All that's left to do is to sit back and relax - while the Lambda churns your CloudWatch logs and alerts you of any fishy stuff that it finds!

Bonus: some troubleshooting tips, just in case

If you don't receive anything on Slack (even when clearly there are error logs within the last few minutes), or notice some other weird behavior (like duplicate/repeated alerts), it may be time to check the logs of the log-polling Lambda itself 😎

You can just pop open the in-IDE SigmaTrail log viewer and view the latest execution logs ([PROD]) of the Lambda, or use the CLI (aws logs filter-log-events --log-group-name /aws/lambda/YourLambdaNameHere) or the web console to view 'em officially from the AWS side.

The default code will log a line during each run, indicating the starting timestamp (from which point in time it is checking for new logs, within the current cycle); this could be handy to determine when the function has run, and what "log ground" it has covered.

Stay in touch!

Lambda is cool, and serverless is super-cool - given all the things you can accomplish, even without using Lambda!

So stay tuned for more serverless goodies - from the pioneers in serverless dev-tools 😎
