[Cloud Architect] 3. Monitor, React, and Recover

AWS provides robust monitoring capabilities for its services. This is vital for understanding how your systems are performing, not just at the moment but also over time. CloudWatch Metrics tracks metrics on AWS services. Any metric that AWS makes available is presented via CloudWatch. You can also publish your own metrics as "custom metrics". Putting a variety of related metrics on a CloudWatch Dashboard is an effective way to gain visibility into your system without spending a lot of time doing it. Investing in understanding which metrics are available on the services you use, what each metric means, and what your usage looks like is imperative to running a highly available platform.
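As a rough illustration of a custom metric, the sketch below builds one data point and hands it to the CloudWatch PutMetricData call. The namespace "MyApp", the metric name "QueueDepth", and the "Service" dimension are made-up examples, and the `cloudwatch` argument is assumed to be a client such as `boto3.client("cloudwatch")`.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, timestamp=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }
    if dimensions:
        # CloudWatch expects dimensions as a list of Name/Value pairs.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return datum


def publish_custom_metric(cloudwatch, namespace, datum):
    """Send one data point. `cloudwatch` is assumed to be a boto3 CloudWatch client."""
    return cloudwatch.put_metric_data(Namespace=namespace, MetricData=[datum])


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   datum = build_metric_datum("QueueDepth", 12, dimensions={"Service": "orders"})
#   publish_custom_metric(boto3.client("cloudwatch"), "MyApp", datum)
```

Keeping the payload builder separate from the API call makes it easy to check what you are about to send before anything reaches AWS.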

Alerting

Proper alerting will help you keep tabs on your systems and will help you meet your SLAs. Alerting in ways that bring attention to important issues will keep everyone informed and prevent your customers from being the ones to inform you of problems. CloudWatch Alarms integrates with CloudWatch Metrics. Any metric in CloudWatch can be used as the basis for an alarm. When an alarm changes state, it can send a notification to an SNS topic, and from there you have a variety of options for distributing information, such as email, text message, Lambda invocation, or third-party integration.
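One way to wire a metric to an SNS topic is sketched below. It only assembles the arguments for the CloudWatch PutMetricAlarm call. The alarm name, namespace, metric, and topic ARN are placeholders, and the 5-minute period with 3 evaluation periods is an assumed starting point rather than a recommendation.

```python
def build_alarm(name, namespace, metric, threshold, sns_topic_arn,
                period_seconds=300, evaluation_periods=3):
    """Assemble keyword arguments for a PutMetricAlarm request."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Average",
        "Period": period_seconds,                  # seconds per data point
        "EvaluationPeriods": evaluation_periods,   # consecutive periods that must breach
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],           # notified when the alarm enters ALARM
    }


def create_alarm(cloudwatch, alarm):
    """Create or update the alarm. `cloudwatch` is assumed to be a boto3 CloudWatch client."""
    return cloudwatch.put_metric_alarm(**alarm)


# Hypothetical usage (placeholder ARN, requires boto3 and AWS credentials):
#   import boto3
#   alarm = build_alarm("HighQueueDepth", "MyApp", "QueueDepth", 100,
#                       "arn:aws:sns:us-east-1:123456789012:ops-alerts")
#   create_alarm(boto3.client("cloudwatch"), alarm)
```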

Alerting when problems occur is critical, but alerting when problems are about to occur is far better. Understanding the design and architecture of your platform is key to being able to set thresholds correctly. You want to set your thresholds so that your systems are quiet when the load is within their capacity, but start speaking up when they head toward exceeding it. You will need to determine how much advance warning you need in order to fix issues.
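The relationship between capacity, growth, and warning time can be worked out with simple arithmetic. The sketch below assumes usage grows at a roughly constant rate, which is a simplification. The numbers in the example (a 500 GB volume growing 2 GB per hour, with 24 hours needed to respond) are invented for illustration.

```python
def warning_threshold(capacity, growth_per_hour, lead_time_hours):
    """Return the usage level at which to alert so that `lead_time_hours`
    remain before `capacity` is reached, assuming linear growth."""
    if growth_per_hour < 0 or lead_time_hours < 0:
        raise ValueError("growth and lead time must be non-negative")
    threshold = capacity - growth_per_hour * lead_time_hours
    if threshold <= 0:
        raise ValueError("capacity is too small to give the requested warning time")
    return threshold


# 500 GB capacity, 2 GB/hour growth, 24 hours of warning needed:
# alert once usage reaches 500 - 2 * 24 = 452 GB.
print(warning_threshold(500, 2, 24))  # prints 452
```

If the function raises, the lead time you want is not achievable at the current growth rate, which is itself a signal to add capacity.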

Recovering From Failure

The key to recovering from failure is to understand how the failure occurred. Once you have this understanding, you can be sure that you've fixed the root cause, and you will know how to prevent a recurrence. Finding a root cause can be straightforward if there is a direct cause and effect (we changed A, and B immediately happened). Some issues are harder to identify, and some can only be identified by asking "what changed?".

CloudTrail is a great tool for determining what changed. It allows you to audit and review the changes and commands run with any AWS credentials associated with your account. Once you've discovered what was changed and who or what changed it, you can resolve the issue and ensure that the incident is not repeated.
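A minimal sketch of asking CloudTrail "what changed?" is shown below. It uses the LookupEvents call, filtered to non-read-only events, and reduces each event to who, what, and when. The `cloudtrail` argument is assumed to be a client such as `boto3.client("cloudtrail")`. Pagination is left out for brevity, so at most 50 events are returned.

```python
from datetime import datetime, timedelta, timezone


def recent_changes(cloudtrail, hours=24, now=None):
    """List write (non-read-only) events from the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    response = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "ReadOnly", "AttributeValue": "false"}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        MaxResults=50,
    )
    return [
        {
            "time": event.get("EventTime"),
            "who": event.get("Username", "unknown"),
            "what": event.get("EventName"),
            "source": event.get("EventSource"),
        }
        for event in response.get("Events", [])
    ]


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   for change in recent_changes(boto3.client("cloudtrail"), hours=6):
#       print(change["time"], change["who"], change["what"], change["source"])
```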

Who or What Changed?

Your application
A third party
Something expired:
- SSL certificate
- Licenses
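
Expirations like the ones listed above are predictable, so they can be checked ahead of time. The sketch below computes how many days remain on a certificate from its "notAfter" string, in the format returned by Python's `ssl.SSLSocket.getpeercert()`. How you fetch the certificate, and the 30-day warning window in the comment, are left as assumptions.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days until a certificate expires (negative once it has expired).

    `not_after` uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


# A caller might raise an alert when fewer than, say, 30 days remain.
reference = datetime(2018, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jan  5 09:34:43 2018 GMT", now=reference))  # prints 4
```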

Automating Recovery

Automating service recovery and creating "self-healing" systems can take you to the next level of system architecture. Some solutions are quite simple. Using autoscaling within AWS, you can handle single instance/server failures without missing a beat. These solutions will automatically replace a failed server or will create or delete servers based on the demand at any given point in time.
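To illustrate the demand-based part, the sketch below assembles a target tracking policy for an Auto Scaling group using the PutScalingPolicy call. The group name and the 50% CPU target are placeholders. Replacing a failed instance does not need a policy like this, since the Auto Scaling group does that itself through its health checks.

```python
def build_target_tracking_policy(group_name, target_cpu_percent=50.0):
    """Assemble keyword arguments for a PutScalingPolicy request that keeps
    the group's average CPU utilization near `target_cpu_percent`."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": float(target_cpu_percent),
        },
    }


def apply_policy(autoscaling, policy):
    """Create or update the policy. `autoscaling` is assumed to be a boto3 Auto Scaling client."""
    return autoscaling.put_scaling_policy(**policy)


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   apply_policy(boto3.client("autoscaling"), build_target_tracking_policy("web-asg"))
```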

Beyond the simple tasks, many types of failure can be automatically recovered from, but this can involve significant work. Many failure events can generate notifications, either directly from the service, or via an alarm generated out of CloudWatch. These events can have a Lambda function attached to them, and from there, you can do anything you need to in order to recover the system. Do be cautious with this type of automation where you are, in essence, turning over some control of the platform - to the platform. Just like with a business application, there can be defects. However, as with any software, proper and thorough testing can help ensure a high-quality product.
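A Lambda function attached to an alarm's SNS topic could look roughly like the sketch below. It assumes the alarm notification arrives as a JSON string in the SNS message, with "AlarmName", "NewStateValue", and a "Trigger" section whose dimensions use lowercase "name" and "value" keys. Verify this against a real notification from your account before relying on it. The `recover` callable is hypothetical and stands in for whatever your platform needs, such as restarting a service.

```python
import json


def parse_alarm(event):
    """Extract the alarm name, new state, and dimensions from an SNS-delivered alarm."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    dimensions = {
        d["name"]: d["value"]
        for d in message.get("Trigger", {}).get("Dimensions", [])
    }
    return {
        "alarm": message["AlarmName"],
        "state": message["NewStateValue"],
        "dimensions": dimensions,
    }


def make_handler(recover):
    """Build a Lambda handler that calls `recover(alarm)` only for ALARM states."""
    def handler(event, context):
        alarm = parse_alarm(event)
        if alarm["state"] != "ALARM":
            # OK or INSUFFICIENT_DATA notifications need no recovery action.
            return {"action": "none", **alarm}
        recover(alarm)
        return {"action": "recovered", **alarm}
    return handler
```

Passing the recovery step in as a function keeps the handler easy to test with sample events, which matters given the caution above about handing control of the platform to the platform.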

Edge Cases

Many applications and services lend themselves to being monitored and maintained. When you run into an application that does not, it is no less important (it is likely more important) to monitor, alert on, and maintain that application. You may find yourself needing to go to extremes in order to pull these systems into your monitoring framework, but if you do not, you risk letting faults go undetected. Ensuring coverage of all of the components of your platform, documenting the platform and training staff to understand it, and practicing what to do in the case of outages will help ensure the highest uptime for your company.

Lesson Recap

Monitoring
Alerting
Recovering
Automating

Lesson Objectives

You will be able to:

Monitor AWS applications
Alert on problems in applications
Recover from failures in your platform
Understand testing and tradeoffs in automating recovery from failure

In this lesson, you learned how to monitor and maintain systems in AWS. You also looked at how to recover systems that have failed. The larger your application grows, the more parts and services it will have. The more complex it grows, the more things can go wrong. The more things that can go wrong, the more frequently they will go wrong. Expect failures, and plan to address and recover from them.

Glossary

SSL certificate: Cryptographic certificate for encrypting traffic between two computers.
Source of truth: When data is stored in multiple places or ways, the "source of truth" is the one that is used when there is a discrepancy between the multiple sources.
Monitoring: Systems to track and make visible metrics that are useful in identifying system performance.
Alerting: Systems to attract attention when performance thresholds are crossed.
Chaos Engineering: Intentionally causing issues in order to validate that a system can respond appropriately to problems.

Lesson Outline

Overview

Recovering all your systems

Monitoring in AWS