Resilience & Availability of your AWS Infrastructure

As startups mature and scale, so should the robustness and reliability of their cloud infrastructure. Monitoring your digital assets in the cloud can seem overwhelming at times, but you’re not alone. AWS meets you halfway through what it calls the Shared Responsibility Model: both you and AWS are responsible for the health and security of your systems. The goal of this series of blog posts is to showcase your side of that responsibility across the requirements for observability and failover solutions that account for the current and projected complexity of an organization’s infrastructure. Together, these solutions give stakeholders confidence in the reliability and resilience of the company, allowing them to focus their efforts on further growth.

A comprehensive monitoring solution has multiple layers. Monitoring provides an accurate representation of the system, and with monitoring come alerting and forensics. Suppose the monitoring system has just noticed an increased load on a server. How do we alert people? Who do we alert? Why did it break in the first place? Let’s answer these questions in detail for each piece of cloud infrastructure.

To build a resilient system that also meets the needs of its stakeholders, we at QPAIR recommend putting in place processes that will result in a more robust infrastructure:

  1. Gather an accurate representation of the state of the company’s servers and its application components.
  2. Monitor state, accessibility, and dependencies of each system.
  3. Predict the future state of systems (using data to make informed decisions when planning for scalability).
  4. Notify responsible parties when systems begin to show signs of overload, dependencies break, or fail altogether.
  5. Improve processes for reviewing, testing, and deploying new features.
  6. Provide the whole team with the same information regardless of which network they are connecting from.
  7. Establish detailed protocols the team should follow during different types of incidents.
  8. Improve the forensics of your overall application components.

It is important for stakeholders to be fully aware of which resources are being monitored, which have failover safeguards, and which need more resilient solutions.

 

Monitoring & Alerting Requirements

There are two kinds of alarms, Warning and Breach, with different levels of sensitivity. A Warning alarm signals that the server in question should be monitored closely and that a contingency plan should be considered. A Breach alarm signals that the server in question is in a suboptimal state that could lead to instability or failure, and that remediation must begin immediately. When an alarm is triggered, there will be an alerting component as defined below:

  1. Alerts shall be sent immediately after an alarm is triggered.
  2. Alerts shall be sent via Short Message Service (SMS) or email.
  3. Each alert shall contain identifying information about the resource in question and the symptom that makes the system’s distress evident, e.g. CPU utilization.
  4. There shall be a list of individuals who should receive the alerts.
  5. There shall be a list of qualified individuals who will serve as first responders.
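
As a rough illustration of the requirements above, the two alarm levels and the contents of an alert could be sketched in Python as follows. This is a minimal sketch, not a prescribed implementation: the threshold values, the names (Thresholds, classify, build_alert, format_message), and the routing rule that only Breach alerts page the first responders are all assumptions made for the example. Actual delivery over SMS or email is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class Severity(Enum):
    WARNING = "Warning"
    BREACH = "Breach"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical example values, in percent; tune them per resource.
    warning: float = 70.0
    breach: float = 90.0


@dataclass(frozen=True)
class Alert:
    severity: Severity
    resource_id: str            # identifies the resource in question
    symptom: str                # e.g. "CPU utilization"
    value: float
    recipients: Tuple[str, ...]


def classify(value: float, thresholds: Thresholds) -> Optional[Severity]:
    """Return the alarm level for a reading, or None when it is healthy."""
    if value >= thresholds.breach:
        return Severity.BREACH
    if value >= thresholds.warning:
        return Severity.WARNING
    return None


def build_alert(
    resource_id: str,
    symptom: str,
    value: float,
    thresholds: Thresholds,
    subscribers: Sequence[str],
    first_responders: Sequence[str],
) -> Optional[Alert]:
    """Build an alert for a reading, or return None if no alarm fired."""
    severity = classify(value, thresholds)
    if severity is None:
        return None
    # Assumption: every alert goes to the subscriber list, and a Breach
    # is additionally sent to the first responders, who are listed first.
    targets = list(subscribers)
    if severity is Severity.BREACH:
        targets = list(first_responders) + targets
    # dict.fromkeys removes duplicates while preserving order.
    recipients = tuple(dict.fromkeys(targets))
    return Alert(severity, resource_id, symptom, value, recipients)


def format_message(alert: Alert) -> str:
    """Render a short message suitable for an SMS or an email subject."""
    return (
        f"[{alert.severity.value}] {alert.resource_id}: "
        f"{alert.symptom} at {alert.value:.1f}%"
    )
```

With these example thresholds, a CPU utilization reading of 95 on a server named "web-01" would be classified as a Breach and rendered as "[Breach] web-01: CPU utilization at 95.0%". Keeping the classification separate from message formatting and delivery makes each piece easy to test on its own.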

QPAIR proposed solutions to meet requirements

We have several guiding principles for designing a bullet-proof infrastructure monitoring system:

  1. Infrastructure as code: nothing we do should require us to remember to click something in an interface, unless Terraform or Ansible does not support it.
  2. Reduce code and infrastructure management overhead by using as many off-the-shelf components as possible.
  3. Consistently writing good tests will allow us to have confidence in deploying changes to such a business-critical infrastructure. 
  4. Tests provide a written history of the desired features and behaviors that we expect, and allow for an audit trail when we forget something.
  5. For redundancy, added features, and to ensure connectivity at all times, we recommend adopting a third-party monitoring solution such as DataDog or NewRelic.
  6. Implement tracing across your full stack using a third-party tool such as DataDog APM.

 

Bitnami