As startups mature and scale, so should the robustness and reliability of their cloud infrastructure. Monitoring your digital assets in the cloud can seem overwhelming at times, but you’re not alone. AWS meets you halfway through what it calls the Shared Responsibility Model: you and AWS are each responsible for part of the health and security of your systems. The goal of this blog series is to showcase your share of that responsibility across the observability and failover requirements that account for the current and projected complexity of an organization's infrastructure. Together, these solutions give stakeholders confidence in the reliability and resilience of the company, allowing them to focus their efforts on further growth.
A comprehensive monitoring solution has multiple layers. Monitoring provides an accurate representation of the system, and with monitoring come alerting and forensics. Suppose the monitoring system has just noticed an increased load on a server. How do we alert people? Whom do we alert? Why did it break in the first place? Let’s answer these questions in detail for each piece of cloud infrastructure.
To build a resilient system that also meets the needs of its stakeholders, we at QPAIR recommend putting processes in place that make the infrastructure more robust.
It is important for stakeholders to be fully aware of which resources are being monitored and to what level, which have failover safeguards, and which need more resilient solutions.
Monitoring & Alerting Requirements
There are two kinds of alarms, Warning and Breach, with different levels of sensitivity. A Warning alarm signals that the server in question should be monitored closely and that a contingency plan should be considered. A Breach alarm signals that the server in question is in a suboptimal state that could lead to instability or failure, and that remediation actions must be taken immediately. When an alarm is triggered, there will be an alerting component as defined below:
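To make the two alarm levels concrete, here is a minimal Python sketch of how a single metric reading could be mapped to a Warning or a Breach. It is an illustration only, not a description of any AWS API or of QPAIR's actual tooling: the names `AlarmLevel`, `AlarmThresholds`, and `classify`, and the 70% and 90% CPU thresholds, are all assumptions made up for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlarmLevel(Enum):
    WARNING = "warning"  # watch closely and consider a contingency plan
    BREACH = "breach"    # remediate immediately


@dataclass(frozen=True)
class AlarmThresholds:
    """Hypothetical pair of thresholds for one metric (higher is worse)."""
    warning: float
    breach: float

    def __post_init__(self) -> None:
        # The Warning level is the more sensitive one, so it must trip first.
        if self.warning >= self.breach:
            raise ValueError("warning threshold must be below breach threshold")


def classify(value: float, thresholds: AlarmThresholds) -> Optional[AlarmLevel]:
    """Return the alarm level for a metric reading, or None if it is healthy."""
    if value >= thresholds.breach:
        return AlarmLevel.BREACH
    if value >= thresholds.warning:
        return AlarmLevel.WARNING
    return None


# Assumed example thresholds for CPU utilization, in percent.
cpu_thresholds = AlarmThresholds(warning=70.0, breach=90.0)

for reading in (55.0, 75.0, 95.0):
    level = classify(reading, cpu_thresholds)
    print(reading, level.value if level else "ok")
```

Keeping both thresholds in one object makes the ordering between them easy to validate, so a misconfigured pair fails at construction time rather than silently never raising a Warning.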
We follow several guiding principles in designing a bulletproof infrastructure monitoring system: