As an engineer or a site reliability engineer (SRE), what makes your job difficult or stressful?
Is it improving app performance after a new release? Is it being alerted too slowly when something goes wrong? Or is it not having the right resources and tools to solve those problems?
One of the best ways for an SRE to resolve app performance issues is to draw on a background in agile and take an app-centric approach. Dashboards give you a holistic view of app performance and operational issues, so you can see the underlying infrastructure without getting lost in its complexity. Having the issues laid out on a single dashboard also makes for quicker resolution.
How are SREs alerted about issues in app performance, though?
When an issue occurs and an SRE needs to be alerted, a robust and reliable ticketing or chat tool is critical. An instant-messaging format will almost always work better than a basic ticketing platform that isn’t timely. A chat platform also lets remote teams stay effective despite the distance.
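
To make the chat-based alerting idea concrete, here is a minimal sketch of turning an alert into a chat message. Everything in it is an assumption for illustration: the `Alert` fields, the `format_chat_message` helper, the channel name, and the payload shape are hypothetical and not tied to any particular chat platform. It only builds the message and does not send it anywhere.

```python
from dataclasses import dataclass
import json


@dataclass
class Alert:
    """A hypothetical alert record; the field names are illustrative only."""
    service: str
    severity: str  # e.g. "critical" or "warning"
    summary: str


def format_chat_message(alert: Alert, channel: str) -> str:
    """Build a JSON payload for a generic chat webhook (assumed shape)."""
    # Put severity first so the most important detail is visible at a glance.
    text = f"[{alert.severity.upper()}] {alert.service}: {alert.summary}"
    return json.dumps({"channel": channel, "text": text})


if __name__ == "__main__":
    # Example values are made up for demonstration.
    alert = Alert("checkout-api", "critical", "p95 latency above 2s")
    print(format_chat_message(alert, "#sre-alerts"))
```

In practice, the resulting payload would be posted to whatever webhook or bot interface your chat tool provides, so the alert lands in the channel the team is already watching.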