Revolutionizing the Nine Pillars of SRE With AI-Engineered Tools

September 25, 2023

Just as SRE practices complement DevOps practices, these nine pillars of SRE complement the nine pillars of DevOps explained in my recent blog Revolutionizing the Nine Pillars of DevOps With AI-Engineered Tools.

AI Applications for the Nine Pillars of SRE

The following briefly explains how the current generation of AI-engineered large language model (LLM) tools is being applied, or could practically be applied, to improve each of the nine pillars of SRE.

Culture: AI can help identify patterns and trends in system behavior that may not be immediately evident. This can help shift the wisdom of production further to the left, so that lessons learned from running systems reach development earlier. For example, tools like Datadog’s Watchdog can automatically detect and surface anomalies in system behavior.

Toil Reduction and Automation: AI can help reduce toil by automating more complex tasks that were previously challenging to automate. Machine learning models can predict system behavior, enabling proactive responses. AI operations platforms like Moogsoft or BigPanda can automate incident management, from detection to remediation.

SLAs/SLOs/SLIs and Error Budgets: AI can assist in refining SLAs, SLOs and SLIs by providing accurate predictions of system performance and identifying potential bottlenecks. Tools like Nobl9 can use machine learning to better predict and define SLOs based on historical performance data.
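To make the error budget idea concrete, here is a minimal sketch of the arithmetic that any SLO prediction ultimately feeds into. It is not how Nobl9 or any other product computes budgets; the function name and the request-based SLI are assumptions chosen for illustration. A 99.9% SLO over 1,000,000 requests allows 1,000 failures, and the budget is whatever part of that allowance has not yet been spent.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize a request-based SLI against its SLO (hypothetical helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the fraction of requests that succeeded in the window.
    sli = 1.0 - failed_requests / total_requests
    # Error budget: the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "budget_remaining": 1.0 - budget_consumed,  # negative when overspent
        "slo_met": sli >= slo_target,
    }


report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"budget remaining: {report['budget_remaining']:.0%}")
```

A predictive model would replace the observed `failed_requests` with a forecast, but the budget calculation itself stays the same.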

Measurements and Observability: AI can help analyze vast amounts of data from monitoring and observability systems, identifying patterns and correlations that may be difficult for humans to detect. Tools like Dynatrace and New Relic use AI to analyze monitoring data, detect anomalies and correlate events.
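As a rough illustration of what "detecting anomalies" means at its simplest, the sketch below flags points that deviate sharply from a trailing window of recent values. This is a basic rolling z-score, not the method used by Dynatrace, New Relic or any other vendor; the function name, window size and threshold are assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of values far from the mean of the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is treated as unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


latencies_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(zscore_anomalies(latencies_ms))  # prints [7], the 250 ms spike
```

Commercial tools layer seasonality handling, multi-signal correlation and learned baselines on top of this kind of check, which is where the machine learning comes in.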

Anti-Fragility: AI can contribute to the development of more resilient systems by analyzing system behavior under stress and identifying vulnerabilities. The data gathered from chaos experiments using tools like Gremlin can be used in conjunction with machine learning models to predict system behavior.

Work-Sharing and Incremental Technical Debt: AI can help analyze code and infrastructure changes to assess their impact on technical debt. Tools like DeepCode and SonarQube use AI to provide guidance on how to manage this debt incrementally.

Deployments: AI can be used to optimize deployment strategies by predicting the impact of new releases. A tool like Harness can use machine learning to optimize deployment strategies and minimize risks.
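One simple way to picture release risk assessment is a canary comparison: send a small share of traffic to the new release and check whether its error rate is significantly worse than the baseline. The sketch below uses a one-sided two-proportion z-test. It is an illustrative assumption, not a description of how Harness works, and the function name and the default critical value (1.645, roughly a 5% one-sided significance level) are choices made for this example.

```python
import math


def canary_verdict(baseline_errors, baseline_total,
                   canary_errors, canary_total, z_critical=1.645):
    """Return "rollback" if the canary error rate is significantly higher."""
    p_base = baseline_errors / baseline_total
    p_canary = canary_errors / canary_total
    # Pooled error rate under the hypothesis that both releases behave alike.
    pooled = (baseline_errors + canary_errors) / (baseline_total + canary_total)
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / baseline_total + 1 / canary_total))
    if std_err == 0:
        # Either no request failed anywhere, or every request failed.
        return "promote" if pooled == 0 else "rollback"
    z = (p_canary - p_base) / std_err
    return "rollback" if z > z_critical else "promote"


print(canary_verdict(50, 10_000, 30, 1_000))  # 0.5% vs. 3% errors
print(canary_verdict(50, 10_000, 5, 1_000))   # 0.5% vs. 0.5% errors
```

A production system would look at more than one metric and would account for low traffic volumes, but the decision has the same shape: compare, test, then promote or roll back.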

Performance Management of Apps and Infrastructure: AI can analyze application and infrastructure performance data to identify optimization opportunities and predict capacity requirements. Tools like Turbonomic leverage AI to optimize resource allocation in real-time.
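Capacity prediction can be sketched with nothing more than a straight-line trend. The example below fits a least-squares line to daily usage samples and estimates how many days remain before a capacity limit is reached. This is a deliberately simple stand-in for the models a product like Turbonomic would use; the function name and the one-sample-per-day assumption are hypothetical.

```python
def days_until_full(daily_usage, capacity):
    """Estimate days until usage reaches capacity from a linear trend.

    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Ordinary least squares with x = day index (0, 1, 2, ...).
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    # Count from the most recent sample; never report a negative horizon.
    return max(0.0, day_full - (n - 1))


disk_used_gb = [50, 52, 54, 56, 58]
print(days_until_full(disk_used_gb, capacity=100))  # prints 21.0
```

Real workloads are rarely this linear, which is why seasonality-aware and learned models tend to do better than a single trend line.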

Incident Management: AI can assist with incident management by automating the detection and triage of incidents, helping to quickly identify the root cause. Tools like PagerDuty use AI to automate detection and triage and to suggest potential remediation steps, leading to faster resolution times.
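Much of automated triage starts with noise reduction: folding a burst of related alerts into a single incident so that responders see one page instead of twenty. The following is a minimal rule-based sketch of that grouping step, assuming alerts arrive as (timestamp in seconds, service, message) tuples. It is not PagerDuty's algorithm, and the function name and five-minute window are assumptions.

```python
def group_alerts(alerts, window_seconds=300):
    """Fold alerts for the same service into one incident while they keep
    arriving within window_seconds of the previous alert."""
    incidents = []
    latest_by_service = {}  # service name -> most recent incident
    for timestamp, service, message in sorted(alerts):
        incident = latest_by_service.get(service)
        if incident and timestamp - incident["last_seen"] <= window_seconds:
            incident["alerts"].append(message)
            incident["last_seen"] = timestamp
        else:
            incident = {
                "service": service,
                "first_seen": timestamp,
                "last_seen": timestamp,
                "alerts": [message],
            }
            incidents.append(incident)
            latest_by_service[service] = incident
    return incidents


alerts = [
    (0, "api", "5xx spike"),
    (60, "api", "latency high"),
    (90, "db", "replica lag"),
    (1000, "api", "5xx spike"),
]
for incident in group_alerts(alerts):
    print(incident["service"], incident["first_seen"], incident["alerts"])
```

AI-driven platforms go further by learning which alerts tend to occur together across services, rather than relying on a fixed service name and time window.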

Pitfalls and Challenges

Applying AI to SRE is a complex process with certain challenges. Here are some potential pitfalls along with ways to address them:

Lack of Quality Data: AI and machine learning models are only as good as the data they are trained on. Inadequate or poor quality data can lead to inaccurate predictions and insights.
• Prioritize data hygiene and governance. Collect comprehensive and diverse data from your systems, ensure that it is well-structured and free of errors, and store it in a way that’s easily accessible for training AI models.
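A data hygiene check can be as modest as rejecting malformed telemetry before it reaches a training set. The sketch below assumes a hypothetical record schema with `timestamp`, `service` and `latency_ms` fields; the field names and the validation rules are illustrative only.

```python
import math

REQUIRED_FIELDS = ("timestamp", "service", "latency_ms")  # hypothetical schema


def split_valid_records(records):
    """Separate usable records from ones with missing or invalid fields."""
    valid, rejected = [], []
    for record in records:
        problems = [f"missing {name}" for name in REQUIRED_FIELDS
                    if record.get(name) is None]
        latency = record.get("latency_ms")
        if latency is not None and (
            not isinstance(latency, (int, float))
            or isinstance(latency, bool)
            or math.isnan(latency)
            or latency < 0
        ):
            problems.append("invalid latency_ms")
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"timestamp": 1, "service": "api", "latency_ms": 120.5},
    {"timestamp": 2, "latency_ms": 80},
    {"timestamp": 3, "service": "api", "latency_ms": -4},
]
valid, rejected = split_valid_records(records)
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected records along with the reasons makes it easier to trace quality problems back to the system that produced them.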

Over-reliance on Automation: While AI can greatly enhance automation, relying on it too heavily without human oversight can lead to missed signals or overcorrections in response to false positives.
• Maintain a balance between automation and human oversight. Use AI to support decision-making, not replace it entirely. It’s important to have experienced SREs review AI outputs regularly to ensure they make sense and are beneficial.

Underestimating the Need for AI Expertise: Implementing AI is not just about buying and deploying a tool. It requires a deep understanding of AI and machine learning principles, as well as the ability to interpret the results correctly.
• Invest in training your team on AI and machine learning principles or consider hiring AI specialists. Having in-house AI expertise can help ensure that you’re using AI tools effectively and interpreting their outputs correctly.

Ignoring the Importance of Integration: AI tools need to work seamlessly with your existing infrastructure and processes to be effective. A lack of integration can lead to data silos and inefficiencies.
• When selecting AI tools, consider how easily they can be integrated into your existing infrastructure and workflows. Use APIs and other integration methods to ensure that your AI tools can access the data they need and that their insights are easily accessible to your SRE team.

Neglecting Ethics and Privacy: AI tools often need access to sensitive data to operate effectively, which can raise ethical and privacy concerns.
• Implement robust data privacy and security measures. Ensure that you’re following all relevant regulations and industry best practices for data privacy. Use anonymization and other data protection techniques to protect sensitive data.
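One common data protection technique is to replace sensitive values with keyed hashes before telemetry is handed to an AI tool. The sketch below uses HMAC-SHA256 from the Python standard library. Strictly speaking this is pseudonymization rather than full anonymization, since anyone holding the key can recompute a token for a known input. The field names, token length and function name are assumptions for illustration.

```python
import hashlib
import hmac

SENSITIVE_FIELDS = frozenset({"user_id", "email", "client_ip"})  # hypothetical


def pseudonymize(record, secret_key, fields=SENSITIVE_FIELDS):
    """Return a copy of record with sensitive values replaced by keyed hashes.

    The same input always maps to the same token, so records can still be
    correlated without exposing the original value.
    """
    cleaned = {}
    for key, value in record.items():
        if key in fields:
            digest = hmac.new(secret_key, str(value).encode("utf-8"),
                              hashlib.sha256).hexdigest()
            cleaned[key] = digest[:16]  # shortened token for readability
        else:
            cleaned[key] = value
    return cleaned


log_entry = {"email": "jane@example.com", "status": 500, "path": "/checkout"}
print(pseudonymize(log_entry, secret_key=b"rotate-me-regularly"))
```

The key should be stored and rotated like any other secret; whether this level of protection satisfies a given regulation is a question for your compliance team.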


Artificial intelligence is revolutionizing SRE practices by automating complex tasks, analyzing vast amounts of data and making proactive predictions. AI can reduce toil, enhance system understanding and streamline incident management. It allows SREs to shift further left by integrating predictive insights into the development process, thereby enhancing the culture of proactive issue prevention. AI also helps define more precise service level objectives and error budgets and helps optimize deployment strategies based on predictive analytics. In incident management, AI speeds up detection, triage and mitigation by identifying patterns and suggesting remediations. Overall, AI is fast becoming an essential tool for SREs, facilitating a shift from reactive problem-solving to proactive management and thereby increasing system reliability, performance, scalability and efficiency.

