Location: Remote

Responsibilities:
Participate in on-call rotations, responding to production incidents during non-business hours, weekends, and holidays as needed. Manage and resolve system incidents by leading incident bridges, troubleshooting, and driving resolution.
Continuously monitor system performance using telemetry tools to identify and resolve potential issues before they impact service reliability. Ensure all performance metrics remain within acceptable limits and drive progress toward KPI targets.
Maintain automation tools, reducing manual efforts and increasing reliability.
Lead and execute buildouts, ensure timely delivery, and troubleshoot deployment issues.
Analyze operational data, create dashboards, and report on system chokepoints, throughput, and performance. Identify areas for cycle time reduction and incident toil minimization.
Lead blameless post-incident (postmortem) reviews to determine root cause and improve service resiliency. Implement preventive measures to avoid repeat issues.
Create and maintain comprehensive documentation, including technical procedures, playbooks, and Troubleshooting Guides (TSGs), to help streamline incident response and improve operational knowledge sharing.

Qualifications:
Bachelor’s degree in Computer Science or a related technical discipline.
Strong experience with Azure DevOps, including subscription management, Azure Portal, CLI, and deploying Azure Virtual Machines.
A minimum of two years of professional experience with .NET, C#, and PowerShell or other scripting languages.
At least three years of hands-on experience with microservices architecture, including proficiency in DevOps deployment strategies such as CI/CD pipelines.
A minimum of two years of experience in a cloud, infrastructure, or platform domain.
At least two years of experience with infrastructure automation or infrastructure as code (IaC) frameworks.
Two years of Site Reliability Engineering (SRE) experience, with proven ability to address and resolve Severity 1 and Severity 2 incidents efficiently under live-site conditions.
Experience using Ev2 for deploying Azure resources in public and sovereign clouds, monitoring deployments, and troubleshooting issues.
Experience with real-time incident management, driving incident bridges, troubleshooting, and performing root cause analysis in production environments.
Hands-on experience with source control management, including initiating code changes, handling pull requests, and collaborating via Git and Microsoft Visual Studio.
Strong ability to document troubleshooting procedures and playbooks and to create detailed Troubleshooting Guides (TSGs) for recurring issues.

In addition to the technical qualifications, candidates must show:
• Attention to detail, ensuring accuracy and precision in the execution of tasks.
• Diligence, displaying consistency and thoroughness in their approach to responsibilities.
• Strong soft skills, including communication, collaboration, adaptability, and problem-solving, as these are critical for fostering effective teamwork, resolving challenges, and supporting operational excellence in a dynamic environment.

DevOps Site Reliability Engineer (SRE) L3

About the Author: Anusha
