The Site Reliability Engineer (SRE) ensures the reliability and performance of critical services, bridging development and operations. The role focuses on building scalable infrastructure, applying SRE practices such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and reducing operational toil. Close collaboration with other teams to improve reliability and foster a culture of continuous learning is key.
- Design and implement resilient system architectures for high availability and scalability.
- Develop automation tools and scripts to improve operational efficiency.
- Define, track, and analyze SLOs and SLIs for performance and reliability.
- Conduct post-mortem analyses and implement improvements based on findings.
- Collaborate on best practices for system reliability and incident management.
- Troubleshoot and resolve database, network, and deployment issues.
- Ensure issue resolution meets Service Level Agreements (SLAs).
- Identify and address system performance bottlenecks with actionable recommendations.
- Maintain documentation for processes and incident responses.