Site Reliability Engineer - Incident Commander
Siemens Digital Industries Software is a leading provider of solutions for the design, simulation, and manufacture of products across many industries, enabling innovation for everything from Formula 1 cars to space exploration vehicles.
In this crucial role, you will develop cutting‑edge automated solutions to support our best‑in‑class cloud infrastructure, particularly for the Siemens Xcelerator platform. When incidents arise, you will coordinate major incident response, ensuring rapid resolution and seamless communication with partners during service‑impacting events, while upholding strict SLAs.
Key Responsibilities
- Incident Management : Act as the primary point of contact and leader during major incidents, coordinating response, communication and resolution across all involved teams.
- Incident Response : Quickly assess severity, determine impact and drive appropriate response to restore services as quickly as possible.
- Communication : Ensure clear, concise and timely communication with stakeholders throughout the incident lifecycle.
- Post‑Incident Analysis : Lead reviews to identify root causes, drive improvements and implement preventive measures.
- Collaboration : Work closely with SRE, DevOps, Development and other teams to continuously improve incident management processes.
- Training & Preparedness : Conduct regular incident response drills and train teams to handle high‑severity incidents.
- Documentation : Maintain and update incident management documentation.
- Monitoring & Alerts : Define and refine alerting criteria to detect and escalate incidents promptly.
- Continuous Improvement : Find opportunities to improve system reliability, scalability and performance based on lessons learned from incidents.
- 24x7 On‑call rotation : Participate in 24x7 on‑call rotation.
Qualifications
- Familiar with cloud infrastructure (AWS, GCP, Azure), containerization (Docker, Kubernetes) and automation scripting (Python, Bash).
- Experience with incident management platforms (Jira Service Management, ServiceNow), monitoring tools (Datadog, Grafana) and on‑call systems (PagerDuty).
- Proven ability to rapidly assess, troubleshoot and resolve complex incidents in distributed enterprise IT environments.
- Demonstrated leadership in incident response, managing cross‑functional teams and aligning with business stakeholders.
- Outstanding English communication skills, both verbal and written.
- Skilled in defining, tracking and utilizing incident metrics (MTTR, MTTD) to drive accountability and continuous improvement.
- Excellent troubleshooting and problem‑solving skills, with the ability to quickly analyze complex systems.
- Highly motivated to continuously learn new technologies and adapt to evolving trends, with availability to work required core hours.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender expression, sexual orientation, age, marital status, veteran status, or disability status.
We offer a comprehensive reward package which includes a competitive basic salary, bonus scheme, generous holiday allowance, pension and private healthcare.