Site Reliability Engineer - Incident Commander
Job ID
Posted since
03-Oct-2025
Organization
Field of work
Information Technology
Company
Siemens Industry Software Sdn. Bhd.
Early Professional
Job type
Full-time
Office / Site only
Employment type
Permanent
Location(s)
Siemens Digital Industries Software is a leading provider of solutions for the design, simulation, and manufacture of products across many different industries. Formula 1 cars, skyscrapers, ships, space exploration vehicles, and many of the objects we see in our daily lives are being conceived and manufactured using our Product Lifecycle Management (PLM) software.
Are you ready to make a tangible impact on critical cloud-based applications in a dynamic and collaborative environment? Join our organization, where you will be at the forefront of enhancing service and application availability, optimizing processes through innovative automation, and solving complex technical challenges.
In this crucial role, you will develop cutting-edge automated solutions that support and sustain our best-in-class cloud infrastructure, particularly for the vital Siemens Xcelerator platform. When incidents arise, you will coordinate major incident response, ensuring rapid resolution and seamless communication with our partners during service-impacting events, all while upholding our strict Service Level Agreements (SLAs). Your exceptional communication and coordination skills will be paramount, as you will directly contribute to our product teams consistently meeting their commitments and driving overall platform reliability.
Key Responsibilities
- Incident Management: Act as the primary point of contact and leader during major incidents, coordinating the response, communication, and resolution efforts across all involved teams.
- Incident Response: Quickly assess the severity of incidents, determine the impact, and drive the appropriate response to restore services as quickly as possible.
- Communication: Ensure clear, concise, and timely communication with stakeholders, including technical teams, management, and customers, throughout the incident lifecycle.
- Post-Incident Analysis: Lead post-incident reviews to identify root causes, drive improvements, and implement preventive measures to reduce the likelihood of recurrence.
- Collaboration: Work closely with SRE, DevOps, Development, and other relevant teams to ensure that incident management processes are well-defined and continuously improved.
- Training & Preparedness: Conduct regular incident response drills, train teams on incident management processes, and ensure readiness for handling high-severity incidents.
- Documentation: Maintain and update incident management documentation, ensuring that all procedures are up to date and accessible to all relevant teams.
- Monitoring & Alerts: Collaborate with SRE and monitoring teams to define and refine alerting criteria, ensuring that incidents are detected and escalated promptly.
- Continuous Improvement: Find opportunities to improve system reliability, scalability, and performance based on lessons learned from incidents.
- 24x7 On-Call Rotation: Participate in the 24x7 on-call rotation.
Qualifications
- Technical Expertise: Familiar with cloud infrastructure (AWS, GCP, Azure), containerization (Docker, Kubernetes), and automation scripting (Python, Bash).
- Incident Management Tools: Familiarity with incident management platforms (e.g., Jira Service Management, ServiceNow), monitoring tools (e.g., Datadog, Grafana), and on-call systems (e.g., PagerDuty).
- Incident Response & Resolution: Proven ability to rapidly assess, troubleshoot, and resolve complex incidents in distributed enterprise IT environments, ensuring quick service restoration while remaining calm under pressure.
- Leadership & Stakeholder Management: Demonstrated leadership in incident response, effectively managing cross-functional teams and aligning with business and product stakeholders.
- Communication: Outstanding English communication skills, both verbal and written, including strong listening and synthesis abilities.
- Metrics & Continuous Improvement: Skilled in defining, tracking, and utilizing incident metrics (e.g., MTTR, MTTD) to drive accountability and continuous improvement.
- Problem-Solving: Excellent troubleshooting and problem-solving skills, with the ability to quickly analyze complex systems.
- Proactive Learning & Availability: Highly motivated to continuously learn new technologies and adapt to evolving trends, with availability to work required core hours.
- Nice to have: Relevant certifications (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator).
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status.
We are Siemens
A collection of over 377,000 minds building the future, one day at a time, in over 200 countries. We're dedicated to equality, and we welcome applications that reflect the diversity of the communities we work in. All employment decisions at Siemens are based on qualifications, merit, and business need. Bring your curiosity and creativity and help us shape tomorrow!
We offer a comprehensive reward package which includes a competitive basic salary, bonus scheme, generous holiday allowance, pension, and private healthcare.