Design Failover Systems : Design and maintain scalable failover systems, backup strategies, and redundancy mechanisms across cloud and on-prem environments.
Develop DR Documentation : Create and update disaster recovery documentation, runbooks, and recovery playbooks for infrastructure and application layers.
Business Continuity Testing : Plan, coordinate, and execute tabletop exercises, DR drills, and failover simulations; analyze and report outcomes, identify gaps, and lead remediation initiatives.
Incident Response & Crisis Management : Develop incident response procedures, escalation paths, and communication frameworks for major outages; act as a key responder and facilitator during critical incidents to ensure swift coordination across teams.
Data Backup & Recovery Strategy : Implement and manage cloud-based and on-premise backup solutions aligned with defined RTO and RPO; regularly test and validate data restoration processes.
24/7/365 Coverage : Participate in a rotating on-call schedule to ensure continuous coverage; operate within a 3-shift structure of 9-hour shifts, each with an overlapping hour for smooth handovers.
Collaboration with Tier 1 and Tier 2 Support : Work closely with Tier 1 and Tier 2 teams as the first point of contact for incidents and service requests; provide expertise and escalation support to ensure efficient resolution and clear communication.
Qualifications
Bachelor’s degree in Computer Science, Engineering, or a related field.
3+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
Proven experience with DR planning, testing, and recovery operations.
Proficiency in AWS, particularly the services that support infrastructure and application layers.
Hands‑on experience with backup solutions such as Veeam, Rubrik, AWS Backup, and Azure Site Recovery.
Strong understanding of high availability, system redundancy, and incident management frameworks (ITIL, NIST).
Familiarity with monitoring and alerting tools (Prometheus, Grafana, Splunk, PagerDuty).
Strong spoken and written English communication skills.
Preferred Skills
Certifications in cloud platforms (e.g., AWS Solutions Architect, Azure Administrator).
Experience with chaos engineering or reliability testing tools (Gremlin, Chaos Monkey).
Site Reliability Engineer • Kuala Lumpur, Kuala Lumpur, Malaysia