Design and maintain scalable failover systems, backup strategies, and redundancy mechanisms across cloud and on-prem environments.
Develop and update DR documentation, runbooks, and recovery playbooks for infrastructure and application layers.
2. Business Continuity Testing:
Plan, coordinate, and execute tabletop exercises, DR drills, and failover simulations.
Analyze and report the outcomes of BC/DR tests; identify gaps and lead remediation initiatives.
3. Incident Response & Crisis Management:
Develop and refine incident response procedures, escalation paths, and communication frameworks for major outages.
Act as a key responder and facilitator during critical incidents, ensuring swift coordination across teams.
4. Data Backup & Recovery Strategy:
Implement and manage cloud-based and on-premises backup solutions aligned with the defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
Regularly test and validate data restoration processes to ensure system recoverability.
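As an illustration of how a restore test can be checked against these objectives, here is a minimal Python sketch. The function names and the example RPO and RTO values are hypothetical and are not taken from this role's actual targets.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """Data written after the last backup is lost, so that window must fit in the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """The time from failure until service is restored must fit in the RTO."""
    return service_restored - failure_time <= rto


# Hypothetical drill: assumed targets of a 15-minute RPO and a 4-hour RTO.
failure = datetime(2024, 1, 1, 12, 0)
rpo_ok = meets_rpo(datetime(2024, 1, 1, 11, 50), failure, timedelta(minutes=15))
rto_ok = meets_rto(failure, datetime(2024, 1, 1, 17, 0), timedelta(hours=4))
```

In this made-up drill the last backup is 10 minutes old, which fits the assumed RPO, while the 5-hour restore exceeds the assumed RTO and would be reported as a gap.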
5. 24/7/365 Coverage:
Participate in a rotating on-call schedule to ensure continuous coverage.
Daily operations run in three 9-hour shifts, with one team member per shift and a one-hour overlap between consecutive shifts to allow a smooth handover.
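The shift arithmetic can be sketched in Python as follows. Three 9-hour shifts total 27 staffed hours; starting each shift 8 hours after the previous one leaves a one-hour handover at every boundary and still covers the full 24 hours. The midnight start time and the function names are assumptions for illustration only, since the posting does not state actual start times.

```python
from datetime import datetime, timedelta

SHIFT_LENGTH = timedelta(hours=9)
SHIFT_SPACING = timedelta(hours=8)  # 9-hour shift minus the 1-hour overlap


def build_shifts(day_start: datetime) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs for the three shifts that begin on one day."""
    return [
        (day_start + i * SHIFT_SPACING, day_start + i * SHIFT_SPACING + SHIFT_LENGTH)
        for i in range(3)
    ]


def handover(outgoing: tuple[datetime, datetime],
             incoming: tuple[datetime, datetime]) -> timedelta:
    """Time during which the outgoing and incoming shifts are both on duty."""
    return outgoing[1] - incoming[0]


# Assumed start of the first shift: midnight (hypothetical).
shifts = build_shifts(datetime(2024, 1, 1))
```

Under this assumption the third shift ends at 01:00 the next day, one hour after the next day's first shift begins, so the handover hour is preserved across midnight as well.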
6. Collaboration with Tier 1 and Tier 2 Support:
Qualifications:
Preferred Skills:
Site Reliability Engineer • Kuala Lumpur, Kuala Lumpur, Malaysia