No longer accepting applications
Site Reliability Engineer (SRE)

Refine Group • Kuala Lumpur, Kuala Lumpur, Malaysia
3 days ago
Job description

Responsibilities

  • Design Failover Systems: Design and maintain scalable failover systems, backup strategies, and redundancy mechanisms across cloud and on‑premises environments.
  • Develop DR Documentation: Create and update disaster recovery (DR) documentation, runbooks, and recovery playbooks for infrastructure and application layers.
  • Business Continuity Testing: Plan, coordinate, and execute tabletop exercises, DR drills, and failover simulations; analyze and report outcomes, identify gaps, and lead remediation initiatives.
  • Incident Response & Crisis Management: Develop incident response procedures, escalation paths, and communication frameworks for major outages; act as a key responder and facilitator during critical incidents to ensure swift coordination across teams.
  • Data Backup & Recovery Strategy: Implement and manage cloud‑based and on‑premises backup solutions aligned with the defined recovery time objective (RTO) and recovery point objective (RPO); regularly test and validate data restoration processes.
  • 24/7/365 Coverage: Participate in a rotating on‑call schedule to ensure continuous coverage; operate within a 3‑shift structure with 9‑hour shifts and an overlapping hour for smooth transitions.
  • Collaboration with Tier 1 and Tier 2 Support: Work closely with Tier 1 and Tier 2 teams as the first point of contact for incidents and service requests; provide expertise and escalation support to ensure efficient resolution and seamless communication.

Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
  • Proven experience with DR planning, testing, and recovery operations.
  • Proficiency in AWS, focusing on services that support infrastructure and application layers.
  • Hands‑on experience with backup solutions such as Veeam, Rubrik, AWS Backup, and Azure Site Recovery.
  • Strong understanding of high availability, system redundancy, and incident management frameworks (ITIL, NIST).
  • Familiarity with monitoring and alerting tools (Prometheus, Grafana, Splunk, PagerDuty).
  • Strong spoken and written English communication skills.

Preferred Skills

  • Certifications in cloud platforms (e.g., AWS Solutions Architect, Azure Administrator).
  • Experience with chaos engineering or reliability testing tools (Gremlin, Chaos Monkey).