Site Reliability Engineer (SRE)

FPT Software • Kuala Lumpur, Kuala Lumpur, Malaysia
Posted 1 day ago
Job description

1. Disaster Recovery (DR) Planning :

Design and maintain scalable failover systems, backup strategies, and redundancy mechanisms across cloud and on-prem environments.

Develop and update DR documentation, runbooks, and recovery playbooks for infrastructure and application layers.

2. Business Continuity Testing :

Plan, coordinate, and execute tabletop exercises, DR drills, and failover simulations.

Analyze and report outcomes of BC/DR tests; identify gaps and lead remediation initiatives.

3. Incident Response & Crisis Management :

Develop and refine incident response procedures, escalation paths, and communication frameworks for major outages.

Act as a key responder and facilitator during critical incidents, ensuring swift coordination across teams.

4. Data Backup & Recovery Strategy :

Implement and manage cloud-based and on-premises backup solutions aligned with the defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

Regularly test and validate data restoration processes to ensure system recoverability.

5. 24/7/365 Coverage :

Participate in a rotating on-call schedule to ensure continuous coverage.

Daily operations run in three 9-hour shifts with one team member per shift; consecutive shifts overlap by one hour to allow a smooth handover.

6. Collaboration with Tier 1 and Tier 2 Support :

  • Work closely with Tier 1 and Tier 2 teams who will serve as the first point of contact for incidents and service requests.
  • Provide expertise and escalation support as needed, ensuring efficient resolution of issues and seamless communication between teams.

Qualifications :

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
  • Proven experience with DR planning, testing, and recovery operations.
  • Proficiency in AWS, with a focus on relevant services that support infrastructure and application layers.
  • Hands‑on experience with backup solutions (e.g., Veeam, Rubrik, AWS Backup, Azure Site Recovery).
  • Strong understanding of high availability, system redundancy, and incident management frameworks (e.g., ITIL, NIST).
  • Familiarity with monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk, PagerDuty).
  • Strong spoken and written English communication skills, essential for effective collaboration with global teams.

Preferred Skills :

  • Certifications in cloud platforms (e.g., AWS Solutions Architect, Azure Administrator).
  • Experience with chaos engineering or reliability testing tools (e.g., Gremlin, Chaos Monkey).