Site Reliability Engineer (SRE)

FPT Software • Kuala Lumpur, Kuala Lumpur, Malaysia

Posted 1 day ago

Job description

Design and maintain scalable failover systems, backup strategies, and redundancy mechanisms across cloud and on-prem environments.

Develop and update DR documentation, runbooks, and recovery playbooks for infrastructure and application layers.

2. Business Continuity Testing :

Plan, coordinate, and execute tabletop exercises, DR drills, and failover simulations.

Analyze and report outcomes of BC/DR tests; identify gaps and lead remediation initiatives.

3. Incident Response & Crisis Management :

Develop and refine incident response procedures, escalation paths, and communication frameworks for major outages.

Act as a key responder and facilitator during critical incidents, ensuring swift coordination across teams.

4. Data Backup & Recovery Strategy :

Implement and manage cloud-based and on-premises backup solutions aligned with defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets.

Regularly test and validate data restoration processes to ensure system recoverability.

5. 24/7/365 Coverage :

Participate in a rotating on-call schedule to ensure continuous coverage.

Daily operations run in 3 shifts of 9 hours each, with 1 member per shift and a 1-hour overlap between consecutive shifts to allow a smooth handover.
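
As a rough check of the shift arithmetic above, the sketch below lays out three 9-hour shifts that each start 8 hours after the previous one. That spacing gives a 1-hour overlap at every handover while still covering the full 24 hours. The midnight start, the date, and the names `build_shifts`, `SHIFT_LENGTH`, and `OVERLAP` are illustrative assumptions; the posting does not state actual shift start times.

```python
from datetime import datetime, timedelta

SHIFT_LENGTH = timedelta(hours=9)   # from the posting
OVERLAP = timedelta(hours=1)        # from the posting
SHIFTS_PER_DAY = 3                  # from the posting


def build_shifts(day_start):
    """Return (start, end) pairs for one day of shifts beginning at day_start."""
    # Each shift starts one shift length minus the overlap after the previous one.
    step = SHIFT_LENGTH - OVERLAP
    return [
        (day_start + i * step, day_start + i * step + SHIFT_LENGTH)
        for i in range(SHIFTS_PER_DAY)
    ]


# Assumed start of the first shift (hypothetical; not given in the posting).
shifts = build_shifts(datetime(2024, 1, 1, 0, 0))

for start, end in shifts:
    print(start.strftime("%H:%M"), "->", end.strftime("%H:%M"))
```

With the assumed midnight start, the shifts run 00:00 to 09:00, 08:00 to 17:00, and 16:00 to 01:00, so the last shift also overlaps the first shift of the following day by one hour.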

6. Collaboration with Tier 1 and Tier 2 Support :

Work closely with Tier 1 and Tier 2 teams who will serve as the first point of contact for incidents and service requests.
Provide expertise and escalation support as needed, ensuring efficient resolution of issues and seamless communication between teams.

Qualifications :

Bachelor’s degree in Computer Science, Engineering, or a related field.

3+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.

Proven experience with DR planning, testing, and recovery operations.

Proficiency in AWS, with a focus on relevant services that support infrastructure and application layers.

Hands‑on experience with backup solutions (e.g., Veeam, Rubrik, AWS Backup, Azure Site Recovery).

Strong understanding of high availability, system redundancy, and incident management frameworks (e.g., ITIL, NIST).

Familiarity with monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk, PagerDuty).

Strong spoken and written English communication skills, essential for effective collaboration with global teams.

Preferred Skills :

Certifications in cloud platforms (e.g., AWS Solutions Architect, Azure Administrator).

Experience with chaos engineering or reliability testing tools (e.g., Gremlin, Chaos Monkey).
