Site: MVC Resources Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia

Overview
We are seeking a Site Reliability Engineer to join our team in Kuala Lumpur. The role focuses on building and maintaining robust, scalable, and resilient systems to ensure the reliability and performance of critical services.

Responsibilities
Monitor and maintain system performance to ensure high availability and reliability.
Design and implement resilient system architectures.
Develop automation tools and scripts to enhance operational efficiency.
Define, track, and analyze SLOs and SLIs to meet business needs.
Conduct post-mortem analyses after incidents and drive continuous improvement.
Collaborate with development teams to establish and promote best practices.
Troubleshoot and resolve issues related to database performance, network connectivity, and deployment failures.
Identify and resolve performance bottlenecks in applications and infrastructure.
Participate in on-call rotations and respond to critical incidents.
Analyze system logs and metrics to identify trends and opportunities for improvement.

Qualifications
Strong experience with Linux systems and distributed computing fundamentals.
Proven experience troubleshooting application issues, with a focus on performance and connectivity.
Familiarity with networking concepts and effective troubleshooting techniques.
Experience in Bash / Shell scripting or automation for system administration tasks.
Experience in programming languages such as Python, Golang, or Java.
Demonstrated experience in system architecture and design, prioritizing reliability and scalability.
Understanding of SRE principles, including SLOs, SLIs, toil reduction, and incident post-mortems.
Hands-on experience with cloud environments (AWS, Azure, Google Cloud) and their operational management.
Excellent problem-solving abilities and a proactive approach to operational challenges.
Ability to work independently while collaborating effectively within a team.
Open to a rotational shift schedule across different time slots, with schedules shared in advance.
Ability to communicate effectively in Mandarin is an added advantage.

Preferred Skills
Observability & Monitoring: Prometheus, Grafana, Alertmanager, Loki, Jaeger / Tempo, OpenTelemetry
Containerization & Orchestration: Kubernetes, Helm, service mesh (Istio / Linkerd)
Big Data & Streaming: Apache Flink, Kafka, Spark
Infrastructure as Code & Automation: Terraform, Ansible, CI / CD pipelines
Cloud Platforms: AWS, Azure, GCP
Programming & Scripting: Python, Go, Bash
Resiliency & Reliability Engineering: Incident response, RCA, chaos engineering, disaster recovery

Shift
Morning (Hybrid): 7am - 4pm
Afternoon (Flexi): 3pm - 12am
Night (Flexi): 11pm - 8am

Contact
Reach out to Cheyenne with your updated resume for a private conversation.

Employment details
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: Online Audio and Video Media