Site Reliability Engineer
Location: FINEXUS Group, Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia
Responsibilities
- Ensure high availability and reliability of IT systems, applications, and PCI DSS‑certified data centers, supporting both internal operations and client‑facing platforms.
- Perform system administration of Linux and Windows servers, including installation, configuration, patching, monitoring, and performance tuning.
- Manage data storage, backup, and disaster recovery planning (DRP) to ensure data integrity, resilience, and compliance with industry standards.
- Conduct capacity planning and lifecycle management of infrastructure resources, ensuring optimal performance and scalability.
- Define and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to measure and improve reliability.
- Implement chaos testing and fault‑injection practices to proactively identify weaknesses.
- Optimize observability and alerting systems (e.g., Prometheus, Grafana, ELK, Nagios or equivalent) to ensure actionable insights and minimal alert fatigue.
- Implement and maintain system and network security controls, including firewall management, VPN, identity and access management, and endpoint security.
- Support compliance with the BNM RMiT, PCI DSS, and ISO 27001 standards, and assist with external audits.
- Manage logs and integrate SIEM platforms to strengthen monitoring and incident response.
- Support vulnerability management and coordinate with Security Operations teams for patching.
- Deploy, configure, and maintain Kubernetes clusters (SUSE Rancher Prime) and containerized workloads.
- Build and maintain CI/CD pipelines for automated deployment, testing, and operational efficiency.
- Automate configuration and patch management using tools such as Ansible, Puppet, or equivalent.
- Implement Infrastructure as Code (IaC) using Terraform or equivalent for consistent environment provisioning.
- Develop auto‑healing and self‑recovery scripts to reduce mean time to recovery (MTTR).
- Optimize cost and performance for cloud and container workloads.
- Administer and troubleshoot DNS, DHCP, VPN, load balancers, and core network services.
- Support virtualization platforms and physical server infrastructure within data centers.
- Collaborate on zero‑trust segmentation and service mesh integration.
- Provide on‑call support, collaborate on incident resolution, and maintain runbooks.
- Lead post‑incident reviews (PIRs) and blameless retrospectives.
- Leverage AIOps or event‑correlation tools for proactive incident detection.
Requirements
- Bachelor’s or Master’s Degree in Computer Science, IT, Engineering or related field.
- 4+ years of experience in Site Reliability Engineering, System Administration or IT Infrastructure.
- Proven experience in Linux and Windows system administration.
- Hands‑on experience with cloud operations (AWS, Azure, GCP) and container orchestration (Kubernetes, Rancher).
- Strong knowledge of networking, firewalls, DNS, DHCP, VPN, and enterprise security best practices.
- Experience in database management (MySQL, PostgreSQL, or equivalent) including backup, tuning, and recovery.
- Knowledge of compliance frameworks (PCI DSS, ISO 27001, BNM RMiT) is highly desirable.
- Strong problem‑solving and troubleshooting skills in mission‑critical environments.
- Excellent communication skills in English and Malay (spoken and written).
- Ability to work independently and collaboratively in a fast‑paced, regulated technology environment.
- Experience with SRE toolchains: Prometheus, Grafana, ELK, Terraform, Ansible, Jenkins, GitLab CI/CD or equivalent.
- Relevant certifications such as AWS Certified SysOps Administrator, RHCE, Kubernetes Administrator (CKA), or ISO 27001 Implementer are an advantage.