System Reliability & Operations
- Ensure high availability and reliability of IT systems, applications, and PCI DSS‑certified data centres, supporting both internal operations and client‑facing platforms.
- Perform system administration (Linux and Windows servers), including installation, configuration, patching, monitoring, and performance tuning.
- Manage data storage, backup, and disaster recovery planning (DRP) to ensure data integrity, resilience, and compliance with industry standards.
- Conduct capacity planning and lifecycle management of infrastructure resources, ensuring optimal performance and scalability.
- Define and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to measure and improve reliability.
- Implement chaos testing and fault‑injection practices to proactively identify weaknesses and improve system resilience.
- Optimize observability and alerting systems (e.g., Prometheus, Grafana, ELK, Nagios or equivalent) to ensure actionable insights and minimal alert fatigue.
Security & Compliance
- Implement and maintain system and network security controls, including firewall management, VPN, identity/access management, and endpoint security.
- Ensure compliance with BNM RMiT, PCI DSS, and ISO 27001 standards, supporting internal and external audits.
- Manage system logs and integrate with SIEM platforms to strengthen monitoring and incident response capabilities.
- Support vulnerability management programs by coordinating with Security Operations teams for timely patching and remediation.
- Participate in risk assessment and security architecture reviews, ensuring SRE practices align with compliance requirements.
Cloud, Containerization & Automation
- Support and optimize hybrid cloud environments (AWS, Azure, GCP) to align with Finexus’ cloud strategy and cost efficiency.
- Deploy, configure, and maintain Kubernetes clusters (SUSE Rancher Prime) and containerized workloads to improve scalability and reliability.
- Build and maintain CI/CD pipelines for automated deployment, testing, and operational efficiency.
- Automate configuration and patch management using tools such as Ansible, Puppet, or equivalent.
- Implement Infrastructure as Code (IaC) using Terraform or equivalent for consistent and auditable environment provisioning.
- Develop auto‑healing and self‑recovery automation scripts to reduce manual interventions and mean time to recovery (MTTR).
- Implement cost optimization and performance monitoring for cloud and container workloads.
Networking & Core Services
- Administer and troubleshoot DNS, DHCP, VPN, load balancers, and core network services to ensure smooth operations.
- Support virtualization platforms (Proxmox, etc.) and physical server infrastructure within Finexus data centres.
- Integrate network observability tools for real‑time visibility into latency, bandwidth, and routing anomalies.
- Collaborate on zero‑trust network segmentation and service mesh integration for improved security and reliability.
Monitoring & Support
- Provide on‑call support on a rotational basis for production issues and incidents, ensuring rapid resolution and minimal downtime.
- Collaborate with application, database, and security teams to deliver reliable, compliant, and high-performance services for clients.
- Lead post‑incident reviews (PIRs) and blameless retrospectives to identify root causes and preventive actions.
- Maintain runbooks and operational documentation to streamline response and improve knowledge transfer.
- Leverage AIOps or event‑correlation tools to enhance proactive incident detection and reduce false positives.
Job Requirements
- Bachelor’s or Master’s Degree in Computer Science, Information Technology, Engineering, or related field.
- 4+ years of experience in Site Reliability Engineering, System Administration, or IT Infrastructure.
- Proven experience in Linux and Windows system administration.
- Hands‑on experience with cloud operations (AWS, Azure, GCP) and container orchestration (Kubernetes, Rancher).
- Strong knowledge of networking, firewalls, DNS, DHCP, VPN, and enterprise security best practices.
- Experience in database management (MySQL, PostgreSQL, or equivalent), including backup, tuning, and recovery.
- Knowledge of compliance frameworks (PCI DSS, ISO 27001, BNM RMiT) is highly desirable.
- Strong problem‑solving and troubleshooting skills in mission‑critical environments.
- Excellent communication skills in English and Malay (spoken and written).
- Ability to work independently and collaboratively in a fast‑paced, regulated technology environment.
- Experience with SRE toolchains: Prometheus, Grafana, ELK, Terraform, Ansible, Jenkins, GitLab CI/CD, or equivalent.
- Possession of relevant certifications, including AWS Certified SysOps Administrator, RHCE, Kubernetes Administrator (CKA), or ISO 27001 Implementer, will be considered an added advantage.
Seniority level
Associate
Employment type
Full‑time
Job function
Engineering, Administrative, and Information Technology
Industries
Technology, Information and Media