Site Reliability Engineer

Razer • Bangsar South
30+ days ago
Job description

ROLE OVERVIEW:

We are seeking a skilled and driven Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team. The ideal candidate will have hands-on experience in Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

REQUIREMENTS:

  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related field.
  • Minimum 2 years of experience in SRE, DevOps, Cloud Infrastructure, or Systems Administration roles.
  • Solid hands-on experience with AWS Cloud services, including (but not limited to):
      • Compute: EC2, Lambda, ECS, Auto Scaling
      • Networking: VPC, Load Balancers, Route 53
      • Messaging & Storage: SQS, S3, RDS, ElastiCache, SES
      • Monitoring: CloudWatch, X-Ray
  • Proficient in Infrastructure as Code using Terraform and/or CloudFormation.
  • Experience with CI/CD tools (e.g., GitLab CI, Jenkins, CodePipeline, ArgoCD).
  • Strong understanding of Linux and Windows system administration and troubleshooting.
  • Comfortable with one or more scripting/programming languages such as Python, Node.js, Bash, or Ruby, and with JSON/YAML formats, for automation.
  • Experience with containerization and orchestration (Docker, ECS, or Kubernetes is a plus).
  • Familiar with observability tools and incident management best practices.

JOB DESCRIPTION:

  • Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation.
  • Implement and operate reliable, scalable cloud infrastructure primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).
  • Lead and participate in architecture reviews focusing on reliability, scalability, security, and performance.
  • Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK) to detect and resolve issues proactively.
  • Perform incident management, postmortems, root cause analysis, and implement continuous improvement strategies.
  • Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, and release management.
  • Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby).
  • Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.
  • Ensure systems are compliant with security standards, including patching, hardening, and secure access policies.
  • Provide on-call support and participate in incident rotations.
  • Monitor and maintain service-level objectives (SLOs), SLAs, and error budgets to ensure reliability targets are met.
  • Work the 5:00 PM to 2:00 AM (UTC+8) shift to ensure continuous SRE coverage.
  • Undergo an initial familiarization period during regular working hours before transitioning to the designated shift.
  • Provide support and resolution for assigned incidents and tickets.
  • Pre-Requisites: Are you game?
