Overview
iSoftStone, Federal Territory of Kuala Lumpur, Malaysia
Site Reliability Engineer
About us : We're seeking a skilled and motivated Site Reliability Engineer (SRE) for a role in the insurance industry. The SRE will ensure the reliability and performance of our systems, collaborate with development teams on scalable solutions, and participate in incident response. This is an opportunity for those passionate about Kubernetes, infrastructure reliability, and cloud technologies to become a Subject Matter Expert (SME) in the SRE domain.
Responsibilities
- System Reliability : Collaborate with software development teams to ensure reliability is a key consideration throughout the software development life cycle. Design and implement scalable and resilient architectures for mission-critical applications. Design, implement, and manage Kubernetes clusters, ensuring high availability, fault tolerance, and scalability. Perform upgrades, patch management, and security enhancements for Kubernetes infrastructure.
- Automation and Infrastructure as Code (IaC) : Drive automation efforts to streamline deployment, scaling, and management of applications on Kubernetes and/or cloud environments. Implement CI/CD pipelines for deploying and updating Kubernetes applications. Develop and maintain Infrastructure as Code scripts (e.g., Terraform, Ansible) for provisioning and managing cloud and container resources. Leverage cloud services (AWS, GCP, Azure) to optimize Kubernetes infrastructure and seamlessly integrate with other cloud-native solutions. Implement best practices for deploying and managing Kubernetes on cloud platforms.
- Monitoring and Alerting : Implement effective monitoring and alerting solutions for Kubernetes clusters, applications, and underlying infrastructure. Proactively identify and address performance bottlenecks and reliability issues. Respond to and resolve incidents related to Kubernetes infrastructure and applications, ensuring minimal downtime and impact on users. Conduct post-incident reviews and implement improvements to prevent future issues.
- Capacity Planning : Perform capacity planning to ensure the Kubernetes infrastructure can accommodate current and future workloads in the cloud.
- Security : Collaborate with the security team to implement and maintain security best practices for Kubernetes environments in the cloud. Conduct regular security audits and vulnerability assessments.
- Collaboration and Documentation : Work closely with development, operations, and other teams to ensure a collaborative approach to infrastructure and application reliability. Maintain clear and comprehensive documentation for processes, configurations, and troubleshooting steps.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- At least 1 year of experience as a Site Reliability Engineer or similar functional role.
- Strong programming or scripting skills, with proficiency in Bash, Python, Go, or Java.
- Extensive experience with Kubernetes orchestration, including cluster setup, management, and troubleshooting.
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible) and cloud platforms.
- Solid understanding of virtualization and networking concepts and principles.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.