Direct message the job poster from University of Malaya
The ideal candidate will design, organize, and modify the company's computer systems. This individual will evaluate and assess systems to ensure they are operating effectively. Based on assessments, this individual will harness collected knowledge and make adjustments to existing systems.
Here is the detailed breakdown of responsibilities for the role :
1. System & Infrastructure Management
- Cluster Operations : Install, configure, and maintain Linux-based HPC clusters, compute nodes, and associated server hardware.
- Storage Management : Manage and optimize large-scale parallel file systems (e.g., Lustre, GPFS) and high-performance storage solutions.
- Network Administration : Configure, manage, and monitor the high-speed HPC network infrastructure, including InfiniBand and Ethernet fabrics, to ensure optimal performance.
- Security & Patching : Implement system security policies, perform regular security hardening, and apply OS patches and updates to ensure system integrity.
- Backup & Recovery : Oversee and execute data backup and disaster recovery procedures for critical systems and user data.
2. Application & Software Support
- Software Deployment : Install, compile, and manage a wide range of scientific applications, compilers (e.g., GNU, Intel), and parallel libraries (e.g., MPI, OpenMP, CUDA).
- Scheduler Management : Manage and configure the HPC job scheduling system (e.g., Slurm, PBS) to ensure fair resource allocation, manage queues, and optimize cluster efficiency.
- Application Troubleshooting : Assist researchers in debugging and optimizing their parallel codes and software workflows.
3. Monitoring & Performance Tuning
- System Monitoring : Implement and maintain robust monitoring tools (e.g., Ganglia, Nagios, Prometheus / Grafana) to track cluster health, resource utilization, and job performance.
- Problem Resolution : Proactively identify, troubleshoot, and resolve system bottlenecks, hardware failures, and software issues to minimize downtime.
- Performance Analysis : Analyze system logs and performance metrics to recommend and implement optimisations for the cluster and storage systems.
- Technical Support : Serve as a primary point of contact for researchers and students, providing expert technical support for job submission, data management, and software issues.
- Account Management : Manage user accounts, project allocations, and resource quotas.
- Training & Documentation : Develop and deliver training workshops, user guides, and technical documentation to help users effectively utilize the HPC resources.
- Liaison : Collaborate with researchers to understand their computational needs and provide guidance on HPC best practices.
Qualifications
- Bachelor's degree in computer science, preferably in networking or computer systems
- Experience as a System Administrator
- Interest in learning about HPC
- Strong analytical skills
- Local candidate only (Malaysian)
Seniority level : Entry level
Employment type : Full-time
Industries : Education Management