As an Infrastructure Platform Engineer, you will build and maintain the infrastructure that powers both our AI application runtime and model training workflows. You will own secure, observable, and scalable environments that support model hosting, prompt execution, agent tools, and internal model training pipelines. Your work ensures that product and platform engineers can deploy and scale AI workloads efficiently across cloud and on‑prem infrastructure. This role blends DevOps, ML systems engineering, and platform development for AI workloads.
Responsibilities
Application Infrastructure
- Manage model routing, fallback, and token usage enforcement across LLM providers.
- Operate and optimize model‑serving infrastructure (e.g., vLLM, Triton, OpenAI proxies).
- Build and maintain tool execution runtimes and internal service orchestration layers.
- Implement secure API gateways, rate limiting, authentication, and quota management.
Training Infrastructure
- Develop training pipelines for pre‑training and fine‑tuning workflows.
- Manage GPU scheduling, storage access, and experiment tracking (e.g., MLflow, Weights & Biases).
- Partner with AI researchers and platform engineers to operationalise training and evaluation runs.
- Maintain dataset versioning, access control, and data preprocessing pipelines.
Platform Operations
- Maintain CI/CD systems for platform services and runtime components.
- Establish observability and monitoring systems across model, memory, and agent services.
- Apply best practices for infrastructure security, availability, and cost optimization.
- Document infrastructure components and standard deployment practices.
Qualifications
Must-Have
- 6+ years' experience in infrastructure engineering, DevOps, or ML systems
- Strong command of Kubernetes, Terraform, and cloud-native architecture (AWS, Azure, GCP)
- Experience with containerization, CI/CD, and API security practices
- Prior exposure to model hosting or ML pipeline orchestration
- Understanding of networking concepts, including VPNs, VNets, and hybrid connectivity
- Familiarity with security best practices for cross‑platform infrastructure
- Experience with on‑prem infrastructure, including networking and storage hardware
Bonus
- Experience with GPU resource orchestration or Kubeflow
- Familiarity with inference servers like vLLM, Triton, TGI, or TorchServe
- Understanding of cost telemetry and resource budgeting for model traffic
- Security mindset and experience with IAM, logging, and compliance
- Familiarity with compliance frameworks (SOC2, GDPR, HIPAA) and implementing controls
- Background in database management across different platforms
What Success Looks Like
- Application infra consistently meets SLAs for latency, availability, and model cost‑efficiency
- Model gateway and tool runtimes are secure, observable, and used across all verticals without incident
- Training infra enables researchers or platform engineers to run fine‑tuning and evaluation jobs with minimal bottlenecks
- CI/CD, monitoring, and deployment standards are adopted org-wide for AI workloads
- You proactively identify and resolve scaling, quota, or security risks before they impact production
Seniority Level
Mid‑Senior level
Employment Type
Full-time
Job Function
Information Technology
Referrals increase your chances of interviewing at Neuron Solutions Sdn Bhd by 2x
Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia
Salary: MYR7,000 - MYR10,000