AI Infrastructure Engineer (Senior / Team Lead)

Neuron Solutions Sdn Bhd, Kuala Lumpur, Kuala Lumpur, Malaysia
Job description

As an Infrastructure Platform Engineer, you will build and maintain the infrastructure that powers both our AI application runtime and model training workflows. You will own secure, observable, and scalable environments that support model hosting, prompt execution, agent tools, and internal model training pipelines. Your work ensures that product and platform engineers can deploy and scale AI workloads efficiently across cloud and on‑prem infrastructure. This role blends DevOps, ML systems engineering, and platform development for AI workloads.

Responsibilities

Application Infrastructure

  • Manage model routing, fallback, and token-usage enforcement across LLM providers.
  • Operate and optimize model-serving infrastructure (e.g., vLLM, Triton, OpenAI proxies).
  • Build and maintain tool execution runtimes and internal service orchestration layers.
  • Implement secure API gateways, rate limiting, authentication, and quota management.

Training Infrastructure

  • Develop training pipelines for pre-training and fine-tuning workflows.
  • Manage GPU scheduling, storage access, and experiment tracking (e.g., MLflow, Weights & Biases).
  • Partner with AI researchers and platform engineers to operationalise training and evaluation runs.
  • Maintain dataset versioning, access control, and data preprocessing pipelines.

Platform Operations

  • Maintain CI/CD systems for platform services and runtime components.
  • Establish observability and monitoring across model, memory, and agent services.
  • Apply best practices for infrastructure security, availability, and cost optimization.
  • Document infrastructure components and standard deployment practices.

Qualifications

Must-Have

  • 6+ years' experience in infrastructure engineering, DevOps, or ML systems
  • Strong command of Kubernetes, Terraform, and cloud-native architecture (AWS, Azure, GCP)
  • Experience with containerization, CI/CD, and API security practices
  • Prior exposure to model hosting or ML pipeline orchestration
  • Understanding of networking concepts, including VPNs, VNets, and hybrid connectivity
  • Familiarity with security best practices for cross-platform infrastructure
  • Experience with on-prem infrastructure, including networking and storage hardware

Bonus

  • Experience with GPU resource orchestration or Kubeflow
  • Familiarity with inference servers such as vLLM, Triton, TGI, or TorchServe
  • Understanding of cost telemetry and resource budgeting for model traffic
  • Security mindset and experience with IAM, logging, and compliance
  • Familiarity with compliance frameworks (SOC 2, GDPR, HIPAA) and implementing their controls
  • Background in database management across different platforms

What Success Looks Like

  • Application infrastructure consistently meets SLAs for latency, availability, and model cost-efficiency
  • The model gateway and tool runtimes are secure, observable, and used across all verticals without incident
  • Training infrastructure enables researchers and platform engineers to run fine-tuning and evaluation jobs with minimal bottlenecks
  • CI/CD, monitoring, and deployment standards are adopted org-wide for AI workloads
  • You proactively identify and resolve scaling, quota, or security risks before they impact production

Seniority Level

Mid-Senior level

Employment Type

Full-time

Job Function

Information Technology

Salary: MYR 7,000 - MYR 10,000