The role bridges ML research and production-grade systems, owning infrastructure, pipelines, and CI/CD for deploying and scaling ML models.
Responsibilities
Design, build, and maintain robust ML pipelines using Kubeflow, Airflow, or Prefect for automated model retraining and batch inference
Develop and manage feature stores (e.g., Feast or Tecton) to keep feature definitions consistent between training and serving
Deploy ML models as high-throughput, low-latency microservices using Triton Inference Server, KServe, or FastAPI
Implement monitoring and alerting for model drift and system performance using Prometheus, Grafana, and Evidently AI
Containerize ML workloads with Docker and orchestrate on Kubernetes
Establish CI/CD pipelines for automated testing and deployment (GitOps, GitHub Actions, GitLab CI)
Requirements
3-6 years of experience in DevOps, MLOps, or Software Engineering with a focus on ML infrastructure
Proficiency in Python and familiarity with shell scripting, Terraform, and cloud platforms (AWS, GCP, or Azure)
Hands-on experience with Kubernetes and containerized applications at scale
Familiarity with ML lifecycle platforms such as MLflow, Weights & Biases, or SageMaker Pipelines
Solid understanding of version control, automated testing, and code review
Nice to Have
Experience with large language model deployment (vLLM, Ollama)
Experience with distributed training frameworks (Ray, Spark)
BS/MS in Computer Science or a related field