This Senior MLOps Engineer role combines SRE and DevOps practices to build and operate ML and AI systems at scale. You will own the path from model to production, ensuring reliability, cost efficiency, and automation.
Responsibilities
Build and operate model-serving and inference infrastructure, managing latency, throughput, autoscaling, and reliability.
Own the ML deployment lifecycle, including model registry, versioning, promotion workflows, and safe rollback.
Operate agentic and LLM workloads in production, managing inference providers, gateways, and guardrails.
Build reproducible, automated ML pipelines with lineage tracking.
Extend infrastructure-as-code to ML systems using Terraform and multi-account design.
Operate GitOps workflows for ML workloads, including ArgoCD configuration.
Run ML and AI workloads on multi-tenant Kubernetes (AWS EKS) with GPU scheduling.
Own ML reliability and observability, including SLOs, drift detection, and on-call practices.
Drive ML cost efficiency by right-sizing accelerators and managing capacity.
Requirements
5+ years in platform engineering, SRE, MLOps, or infrastructure roles, working with production systems at scale.
Hands-on experience deploying and operating ML or AI workloads in production.
Heap is a product analytics platform that automatically captures user interactions across websites and mobile apps to provide digital insights. The company is now part of Contentsquare and specializes in helping digital teams understand user behavior through auto-captured data, retroactive analysis, and AI-powered insights.