Senior MLOps Engineer - SRE | DevOps

Heap

REMOTE · WORLDWIDE

$3K–$4K

Full-time

SENIOR

About the Role

This Senior MLOps Engineer role combines SRE and DevOps practice to build and operate ML and AI systems at scale. You will own the path from model to production, ensuring reliability, cost efficiency, and automation.

Responsibilities

Build and operate model-serving and inference infrastructure, managing latency, throughput, autoscaling, and reliability.
Own the ML deployment lifecycle, including model registry, versioning, promotion workflows, and safe rollback.
Operate agentic and LLM workloads in production, managing inference providers, gateways, and guardrails.
Build automated, reproducible ML pipelines with lineage tracking.
Extend infrastructure-as-code to ML systems using Terraform and multi-account design.
Operate GitOps for ML workloads with ArgoCD configuration.
Run ML and AI workloads on multi-tenant Kubernetes (AWS EKS) with GPU scheduling.
Own ML reliability and observability including SLOs, drift detection, and on-call practices.
Drive ML cost efficiency through right-sizing accelerators and capacity management.

Requirements

5+ years in platform engineering, SRE, MLOps, or infrastructure with production systems at scale.
Hands-on experience deploying and operating ML or AI workloads in production.

Heap

San Francisco · US · 117+ employees

Heap is a product analytics platform that automatically captures user interactions across websites and mobile apps to provide digital insights. The company is now part of Contentsquare and specializes in helping digital teams understand user behavior through auto-captured data, retroactive analysis, and AI-powered insights.

Strong SRE/DevOps foundation including SLOs, post-mortems, and reliability improvements.
Deep IaC expertise with complex Terraform state and multi-account configurations.
Strong GitOps background with declarative infrastructure management.
Deep Kubernetes knowledge at control plane level.
Strong AWS background in networking, compute, IAM, storage, multi-account design.
Hands-on experience with CI/CD pipelines (GitHub Actions, CircleCI, GitLab CI).
Automation-first thinking at senior level.
Active user of agentic coding tools.
Strong communication skills for cross-team collaboration.

Nice to Have

Experience with GPU/accelerator scheduling and node lifecycle management in production.
Experience operating LLM inference at scale with quota management and guardrails (e.g., AWS Bedrock).
Experience with ML pipeline and orchestration tooling (Argo Workflows, Kubeflow, Airflow, SageMaker Pipelines).
Experience with model registries, feature stores, and experiment tracking (MLflow, Feast).
Familiarity with model and data drift monitoring and ML-specific observability.
Background in FinOps with inference cost attribution and capacity planning.
Experience with multi-tenant infrastructure and data infrastructure (object storage, CDC pipelines).