Astera Labs is hiring a Machine Learning Infrastructure Engineer to build the runtime, platform, and operational backbone for modern AI systems. This role focuses on model access layers, routing, serving, telemetry, observability, evaluation infrastructure, and controls to make fast-moving AI reliable.
Responsibilities
Build and improve internal AI infrastructure for LLM applications, agents, retrieval systems, and model-backed engineering workflows.
Own inference deployment paths across managed and self-serve environments, including access control, monitoring, and operational reliability.
Build platform layers such as model gateways, routing, runtime integrations, telemetry, and controls for safe execution at scale.
Develop AI Ops capabilities across evaluation, release readiness, observability, incident triage, regression detection, and cost monitoring.
Build dashboards, tracing, logging, and alerting for production AI systems, including spend and usage visibility.
Improve performance and unit economics through routing, caching, batching, failover, and latency/cost optimization.
Create reusable APIs, SDKs, and platform abstractions for easier deployment, evaluation, governance, and operation.
Requirements
1–5 years of experience in software engineering, ML infrastructure, MLOps, platform engineering, or related backend/infrastructure roles.