Datadog AI Research (DAIR) seeks an AI Research Engineer to collaborate with research scientists on building systems for observability foundation models, autonomous SRE agents, and production code repair agents. The role involves developing training and evaluation pipelines, implementing models at scale, and integrating AI capabilities into Datadog's product ecosystem.
Responsibilities
Build and operate datasets, training and evaluation pipelines, benchmarks, and internal tooling
Implement models, run experiments at scale, and profile for reliability, performance, and cost
Orchestrate distributed training and distributed RL with Ray, including scheduling, scaling, and failure recovery
Make the research stack observable, reproducible, and easier to use
Establish rigorous automated benchmarks and regression tests for forecasting, anomaly detection, multi-modal analysis, agents, and code repair tasks
Collaborate with Research Scientists, Product, and Engineering to integrate advanced AI capabilities into Datadog’s product ecosystem and to harden prototypes into reliable services
Contribute high-quality code, documentation, and open-source artifacts
Requirements
Strong software engineering skills with experience in observability, SRE, or security
Depth in distributed computing and ML systems for training and inference at scale
Proficiency in Python and familiarity with a systems language (e.g., Rust, C++, or Go)
Practical experience implementing and operating ML training and inference systems (e.g., PyTorch or JAX), including containerization, orchestration, and GPU acceleration
Familiarity with efficient training, fine-tuning, and inference techniques for large foundation models
Ability to explain design and performance trade-offs clearly
Strong interest in open-science and open-source contributions
Nice to Have
Demonstrated ability to bridge cutting-edge research prototypes and real-world product applications, ideally with large foundation models, generative AI agents, or domain-specific LLM deployments
Passion for pushing AI boundaries while focusing on customer impact, scalability, and responsible deployment
Hands-on experience with GPU programming and optimization, including CUDA
Experience writing production data pipelines and applications
Experience supporting or contributing to research publications