NVIDIA is looking for a Senior MLOps Engineer to build and maintain CI/CD pipelines for its GenAI Frameworks (Megatron-LM and NeMo). You will develop scalable DevOps solutions, manage cluster operations, and collaborate with deep learning teams to accelerate AI research and development.
Responsibilities
Develop and maintain continuous integration pipelines and release processes for Megatron-LM and the NeMo Framework.
Implement efficient and scalable DevOps solutions for frequent, high-quality software releases.
Work with Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, and Jira in hybrid on-premises and cloud environments.
Assist with cluster operations and system administration.
Automate accuracy and performance regression detection to accelerate R&D cycles.
Develop quality control measures such as code analysis, backward-compatibility checks, and regression testing.
Requirements
BS or MS in Computer Science or a related field (or equivalent experience) and 3+ years of DevOps/infrastructure engineering experience.
Strong system-level programming skills in Python and shell scripting.
Experience with build/release systems and CI/CD (GitLab, GitHub, Jenkins).