Snowflake's ML Platform team is building Cortex Training, an LLM post-training platform. They are seeking a Senior Software Engineer to design and build scalable distributed systems for GPU compute, productionize research building blocks, and drive performance at scale.
Responsibilities
Design and build across the full stack — from the public training APIs and SDK through the control plane to the GPU data plane.
Scale the distributed systems that make GPU compute serverless — multi-tenant scheduling, placement, and capacity-aware routing across regional GPU pools, with fault tolerance built in.
Drive end-to-end performance at scale — keep the training, inference, and RL loops fast and the data plane responsive under heavy concurrent load, with GPUs kept saturated.
Productionize research building blocks — partner with Snowflake Research to turn state-of-the-art training and inference techniques into reliable, composable components customers can run at enterprise scale.
Requirements
5+ years building and shipping production ML systems.
Strong distributed systems and infrastructure foundation — designing scalable, fault-tolerant services and operating them on Kubernetes in production.
Familiarity with GPU and LLM infrastructure — e.g., PyTorch, DeepSpeed/FSDP, Ray, CUDA/NCCL, vLLM; able to debug across the data, infrastructure, and GPU layers.