Snowflake is seeking a Senior Software Engineer to build Cortex Training, its LLM post-training platform, which turns GPU capacity into a composable service that customers use to adapt open-weight models. The role involves designing and scaling distributed systems for GPU compute, productionizing research building blocks, and driving end-to-end performance.
Responsibilities
Design and build across the full stack, from the public training APIs and SDK, through the control plane, to the GPU data plane.
Scale distributed systems for multi-tenant scheduling, placement, and capacity-aware routing across regional GPU pools.
Drive end-to-end performance to keep training, inference, and RL loops fast and GPUs saturated.
Productionize research building blocks in partnership with Snowflake Research.
Requirements
5+ years building and shipping production ML systems.
Strong foundation in distributed systems and infrastructure, including Kubernetes.
Familiarity with GPU and LLM infrastructure (PyTorch, DeepSpeed/FSDP, Ray, CUDA/NCCL, vLLM).
Ability to harden complex systems for reliability, throughput, and cost efficiency.