About Us: We build power-efficient, low-precision foundation models designed to run from edge devices to large-scale deployments. We train models ranging from roughly 1B to 100B+ parameters across LLMs, diffusion models, and other modalities, with a strong emphasis on efficient training, inference, and real-world deployment under power and memory constraints.
Role Overview: We are seeking a Staff-level (or higher) model training expert to lead large-scale model training with a focus on low-precision and power efficiency. This role combines hands-on ownership of 100B+ parameter training runs on TPUs using JAX with responsibility for setting technical direction, mentoring engineers, and raising training quality across the organization.
Responsibilities: You will design, implement, and debug distributed training pipelines on TPUs using JAX across all major training phases, while driving efficiency and stability at scale.
Basic Qualifications: You bring deep experience in large-scale ML systems and a strong foundation in modern model training.
Preferred Qualifications: You have additional experience that directly aligns with our efficiency-focused mission.
Ideal Candidate Profile: You have personally trained large models end-to-end, understand why large-scale training runs fail and how to fix them, care deeply about efficiency and real-world deployment, enjoy mentoring others, and are comfortable operating at the intersection of research, systems engineering, and product constraints.