Staff AI/ML Engineer – Large-Scale & Low-Precision AI

About Us: We build power-efficient, low-precision foundation models designed to run anywhere from edge devices to large-scale deployments. We train models ranging from roughly 1B to 100B+ parameters across LLMs, diffusion models, and other modalities, with a strong emphasis on efficient training, inference, and real-world deployment under power and memory constraints.

Role Overview: We are seeking a Staff-level (or higher) model training expert to lead large-scale model training with a focus on low-precision and power efficiency. This role combines hands-on ownership of 100B+ parameter training runs on TPUs using JAX with responsibility for setting technical direction, mentoring engineers, and raising training quality across the organization.

Responsibilities: You will design, implement, and debug distributed training pipelines on TPUs using JAX across all major training phases, while driving efficiency and stability at scale. Core responsibilities include:

  • Leading pretraining, supervised fine-tuning (SFT), reinforcement learning (RL/RLHF), and post-training optimization
  • Designing data curation strategies including filtering, deduplication, dataset mixing, and curriculum design
  • Applying and evaluating post-training quantization (PTQ) and quantization-aware training (QAT) techniques for low-precision, power-efficient deployment
  • Optimizing convergence, throughput, memory usage, and numerical stability using advanced optimizers and parallelism strategies
  • Translating state-of-the-art research into reliable production training systems
  • Providing technical leadership through mentoring, design reviews, and cross-team collaboration

Basic Qualifications: You bring deep experience in large-scale ML systems and a strong foundation in modern model training, including:

  • 8–10+ years of experience in machine learning or AI
  • Strong Python programming skills with production-quality code
  • Hands-on experience training multi-billion-parameter models
  • Solid understanding of optimization, distributed training, and training dynamics
  • Experience with LLM training phases including pretraining, SFT, and RL-based methods
  • A demonstrated ability to mentor and technically lead other ML engineers

Preferred Qualifications: You have additional experience that directly aligns with our efficiency-focused mission, including:

  • Training very large models in the 100B+ parameter range
  • Deep experience with TPUs and JAX, including XLA and SPMD optimization
  • Hands-on application of PTQ and QAT for low-precision models
  • Familiarity with edge or power-constrained deployment targets
  • Experience with advanced optimizers, scaling laws, and compute-efficient training
  • Prior contributions to research efforts or open-source ML systems

Ideal Candidate Profile: You have personally trained large models end-to-end, understand why large-scale training runs fail and how to fix them, care deeply about efficiency and real-world deployment, enjoy mentoring others, and are comfortable operating at the intersection of research, systems engineering, and product constraints.