About Us
We build high-performance foundation models designed to run efficiently across a wide range of environments—from edge devices to large-scale deployments. Our work spans models from ~1B to 100B+ parameters across LLMs, diffusion models, and other modalities, with a strong focus on scalable training, efficient inference, and real-world deployment.
Role Overview
We are seeking a Staff-level (or higher) AI/ML engineer to lead large-scale model training efforts. This role combines hands-on ownership of large training runs with responsibility for setting technical direction, mentoring engineers, and improving model quality and system performance across the organization.
Responsibilities
You will design, implement, and optimize distributed training systems for large-scale models across all major training phases. Core responsibilities include:
- Leading model development across pretraining, fine-tuning, and post-training stages
- Designing and improving data pipelines, including curation, filtering, deduplication, and dataset composition
- Improving training efficiency, scalability, and reliability across large distributed systems
- Optimizing model performance with respect to convergence, throughput, memory usage, and stability
- Translating cutting-edge research into robust, production-ready systems
- Providing technical leadership through mentoring, design reviews, and cross-functional collaboration
Basic Qualifications
You bring deep experience in large-scale AI/ML systems and strong fundamentals in modern model training:
- 8–10+ years of experience in machine learning or AI, or a strong publication record
- Strong Python programming skills with production-quality code
- Hands-on experience training large-scale models (multi-billion parameters)
- Solid understanding of optimization, distributed training, and training dynamics
- Experience with modern model training workflows (e.g., pretraining, fine-tuning, reinforcement learning approaches)
- Proven ability to mentor and lead other AI/ML engineers
Preferred Qualifications
You have additional experience with large-scale, high-performance AI/ML systems:
- Experience training very large models (tens to hundreds of billions of parameters)
- Familiarity with modern accelerator hardware (e.g., GPUs or TPUs) and distributed training frameworks
- Experience improving system performance, resource utilization, and training efficiency
- Exposure to deployment environments with real-world constraints (e.g., latency, cost, or hardware limitations)
- Experience with advanced optimization techniques and scaling strategies
- Contributions to research, publications, or open-source AI/ML systems
Ideal Candidate Profile
You have led or significantly contributed to training large models end-to-end, understand common failure modes in large-scale training systems, and know how to debug and improve them. You care about building efficient, reliable systems that work in real-world settings, enjoy mentoring others, and thrive at the intersection of research, engineering, and product.