About Us
We build high-performance foundation models designed to run efficiently across a wide range of environments, from edge devices to large-scale deployments. Our work spans models from ~1B to 100B+ parameters, including LLMs, diffusion models, and models for other modalities, with a strong focus on scalable training, efficient inference, and real-world deployment.
Our Bonsai family of 1-bit and ternary models is designed to dramatically improve the efficiency of modern AI systems, enabling advanced intelligence to run with significantly lower memory usage, latency, and energy consumption across cloud and edge environments.
Role Overview
We are seeking a senior-level (or higher) AI/ML engineer with expertise in post-training systems to help develop our post-training platform for Bonsai models. This role focuses on building scalable systems for fine-tuning, reinforcement learning, evaluation, orchestration, and model lifecycle management across cloud infrastructure and partner-hosted environments.
Responsibilities
You will design, build, and optimize platform infrastructure supporting post-training workflows for highly efficient AI models. Core responsibilities include:
- Building scalable systems for fine-tuning, reinforcement learning, evaluation, and other post-training workflows for Bonsai models
- Developing infrastructure for data ingestion, dataset preparation, orchestration, artifact storage, logging, telemetry, and cost tracking
- Supporting post-training techniques including LoRA, full fine-tuning, PPO, GRPO, DPO, and related optimization workflows
- Building reliable multi-tenant infrastructure with strong isolation, access control, observability, and production reliability
- Developing systems for model evaluation, benchmarking, experiment tracking, and model lifecycle management
- Collaborating with model, infrastructure, and product teams to improve training efficiency, usability, and deployment readiness
- Translating advances in post-training workflows and AI infrastructure into robust, production-ready platforms
Basic Qualifications
You bring experience building AI/ML infrastructure and post-training systems:
- 5–8+ years of experience in machine learning systems, distributed systems, infrastructure engineering, or related fields
- Strong programming skills in Python and experience building production-quality AI/ML systems
- Hands-on experience building infrastructure for fine-tuning, reinforcement learning, or large-scale AI workflows
- Solid understanding of distributed systems, orchestration, and modern AI/ML pipelines
- Experience deploying AI/ML systems in cloud or production infrastructure environments
- Familiarity with observability, monitoring, and debugging production systems
- Proven ability to mentor and collaborate effectively with other engineers
Preferred Qualifications
You have additional experience aligned with scalable post-training platforms and efficient AI systems:
- Experience building platforms for fine-tuning, reinforcement learning, evaluation, and model management for LLMs or multimodal models
- Familiarity with post-training methods such as LoRA, RLHF, PPO, DPO, GRPO, or related optimization approaches
- Experience working with quantized, compressed, or low-bit models (e.g., 1-bit or ternary representations)
- Familiarity with orchestration systems, multi-tenant infrastructure, API gateways, and production platform operations
- Experience building developer-facing platforms including SDKs, CLIs, APIs, or self-serve tooling
- Experience supporting cloud-based or partner-hosted AI workflows and deployment pipelines
- Contributions to open-source AI infrastructure, tooling, or model training frameworks
Ideal Candidate Profile
You have built or significantly contributed to AI infrastructure platforms that support large-scale post-training workflows. You understand the challenges involved in fine-tuning, reinforcement learning, evaluation, and model management for modern AI systems, and you know how to build reliable, scalable infrastructure around them. You care deeply about usability, efficiency, and developer experience, enjoy solving complex systems problems, and thrive at the intersection of AI models, infrastructure, and real-world deployment.