About Us
We build high-performance foundation models designed to run efficiently across a wide range of environments—from edge devices to large-scale deployments. Our work spans models from ~1B to 100B+ parameters across LLMs, diffusion models, and other modalities, with a strong focus on scalable training, efficient inference, and real-world deployment.
Role Overview
We are seeking a Senior-level (or higher) AI/ML engineer with deep expertise in systems and kernel development to lead efforts in optimizing low-level performance across our model stack. This role focuses on designing and implementing high-performance kernels that accelerate inference and training for highly efficient model architectures, including 1-bit and other compressed representations, across diverse hardware platforms.
Responsibilities
You will design, implement, and optimize high-performance kernels and low-level systems to maximize efficiency across a range of inference runtimes and hardware targets. Core responsibilities include:
- Designing and implementing custom kernels for model execution across GPUs and other accelerator hardware
- Optimizing inference performance for highly efficient model representations (e.g., 1-bit or quantized models); see the illustrative sketch after this list
- Improving throughput, latency, and memory efficiency across different inference runtimes and deployment environments
- Collaborating with model and systems teams to co-design architectures and execution strategies for maximum performance
- Profiling and debugging performance bottlenecks at the kernel, runtime, and system levels
- Translating advances in hardware-aware optimization into production-ready systems
- Providing technical leadership through mentoring, design reviews, and cross-functional collaboration
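To give a flavor of this work, here is a minimal, purely illustrative sketch (not code from our stack; the `binary_gemv` kernel, its bit-packed layout, and its naming are all hypothetical): a naive CUDA matrix-vector product over 1-bit weights packed 32 per 32-bit word, where a set bit encodes a weight of +1 and a clear bit encodes -1. A production kernel would tile, vectorize loads, use shared memory, and exploit popcount-style bit tricks when activations are quantized as well.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative only: naive GEMV over 1-bit weights packed 32 per word.
// Bit b of packed[row][w] encodes weight +1 (bit set) or -1 (bit clear).
// Assumes cols is a multiple of 32. Hypothetical layout, not our codebase.
__global__ void binary_gemv(const uint32_t* __restrict__ packed, // [rows x cols/32]
                            const float* __restrict__ x,         // [cols]
                            float* __restrict__ y,               // [rows]
                            int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    int words = cols / 32;
    float acc = 0.0f;
    for (int w = 0; w < words; ++w) {
        uint32_t bits = packed[row * words + w];
        for (int b = 0; b < 32; ++b) {
            // Decode one 1-bit weight to {+1, -1} and accumulate.
            float sign = ((bits >> b) & 1u) ? 1.0f : -1.0f;
            acc += sign * x[w * 32 + b];
        }
    }
    y[row] = acc;
}
```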
Basic Qualifications
You bring deep experience in systems engineering and performance optimization for AI/ML workloads:
- 5–8+ years of experience in systems engineering, machine learning infrastructure, or related fields
- Strong programming skills in C/C++ and/or CUDA (or equivalent low-level languages), with a track record of shipping production-quality code
- Hands-on experience developing and optimizing kernels for GPUs or other accelerators
- Solid understanding of computer architecture, memory hierarchies, and parallel programming models
- Experience profiling and optimizing performance-critical systems
- Proven ability to mentor and lead other engineers
Preferred Qualifications
You have additional experience aligned with high-performance AI systems and hardware-aware optimization:
- Experience optimizing inference for quantized or compressed models (e.g., low-bit or 1-bit representations)
- Familiarity with modern inference runtimes and compiler stacks (e.g., TensorRT, TVM, Triton, XLA, or similar)
- Experience working across different hardware platforms (e.g., GPUs, CPUs, custom accelerators)
- Knowledge of numerical methods and trade-offs in reduced-precision computation
- Experience improving system-level performance, including kernel fusion, scheduling, and memory optimization (see the fusion sketch after this list)
- Contributions to performance-critical systems, open-source frameworks, or hardware-aware ML research
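As an example of what kernel fusion buys, consider the sketch below (again purely illustrative and hypothetical, assuming a row-major activation tensor with a per-column bias): fusing a bias add and a ReLU into one kernel keeps the intermediate value in registers instead of round-tripping it through device memory, roughly halving memory traffic for this memory-bound pair of ops.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel-fusion sketch: bias add + ReLU in a single pass.
// Unfused, each op would read and write the full tensor from HBM;
// fused, the intermediate stays in a register.
__global__ void bias_relu_fused(const float* __restrict__ in,
                                const float* __restrict__ bias, // [cols]
                                float* __restrict__ out,
                                int n, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i] + bias[i % cols]; // bias broadcast per column
    out[i] = v > 0.0f ? v : 0.0f;     // ReLU, no intermediate round-trip
}
```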
Ideal Candidate Profile
You have built and optimized kernels or low-level systems that significantly improve performance for large-scale AI workloads. You understand how model architecture, numerical representation, and hardware interact, and you know how to push systems to their limits. You care deeply about efficiency and performance, enjoy working close to the hardware/software boundary, and are comfortable leading efforts that span models, runtimes, and infrastructure.