Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)
Job Description
About TrueFoundry
Every production AI system, whether it's powering customer support, writing code, analyzing financial data, or diagnosing medical conditions, needs the same foundational infrastructure. A way to route between models. A way to manage tools and integrate them securely. A way to orchestrate agents and enforce governance. A unified compute layer to run it all.
That infrastructure layer is being built right now.
We're TrueFoundry, and we're building it. We're looking for a Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps) to join the team.
The Problem We're Solving
Companies are moving beyond simple chatbots to production agentic systems. These systems route between OpenAI, Anthropic, Google, and self-hosted models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents.
The infrastructure to support this doesn't exist yet. You can't just duct-tape together a few API calls and call it production-ready.
You need a control plane that handles:
- Intelligent routing with observability, cost policies, and fallback logic
- Centralized tool and MCP server management with security and lifecycle controls
- Agent orchestration with governance and guardrails
- A unified compute layer to run self-hosted models, custom tools, and agents
We've built two products to solve this:
AI Gateway is the control plane: five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.
AI Deploy is the compute layer: a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.
We're Series A, backed by Intel Capital and Sequoia. Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.
We're looking for ML systems engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is the place for you.
What You’ll Work On
- Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
- Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters using PyTorch, Kubeflow, and other orchestration tools.
- Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
- Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
- Help shape internal standards and best practices across the engineering team for high-scale ML workloads.
What We’re Looking For
- 5+ years of hands-on experience building and deploying ML systems at scale.
- 5+ years of writing production-quality, high-performance code.
- Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
- Experience with PyTorch, higher-level ML frameworks, and inference engines such as vLLM or TensorRT.
- Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
- A pragmatic mindset—you know when to optimize and when to ship.
- Bonus: Familiarity with open-source LLM training/fine-tuning.
Why Join TrueFoundry?
- Work directly with ex-Facebook engineers and founders who are alumni of IIT Kharagpur, UC Berkeley, and Y Combinator.
- First-hand exposure to building and scaling a deep-tech startup, with insights you'll carry if you want to start your own one day.
- Be part of a fearlessly experimental culture focused on customer success and long-term impact.
- Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).