Senior SRE/DevOps Engineer

TrueFoundry | AI Infrastructure
Bengaluru, India
Engineering

Job Description

About TrueFoundry

Every production AI system, whether it's powering customer support, writing code, analyzing financial data, or diagnosing medical conditions, needs the same foundational infrastructure. A way to route between models. A way to manage tools and integrate them securely. A way to orchestrate agents and enforce governance. A unified compute layer to run it all.

That infrastructure layer is being built right now.

We're TrueFoundry, and we're building it. We're looking for a Senior SRE/DevOps Engineer to join the team.

The Problem We're Solving

Companies are moving beyond simple chatbots to production agentic systems. These systems route between OpenAI, Anthropic, Google, and self-hosted models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents.

The infrastructure to support this doesn't exist yet. You can't just duct-tape together a few API calls and call it production-ready.

You need a control plane that handles:

  • Intelligent routing with observability, cost policies, and fallback logic
  • Centralized tool and MCP server management with security and lifecycle controls
  • Agent orchestration with governance and guardrails
  • A unified compute layer to run self-hosted models, custom tools, and agents

We've built two products to solve this:

AI Gateway is the control plane: five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.

AI Deploy is the compute layer: a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.

We're Series A, backed by Intel Capital and Sequoia. Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.

Roles / Responsibilities:

  • Write Terraform modules to deploy infrastructure components on AWS, such as Kubernetes, RDS, Prometheus, Grafana, and static websites
  • Work closely with TrueFoundry customers, gaining a deep understanding of the TrueFoundry platform to ensure smooth deployments, reliable operations, and adoption of best practices. The role also involves training and onboarding new customers, helping them implement TrueFoundry effectively, and driving platform adoption and operational excellence across customer teams
  • Configure networking, autoscaling, continuous deployment, security, and multiple environments
  • Make sure the infrastructure is SOC2, ISO 27001 and HIPAA compliant
  • Automate all the steps to provide a seamless experience to developers.

Requirements

  • Experience with Golang or Python is a must
  • 4+ years of work experience writing clean production code
  • Well versed in maintaining infrastructure as code (Terraform, CloudFormation, etc.). High proficiency with Terraform / Terragrunt is absolutely critical
  • Experience setting up CI/CD pipelines from scratch
  • Experience with ETL pipelines and big data infrastructure
  • Understanding of common security issues

Interview Process

We will complete the entire interview process within 1 week:

  1. Kubernetes Focused Round
  2. Terraform Focused Round
  3. Past Projects Discussion
  4. Cultural Fit Round

Benefits at TrueFoundry

  • Work with top engineers who led the Facebook Videos and Infrastructure team
  • Flexible working hours and the chance to work directly with the co-founders
  • Team discussions on product and business growth strategies
  • Insurance and other benefits like learning credits

Our Way Of Working

  • An opportunity to work on something that really matters
  • A fast-paced environment to learn and grow
  • High transparency in decision-making
  • High autonomy; freedom to take risks, to experiment, and to fail
  • Full ownership and autonomy
  • There is no glass ceiling for this role that limits your growth
  • We promise a meaningful journey and opportunities to learn and grow

About Us: TrueFoundry is a cloud-native machine learning training and deployment PaaS on Kubernetes. It is an LLMOps and MLOps platform that enables machine learning teams to train and deploy models at the speed of Big Tech with 100% reliability and scalability, allowing them to save costs and train models faster.

TrueFoundry also facilitates faster experimentation with open-source LLMs on your cloud infrastructure while reducing operational expenses. TrueFoundry has built a fearlessly experimental, customer-obsessed team that is making discoveries to fundamentally change how people build and consume business applications. Today, we're partnering with the world's leading companies to transform how they use data and technology.

Team:

Founded by alumni of IIT Kharagpur and UC Berkeley and by ex-Facebook engineers, we have had folks from IITs, ISB, Facebook, Amazon, GoJek, etc. Funded by top global investors (Sequoia Capital, ENIAC) and angels (Naval Ravikant, Anthony Goldbloom). Second-time founders: their previous Postmanus startup (EntHire.co) was acquired by InfoEdge and was selected to be a part of Y Combinator.

First seen: May 1, 2026
Last updated: May 1, 2026