Job Description
Applied AI Engineer, Senior or Above
Location: Onsite 5 days a week in Austin, TX; relocation is offered.
Level is a learning technology company dedicated to helping students build real academic and life skills with confidence and joy. We combine proven curriculum principles with world-class interactive design to make meaningful practice something students want to come back to, not something they struggle through.
We support what teachers, schools, and parents are already doing by increasing student engagement with high-quality, standards-aligned practice that reinforces classroom learning. That’s why we’re building:
For students: Rewarding and motivating learning experiences that meet students where they are to build real academic proficiency.
For educators: Tools that fit naturally into instruction and help students stay engaged while reducing teacher workload.
For parents: Activities that help their child catch up or get ahead, build confidence, and minimize homework battles, whether learning happens at school or at home.
We are committed to helping every person maximize their potential and live a life of meaning. It’s a difficult problem that requires brilliant people and tremendous effort over time.
If you want to use your skills to make a difference in the world, we're seeking an Applied AI Engineer (senior or above) to build and operate highly available, production-grade web applications and services at scale. You’ll be at the core of the team building and operating the online platform that powers all of Level’s products.
What You'll Do:
Ship production agentic systems: Design and build agents and agentic workflows that solve a defined problem end to end. You own prompts, tools, retrieval, guardrails, observability, cost and latency budgets, and rollout (a minimal sketch of this kind of tool-calling loop follows this list).
Automate SME workflows: Identify high-leverage operational toil (review pipelines, content QA, labeling and ops loops, support triage, internal copilots) and partner with the subject-matter experts (SMEs) running those workflows to define success criteria, validate outputs, and replace meaningful chunks of that work with AI systems they trust. SMEs are co-owners of quality from the start of the project.
Own the evaluation loop and the golden datasets that anchor it: Build offline evals, LLM-as-judge with calibration, regression suites, and online metrics. Maintain versioned, decontaminated golden datasets covering intents, difficulty, edge cases, and adversarial inputs, and continuously enrich them with real production failures and SME-validated labels. The measurement plan is what decides whether a feature ships (a sketch of a minimal eval gate also follows this list).
Make AI features safe: Treat what you deliver as a regulated product. Design for relevant compliance frameworks (e.g., COPPA, FERPA) from day one, run safety and bias evals before launch and continuously after, and build the human-in-the-loop and content-filtering controls AI features need before they reach end users.
Hand off what you ship: Production features leave your hands with documentation, runbooks, an eval harness, and dashboards. An embed is not complete until the receiving team has shipped a change to the system without you in the room. Typical embed length is 4 to 12 weeks.
Make AI features easier for the rest of engineering to build: Create internal libraries, patterns, and playbooks so other teams can ship AI features without your direct involvement.
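To make the first item above concrete, here is a minimal sketch of a tool-calling agent loop with a turn-budget guardrail, the smallest version of what this role owns end to end. It uses the OpenAI Python SDK purely for illustration; the model name, the lookup_standard tool, and the MAX_TURNS budget are assumptions, not Level's actual stack.

```python
# A minimal sketch, not Level's implementation: one tool, one guardrail.
# Assumes the OpenAI Python SDK (v1+); the model name, the lookup_standard
# tool, and the MAX_TURNS budget are illustrative choices.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_standard",
        "description": "Look up a learning standard by its identifier.",
        "parameters": {
            "type": "object",
            "properties": {"standard_id": {"type": "string"}},
            "required": ["standard_id"],
        },
    },
}]

def lookup_standard(standard_id: str) -> str:
    # Hypothetical stand-in for a real retrieval call.
    return f"Standard {standard_id}: placeholder description."

MAX_TURNS = 5  # hard budget so a confused model cannot loop forever

def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(MAX_TURNS):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=TOOLS
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""  # model chose to answer directly
        messages.append(msg)  # keep the tool-call turn in the transcript
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": lookup_standard(**args),
            })
    return "Turn budget exhausted; escalating to a human."  # guardrail fallback
```

The same shape carries over to LangGraph, the Claude Agent SDK, or a homegrown stack; as the requirements below note, the framework matters less than the budgets, the schemas, and the fallback behavior.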
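In the same spirit, here is a minimal sketch of the eval gate described in the evaluation item: a golden dataset on disk, a pluggable grader, and a pass threshold that decides shipping. The JSONL layout, the field names, and the 95% threshold are illustrative assumptions; in practice the grader is a calibrated LLM-as-judge checked against SME labels.

```python
# A minimal sketch of an offline eval gate over a versioned golden dataset.
# The dataset format, field names, and threshold are assumptions, not
# Level's actual pipeline.
import json
from typing import Callable

def exact_grader(expected: str, actual: str) -> bool:
    # Placeholder grader; in practice a calibrated LLM-as-judge
    # (itself checked against SME labels) slots in here.
    return expected.strip().lower() in actual.strip().lower()

def run_evals(
    golden_path: str,
    system_under_test: Callable[[str], str],
    grader: Callable[[str, str], bool] = exact_grader,
    pass_threshold: float = 0.95,
) -> bool:
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]  # one golden case per JSONL line
    failures = []
    for case in cases:
        actual = system_under_test(case["input"])
        if not grader(case["expected"], actual):
            failures.append((case, actual))
    score = 1 - len(failures) / len(cases)
    print(f"{score:.1%} pass rate on {len(cases)} golden cases")
    for case, actual in failures[:10]:  # surface examples for triage, not just a number
        print("FAIL:", case["input"], "->", actual)
    return score >= pass_threshold  # the gate: measurement decides whether it ships
```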
What You Need:
At least a year of hands-on experience building agentic systems on one or more modern stacks (LangGraph, the Anthropic SDK / Claude Agent SDK, OpenAI Agents SDK, Pydantic-AI, Mastra, LlamaIndex, CrewAI, or a homegrown stack). We care that you have built and operated agentic systems in production, not which framework you used.
Strong familiarity with Python plus one typed language for production services (TypeScript, Go, or similar), along with cloud experience (AWS or GCP) and containerized deployment.
A senior- or staff-level software engineering foundation (formal or self-taught), with several years of production environment experience and a track record of leading systems to launch.
Multiple shipped LLM-powered features in a production environment, with concrete stories about what broke, how you fixed it, and what you would do differently.
Practical knowledge of common agentic patterns: ReAct, tool use with structured schemas, prompt chaining, routing, orchestrator-workers, evaluator-optimizer / reflection, and human-in-the-loop. You can decide when a deterministic workflow is the right answer instead of an autonomous agent.
Hands-on experience in a production environment with retrieval: chunking, embeddings, hybrid search, re-ranking, metadata filtering, and the failure modes of each (a small hybrid-retrieval sketch follows this list). Working knowledge of grounding techniques that anchor generated answers in retrieved evidence (citation and quote extraction, faithfulness and refusal evals, post-hoc consistency checks).
Strong prompt-engineering practice: zero-shot, few-shot, and many-shot patterns; example selection and ordering; in-context learning and chain-of-thought.
Comfort with structured output and validation in production (provider-native structured outputs, Instructor, Pydantic-AI, Outlines, or a comparable approach); a schema-validation sketch also follows this list.
Disciplined evaluation practice. You do not rely on subjective review to decide whether a system is ready.
Strong written and verbal communication. You can explain an architectural trade-off to an executive and to a junior engineer in the same week.
You are comfortable using AI coding tools heavily in your implementation workflow while you own problem framing, design choices, and verification. We measure your output by working systems delivered, not lines of code written.
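As referenced in the retrieval item above, here is a toy sketch of one hybrid-search building block: reciprocal rank fusion (RRF) over a lexical ranking and a vector ranking. The document IDs are made up; k=60 is the constant from the original RRF paper.

```python
# A toy sketch of hybrid retrieval via reciprocal rank fusion (RRF), assuming
# you already have a lexical ranking (e.g., BM25) and a vector ranking
# (embedding similarity) over the same corpus. Document IDs are illustrative.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # RRF: each document scores sum(1 / (k + rank)) across the input rankings.
    # k dampens the advantage of rank-1 hits; 60 is the standard value.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_7", "doc_2", "doc_9"]   # e.g., BM25 top hits
semantic = ["doc_2", "doc_4", "doc_7"]  # e.g., cosine-similarity top hits
print(rrf_fuse([lexical, semantic]))    # doc_2 and doc_7 rise to the top
```

Fusing ranks rather than raw scores matters because BM25 scores and cosine similarities live on incompatible scales; that is exactly the kind of failure-mode fluency this requirement asks for.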
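And for the structured-output item, a minimal Pydantic-based validation loop that feeds schema errors back to the model for repair. The PracticeItem schema and the call_model stub are hypothetical; in practice, provider-native structured outputs, Instructor, or Outlines would replace this hand-rolled retry.

```python
# A minimal sketch of schema-validated model output using Pydantic. The
# schema and the model call are stubs; retry-on-invalid is the pattern
# being shown, not any specific library's API.
from pydantic import BaseModel, Field, ValidationError

class PracticeItem(BaseModel):
    # Hypothetical schema for a generated practice question.
    question: str
    answer: str
    difficulty: int = Field(ge=1, le=5)

def generate_item(prompt: str, max_attempts: int = 3) -> PracticeItem:
    for _ in range(max_attempts):
        raw = call_model(prompt)  # stand-in for any provider call returning JSON text
        try:
            return PracticeItem.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation errors back so the model can repair its output.
            prompt = f"{prompt}\n\nYour last output was invalid:\n{err}\nReturn corrected JSON."
    raise RuntimeError("Model never produced schema-valid output")

def call_model(prompt: str) -> str:
    # Stub so the sketch runs; a real system calls an LLM here.
    return '{"question": "What is 3 x 4?", "answer": "12", "difficulty": 1}'
```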
Nice to Have:
Advanced retrieval experience: GraphRAG, agentic retrieval, evaluation-driven retrieval tuning, and hybrid retrieval at scale.
Direct or transferable experience with safety, privacy, and policy constraints in user-facing AI. K-12 or other regulated-domain experience is a strong plus.
Experience with prompt-optimization frameworks (DSPy, TextGrad, AdalFlow) where they paid off in production.
A public repo, package, gist, or technical write-up of meaningful AI work, or a representative project you can describe in detail under your confidentiality constraints.
Open-source contributions to AI tooling (frameworks, agents, evals, MCP servers).
A note on our interview process:
One of our interview rounds is an AI-assisted coding session. Bring your own setup (IDE, AI coding assistant, agentic tools, whatever you actually use day to day) and solve a realistic problem live with AI in the loop. We are looking at how you collaborate with AI tools: how you prompt, validate output, catch bad suggestions, decide when to override, and produce code that meets production standards. It is not a cleanroom algorithm round.