Staff Software Engineer - AI Agent Evaluations

ID.me

MOUNTAIN VIEW · US

$218K–$271K

Full-time

STAFF

About the Role

ID.me seeks a Staff Software Engineer to define and lead the discipline of testing AI agents, evaluating LLM behavior, and ensuring the reliability of agentic systems. You will build evaluation infrastructure, production observability, and developer tooling for AI features, while mentoring engineers and establishing quality standards across the organization.

Responsibilities

Define AI quality standards and own the evaluation framework for AI agents
Build and maintain evaluation pipelines for LLM outputs and agent behavior
Instrument agentic systems for production observability and behavioral drift detection
Lead the design of test suites for non-deterministic AI outputs
Champion developer experience by building internal tooling and feedback loops
Drive AI-first engineering culture and mentor engineers on AI testing best practices
Collaborate with Security, Platform, Product, and AI/ML teams to embed quality gates

Requirements

Bachelor's degree in Computer Science, Engineering, or equivalent experience
8+ years building and operating production software systems
Demonstrated experience evaluating or testing LLM-powered features or autonomous agents in production
Proficiency with AI-assisted development tools (Claude Code, Cursor, or equivalent)
Strong backend engineering fundamentals in Python, Java, Go, or equivalent
Experience designing test infrastructure, CI/CD quality gates, or evaluation pipelines at scale
Experience improving developer experience through internal tooling
Proven ability to lead cross-team technical initiatives
Experience building eval frameworks for LLM agents (e.g., LLM-as-judge, human-in-the-loop)
Familiarity with agentic frameworks (Claude API/Anthropic SDK, LangChain, LangGraph, CrewAI, etc.)
Production monitoring experience for AI systems
Red-teaming or adversarial testing experience for AI models or agents

Nice to Have

Background in identity verification, fraud detection, or regulated industries
Familiarity with Anthropic's model evaluation methodology or similar published eval research
Experience with observability tooling (Datadog, OpenTelemetry) applied to AI workloads
Track record of building developer tooling or platforms adopted by other teams

ID.me

McLean · US · 1140+ employees

ID.me is an online identity network company that provides a digital identity wallet for secure identity verification and authentication. It allows users to prove their identity once and access various services across government agencies, healthcare organizations, and commercial retailers.