ThirdLaw builds the control layer for enterprise AI. This role involves designing and building real-time evaluation logic to enforce AI safety policies using LLMs, semantic similarity, and classifiers. Engineers will integrate with foundation models and build tools for monitoring and debugging.
Responsibilities
Design and build real-time evaluation logic that determines whether LLM prompts or outputs violate enterprise policies.
Implement evaluation strategies using a mix of semantic similarity, foundation model scoring, rule-based systems, and statistical checks.
Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking).
Prototype, tune, and productize small language models and prompt templates for classification, labeling, or scoring.
Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage layers.
Build tools to observe, debug, and improve evaluator performance across real-world data distributions.
Define abstractions for reusable evaluation components that can scale across use cases.
Requirements
7+ years of experience in ML systems or AI engineering roles, with at least 1–2 years working directly with LLMs, NLP pipelines, or semantic search.
Deep understanding of foundation models (e.g. OpenAI's GPT models, Claude, Mistral, Llama) and how to work with them through hosted APIs or open-source releases.