Building Robust Evaluations for Production-Ready LLM Workflows

March 22, 2025

When shipping workflows powered by large language models (LLMs) into production, quality assurance is non-negotiable. At Treater, we've developed an LLM workflow evaluation pipeline designed to automate quality control, continuously improve LLM-generated outputs, and confidently deploy high-quality LLM interactions at scale.

This post will cover:

Evaluation System Architecture: Our multi-layered defense system for ensuring quality outputs

Insights That Shaped Our Approach: Critical lessons we've learned about observability, examples, and evaluation design

Next Steps for Continuous Improvement: Where we're headed next, including our Prompt Engineering Studio

Conclusion: Key takeaways and future outlook

Evaluation System Architecture

Our evaluation pipeline is structured to quickly identify and fix issues in LLM-generated content, as well as continuously improve our generation pipeline. Here's how each component works to ensure the system is continuously performing within guardrails and improving its outputs.

1. Deterministic Evals: The Safety Net

Deterministic evals are straightforward, rule-based checks that enforce basic standards:

  • Character limits: Ensuring outputs stay within defined length constraints.
  • Formatting consistency: Verifying structural integrity of responses.
  • Avoidance of banned terms: Blocking inappropriate or imprecise language.
  • Detection of formatting errors: Catching gibberish or malformed outputs (e.g., unexpected XML, JSON, or Markdown).

These rapid checks filter out obvious errors early in the pipeline, acting as a safety net to prevent them from reaching more resource-intensive stages.

In practice, the failure rate at this stage is under 2%, but given the stochastic nature of LLMs, it's important to have these checks in place.
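To make this concrete, here's a minimal sketch of what rule-based checks like these might look like. The limits, banned terms, and check names are illustrative assumptions, not our production values.

```python
import json
import re

# Illustrative constants; real limits and term lists live in configuration.
MAX_CHARS = 600
BANNED_TERMS = {"synergy", "leverage", "world-class"}

def run_deterministic_evals(output: str) -> list[str]:
    """Return a list of human-readable failure reasons (empty list = pass)."""
    failures = []

    # Character limits: keep outputs within the preferred content length.
    if len(output) > MAX_CHARS:
        failures.append(f"exceeds {MAX_CHARS} character limit ({len(output)} chars)")

    # Banned terms: block imprecise or inappropriate language.
    lowered = output.lower()
    for term in BANNED_TERMS:
        if term in lowered:
            failures.append(f"contains banned term: '{term}'")

    # Formatting errors: catch stray markup that should never reach users.
    if re.search(r"</?[a-zA-Z][^>]*>", output):
        failures.append("contains unexpected XML/HTML tags")
    if output.strip().startswith("{"):
        try:
            json.loads(output)
            failures.append("output is raw JSON rather than prose")
        except json.JSONDecodeError:
            pass

    return failures
```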

2. LLM-Based Evals: Nuanced Assessments

For subtler issues, like tone, clarity, or adherence to specific guidelines, we employ LLM-driven evals. We find LLMs-as-judges excellent at verifying the correctness of our generative prompts' outputs:

  • Multiple asynchronous checks: We run multiple calls in parallel, covering relevance, style, and adherence to specific guidelines.
  • Deep context evaluations: Assessing previous interactions with LLMs to ensure contextual appropriateness.
  • Interpretability: We require LLMs to explain their judgments step by step, not just provide pass/fail verdicts.

These evals catch nuanced flaws that deterministic rules might miss, ensuring outputs are not only correct but also aligned with brand guidelines and user expectations.

The interpretability of evaluations has proven crucial—we always require our judge LLMs to explain their reasoning. Unlike black-box evaluation systems, LLM judges provide detailed explanations that make failures actionable, allowing us to rapidly debug, fix issues, and continuously improve our system.

The "reasoning" behind failures helps us:

  • Accelerate debugging: Providing clear paths to identify and fix issues rather than just flagging failures.
  • Enable automated improvements: Feeding directly into our rewriting system to guide targeted fixes.
  • Support prompt refinement: Offering insights that can be systematically incorporated into better prompts.
  • Reveal evaluation blind spots: Helping us identify when the evaluation itself needs refinement.

For example, when an eval indicates "reads too formally," the explanation might specify that "technical jargon like 'utilize' and 'implement' creates unnecessary distance from the reader," providing actionable feedback for both immediate fixes and long-term improvements.
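Here is a sketch of how these parallel, binary, explain-your-answer judge calls might be wired together. The `call_llm` helper and the prompt wording are hypothetical stand-ins for a real provider SDK and our actual judge prompts.

```python
import asyncio
import json

async def call_llm(prompt: str) -> str:
    """Hypothetical async LLM client; substitute your provider's SDK here."""
    raise NotImplementedError

JUDGE_PROMPT = """You are evaluating a draft message against one guideline.
Guideline: {guideline}
Draft: {draft}

Answer with JSON only: {{"pass": true or false, "reasoning": "step-by-step explanation with specific examples"}}"""

async def judge(draft: str, guideline: str) -> dict:
    raw = await call_llm(JUDGE_PROMPT.format(guideline=guideline, draft=draft))
    verdict = json.loads(raw)  # a real system needs more robust parsing here
    verdict["guideline"] = guideline
    return verdict

async def run_llm_evals(draft: str, guidelines: list[str]) -> list[dict]:
    # Multiple asynchronous checks: one judge call per guideline, run in parallel.
    return await asyncio.gather(*(judge(draft, g) for g in guidelines))
```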

3. Automatic Rewriting System

When evals fail, our rewriting system automatically revises outputs:

  • Input sources: Original content and failed test details with explanations for why they failed, recent edits to similar content that produced successful results, and relevant guidelines.
  • LLM rewriter: Generates improved versions that directly address identified issues.
  • Fix-and-verify loop: Revised outputs are re-evaluated to confirm they meet standards.

For example, if content is flagged for having too much jargon, the rewriter adjusts it based on previous edit patterns that led to successful content, perhaps simplifying phrasing or using analogies. This creates an efficient, automated correction process.

It's important to note that while our rewriting system is valuable, we view it as a safety net, not a crutch. Our philosophy is that:

  1. First-pass generation should be accurate: Ideally, most outputs would pass evaluations without requiring rewrites.
  2. Rewrites serve as temporary patches: When we observe consistent rewriting patterns, we incorporate those lessons back into our core generation prompts.
  3. The feedback loop is critical: Rewrite activity serves as a signal for where our generation system needs improvement.

This approach prevents the rewriter from masking fundamental issues in our generation prompts. By tracking what gets rewritten and why, we continuously refine our primary generation system, reducing the need for rewrites over time.
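Below is a simplified sketch of the fix-and-verify loop, reusing the hypothetical helpers from the earlier sketches; the retry cap is an illustrative choice rather than our production setting.

```python
MAX_REWRITE_ATTEMPTS = 2  # illustrative cap; escalate to a human after this

async def generate_with_rewrites(draft: str, guidelines: list[str],
                                 recent_successful_edits: list[str]) -> str:
    for attempt in range(MAX_REWRITE_ATTEMPTS + 1):
        # Evaluate: cheap deterministic checks first, then LLM judges.
        failures = run_deterministic_evals(draft)
        verdicts = await run_llm_evals(draft, guidelines)
        failures += [v["reasoning"] for v in verdicts if not v["pass"]]

        if not failures:
            return draft  # passed all checks

        if attempt == MAX_REWRITE_ATTEMPTS:
            raise ValueError(f"Could not fix draft after {attempt} rewrites: {failures}")

        # Rewrite: feed the original content, failure explanations,
        # similar successful edits, and guidelines back into the LLM.
        draft = await call_llm(
            "Revise the draft so it addresses every failure.\n"
            f"Draft: {draft}\n"
            f"Failures: {failures}\n"
            f"Examples of successful edits: {recent_successful_edits}\n"
            f"Guidelines: {guidelines}"
        )
```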

Fig 1: Our automated rewriting workflow, showing how inputs (original content, failed test details, recent successful edits, and guidelines) feed into evaluation, LLM rewriting, and re-verification in a fix-and-verify loop.

4. Human Edit Analysis: Continuous Feedback Loop

Human expertise is invaluable for refining LLM outputs. Our human edit analysis system:

  • Daily analysis: Examines differences between LLM-generated and human-edited outputs.
  • LLM-driven insights: Categorizes edits and suggests prompt or guideline updates. We feed a reasoning model the day's edits along with all of our existing evaluations, and ask it to find patterns in the edits and identify the evaluation guidelines we're missing.
  • Iterative improvements: Insights are shared with the engineering team via Slack, driving ongoing refinement.

For instance, if our team's human edits consistently simplify language, we adjust our prompts to prioritize clarity upfront. This feedback loop ensures our pipelines evolve in line with real-world needs.
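As a rough sketch, the daily analysis can start by diffing each generated/edited pair and handing the batch to a reasoning model. `call_llm` is again a hypothetical client, and the prompt wording is illustrative.

```python
import difflib

def build_edit_analysis_prompt(pairs: list[tuple[str, str]], eval_guidelines: str) -> str:
    """pairs: (llm_generated, human_edited) texts collected over the day."""
    diffs = []
    for generated, edited in pairs:
        diff = "\n".join(difflib.unified_diff(
            generated.splitlines(), edited.splitlines(),
            fromfile="llm_generated", tofile="human_edited", lineterm=""))
        if diff:  # skip outputs that were accepted unchanged
            diffs.append(diff)

    return (
        "Here are today's diffs between LLM-generated and human-edited outputs:\n\n"
        + "\n\n".join(diffs)
        + "\n\nHere are our current evaluation guidelines:\n"
        + eval_guidelines
        + "\n\nCategorize the edits, find recurring patterns, and identify "
          "guidelines or prompt changes that would have prevented them."
    )

# The resulting prompt goes to a reasoning model, and its findings are
# posted to Slack for the engineering team to review.
```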

RLHF-Inspired Continuous Improvement

Our evaluation pipeline fundamentally aims to drive the diff between LLM-generated outputs and human-edited outputs to zero. This mirrors the philosophy behind Reinforcement Learning from Human Feedback (RLHF), but applied to our specific context:

  • Human feedback as ground truth: Just as RLHF uses human preferences to optimize models, we use human edits to refine our system.
  • Hyperparameter optimization: We systematically tune prompts, temperature settings, model choices, and even provider selection based on human feedback.
  • Refinement cycles: Insights from human edits feed into the analyzer, and those insights are used to refine the prompts.

This approach has systematically reduced the gap between LLM-generated and human-quality outputs, with measurable improvements in acceptance rates and decreasing edit volumes over time.

Eventually, we may use this data to refine a pre-trained model. However, this test-time prompt-engineering approach is extremely cost-effective and enables us to iterate and react within minutes.

Insights That Shaped Our Approach

Building this pipeline has revealed critical lessons about what truly moves the needle in LLM-powered systems.

Observability First: Finding Hidden Bugs

When building complex evaluation systems, observability must be your first priority.

The harder something is to measure and understand, the more likely it hides easy-to-fix problems.

When we first started, many pipeline issues were only discovered through painful manual reviews because key stages lacked adequate tracking. To resolve this, we implemented:

  • Comprehensive tracking: Saving inputs, outputs, prompt versions, and evaluation checkpoints, so we can observe exactly what happened at every step of the way.
  • Dynamic completion graphs: Visually mapping the unique paths each LLM-generated request takes, highlighting bottlenecks and redundant processes.
  • Tight integration with prompts and evals: Ensuring observability tools can trace an output's journey from prompt to evaluation to potential rewrite.

These tools drastically simplified debugging and optimization. It's now easy for us to see at what point in a multi-step pipeline LLM outputs start deviating from desired results.

We've learned that observability, prompt engineering, and evaluations must function as an integrated system with continuous feedback between components. Isolated measurements provide limited value—the real insights come from tracing the full lifecycle of each output and understanding the relationships between system components.
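A minimal sketch of what this kind of tracking can look like, assuming a simple JSONL sink: the decorator, field names, and `tone_check` example are illustrative rather than our actual implementation.

```python
import functools
import json
import time
import uuid

TRACE_PATH = "traces.jsonl"  # illustrative sink; a real system might use a DB or tracing service

def traced(step_name: str, prompt_version: str):
    """Record every call's inputs, outputs, and timing for later inspection."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "trace_id": str(uuid.uuid4()),
                "step": step_name,
                "prompt_version": prompt_version,
                "inputs": {"args": [str(a) for a in args],
                           "kwargs": {k: str(v) for k, v in kwargs.items()}},
                "started_at": time.time(),
            }
            try:
                result = fn(*args, **kwargs)
                record["output"] = str(result)
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                record["duration_s"] = time.time() - record["started_at"]
                with open(TRACE_PATH, "a") as f:
                    f.write(json.dumps(record) + "\n")
        return wrapper
    return decorator

@traced(step_name="tone_check", prompt_version="v12")
def tone_check(draft: str) -> bool:
    ...  # would call out to the judge prompt
```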

Binary Variables Beat Continuous Ones in Evaluations

One approach we tried early on—and subsequently abandoned—was using numeric scoring systems for evaluations (e.g., "Rate this output's clarity from 1-10"). In practice, these scores proved problematic:

  • Hallucinated precision: Models would confidently output scores without consistent reasoning.
  • Threshold ambiguity: Determining what score constituted a "pass" was subjective and inconsistent.
  • Limited actionability: A score of "7/10" provided little guidance on what specifically needed improvement.

Instead, we now design all our evaluations as binary pass/fail tests with clear criteria and required explanations. For example, rather than "Rate jargon 1-10," we ask "Is this output jargon-heavy? (yes/no). Explain your reasoning with specific examples."

This approach yields more consistent, interpretable, and actionable results that directly inform our rewriting and improvement processes.

Context is King

Incorporating contextual awareness, such as previous interactions with our customers, significantly improved evaluation accuracy and the relevance of LLM-generated content. Initially, evaluating outputs in isolation led to irrelevant results. By including prior context, we boosted coherence and ensured outputs were appropriate for the entire interaction.

The single highest-impact improvement to our system came from providing relevant examples to our LLMs during generation. While sophisticated evaluation and rewriting systems are valuable, nothing beats showing the model what good looks like.

The challenge wasn't in the prompting technique itself, but in designing systems that intelligently:

  1. Capture metadata alongside each previous output of an LLM
  2. Store human edits with annotations on why changes were made
  3. Efficiently retrieve the most contextually appropriate examples at test time

For example, when asking for information about a specific product's recall process, our system now identifies and provides examples of previous, similar queries, dramatically improving first-pass quality.
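A stripped-down sketch of the retrieval step: store each prior output with metadata and the human edit's annotation, then pull the most similar ones into the generation prompt. The lexical similarity function here is a deliberately simple stand-in for whatever embedding-based search a production system would use, and the field names are illustrative.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Example:
    query: str            # the request that produced this output
    final_output: str     # the human-approved version
    edit_notes: str       # why the human changed it (if they did)
    metadata: dict        # channel, product, customer segment, etc.

def retrieve_examples(request: str, store: list[Example], k: int = 3) -> list[Example]:
    """Return the k most similar prior examples to include in the generation prompt."""
    # Naive lexical similarity as a stand-in for embedding search.
    scored = sorted(
        store,
        key=lambda ex: SequenceMatcher(None, request.lower(), ex.query.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]
```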

Next Steps for Continuous Improvement

We're committed to evolving our pipeline. Here are our strategic focus areas, which we are already working on:

Prompt Engineering Studio: Our Observability and Evaluation Tooling

Building on our emphasis on observability, we developed a tooling suite for our LLM pipelines, which involve an average of 8-10 interconnected LLM calls across 10 different prompts for first-pass generation alone. This tooling is essential because even minor prompt variations can compound, significantly affecting final outputs.

Our Prompt Engineering Studio serves as both an observability platform and evaluation environment that includes:

Simulation Environment for System-Wide Testing

Rather than evaluating isolated prompts, our simulator tests entire previously-executed pipeline runs as a unified system. The simulator can be run with a subset of prompts or tools modified from their original versions and tested against previously validated outputs.

This holistic approach is crucial because it provides visibility into:

  • Emergent behaviors: Interactions between agentic steps often produce results that can't be predicted by testing individual components.
  • Cumulative effects: Small variations compound across multiple calls, creating significant downstream impacts.
  • System-level metrics: We can measure end-to-end performance, not just intermediate successes.

Fig 4: Simulation metrics dashboard showing cosine similarity scores and differences between human and LLM-generated outputs.

The simulator tracks key metrics including:

  • Cosine similarity between LLM-generated outputs and human-validated outputs
  • Differences that show exactly what changed
  • Statistical breakdowns of changes in performance

This level of detail allows us to quantify the impact of any change to the system, whether it's a prompt adjustment, temperature setting, model switch, or provider change.
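As a sketch of the similarity metric, the snippet below uses TF-IDF vectors and scikit-learn as a simple stand-in for whichever embedding model the simulator actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def output_similarity(llm_output: str, human_validated: str) -> float:
    """Cosine similarity between a simulated output and its human-validated reference."""
    vectors = TfidfVectorizer().fit_transform([llm_output, human_validated])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

# Aggregated across a simulation run, drops in this score after a prompt or
# model change flag a likely regression worth inspecting via the diffs.
```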

Comparative Analysis Against Historical Data

Each simulation run automatically evaluates outputs against:

  • Prior system-generated outputs
  • Human-validated outputs
  • Gold-standard reference outputs

This approach provides comprehensive feedback on whether changes represent actual improvements or regressions, allowing us to make data-driven decisions with confidence.

A/B Testing of Prompts

We believe that structured A/B testing for prompts can provide clear insight into which prompt-engineering choices actually improve outputs. By measuring evaluation pass rates and human satisfaction, we can rapidly iterate toward optimal outputs.
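A minimal sketch of how such a comparison might be scored, assuming each variant's eval results are already collected as booleans; the two-proportion z-test is a standard statistical check, not a description of tooling we've already built.

```python
import math

def compare_prompt_variants(results_a: list[bool], results_b: list[bool]) -> dict:
    """Compare eval pass rates for two prompt variants with a two-proportion z-test."""
    n_a, n_b = len(results_a), len(results_b)
    p_a, p_b = sum(results_a) / n_a, sum(results_b) / n_b
    pooled = (sum(results_a) + sum(results_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se if se > 0 else 0.0
    return {"pass_rate_a": p_a, "pass_rate_b": p_b, "z_score": z}

# Example: if variant B passes evals more often, a |z| above ~1.96 suggests the
# difference is unlikely to be noise at the 5% level.
```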

Automated Prompt Improvement

We are in the early days of automatic prompt improvement. While we've built a powerful human edit analysis system that provides insights, we believe there are opportunities to further automate this process.

Inspired by methodologies like DSPy, we aim to automate the prompt improvement cycle while maintaining the human readability and interpretability of our prompts.

Frameworks like DSPy offer a promising approach to programmatically optimize prompts, but they often sacrifice transparency and human readability in exchange for optimization. Our goal is to strike a balance—leveraging automation where it makes sense, while ensuring our prompts remain:

  • Human readable: Engineers and content experts can understand and manually adjust them when needed
  • Interpretable: Clear enough that we can trace failures to specific prompt elements
  • Modular: Components can be independently optimized or replaced

In our experiments, we use insights from our human edit analysis as input data for automated prompt refinement systems. This approach maintains the strengths of our current process while gradually increasing automation where it provides clear benefits.

Conclusion

Creating a robust evaluation pipeline is critical for safely deploying LLM-driven content. By prioritizing observability, effectively combining eval types, valuing contextual understanding, and harnessing human feedback, we ensure our LLM-generated outputs are consistently high-quality, accurate, and aligned with brand standards.

Our system, strengthened by ongoing iteration and thoughtful engineering, is poised for continuous evolution. As we look ahead, we're excited to push the boundaries of what LLMs can achieve, ensuring they meet exacting standards with every interaction.

LLM-based products are only one piece of the pie at Treater, but they're some of the most fun work we do, given the novelty and difficulty of the space. If you're interested in this or any of the other engineering problems we tackle, reach out to me at [email protected] or connect with me directly on LinkedIn.