When shipping workflows powered by large language models (LLMs) into production, quality assurance is non-negotiable. At Treater, we've developed an LLM workflow evaluation pipeline designed to automate quality control, continuously improve LLM-generated outputs, and confidently deploy high-quality LLM interactions at scale.
This post will cover:
Evaluation System Architecture: Our multi-layered defense system for ensuring quality outputs
Insights That Shaped Our Approach: Critical lessons we've learned, including observability, examples, evaluation approaches, and our Prompt Engineering Studio
Next Steps for Continuous Improvement: Where we're headed next
Conclusion: Key takeaways and future outlook
Our evaluation pipeline is structured to quickly identify and fix issues in LLM-generated content, as well as continuously improve our generation pipeline. Here's how each component works to ensure the system is continuously performing within guardrails and improving its outputs.
Deterministic evals are straightforward, rule-based checks that enforce basic standards:
These rapid checks filter out obvious errors early in the pipeline, acting as a safety net to prevent them from reaching more resource-intensive stages.
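To make that concrete, here's a minimal sketch of the kind of rule-based checks we mean. The specific rules (a length cap, a banned-phrase list, a check for unresolved template placeholders) are illustrative stand-ins rather than our actual ruleset:

```python
import re
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""

# Illustrative rules only; real thresholds and phrase lists are product-specific.
BANNED_PHRASES = ["as an ai language model", "lorem ipsum"]
MAX_CHARS = 1200

def run_deterministic_checks(text: str) -> list[CheckResult]:
    results = []

    # 1. Length guardrail: anything past the limit fails fast.
    results.append(CheckResult(
        "max_length", len(text) <= MAX_CHARS, f"{len(text)} chars (limit {MAX_CHARS})"))

    # 2. Banned-phrase filter: case-insensitive substring match.
    hits = [p for p in BANNED_PHRASES if p in text.lower()]
    results.append(CheckResult("banned_phrases", not hits, ", ".join(hits)))

    # 3. Formatting sanity: no unresolved template placeholders like {customer_name}.
    placeholders = re.findall(r"\{[a-z_]+\}", text)
    results.append(CheckResult("unresolved_placeholders", not placeholders, ", ".join(placeholders)))

    return results

if __name__ == "__main__":
    for result in run_deterministic_checks("Hi {customer_name}, thanks for reaching out!"):
        print(result)
```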
For subtler issues, like tone, clarity, or adherence to specific guidelines, we employ LLM-driven evals. We find LLM-as-a-judge checks to be excellent at verifying the correctness of our generative prompts' outputs:
These evals catch nuanced flaws that deterministic rules might miss, ensuring outputs are not only correct but also aligned with brand guidelines and user expectations.
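As a rough illustration, an LLM-as-a-judge eval can be one focused question per guideline, with the model required to return a verdict plus evidence. In the sketch below, `call_llm` is a placeholder for whatever model client you use, and the guidelines are invented examples:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real model-client call."""
    raise NotImplementedError

JUDGE_TEMPLATE = """You are reviewing a draft message against one guideline.

Guideline: {guideline}

Draft:
{draft}

Answer in JSON with keys "pass" (true or false) and "reasoning"
(cite specific phrases from the draft)."""

# Invented guidelines for illustration.
GUIDELINES = [
    "The tone is conversational, not formal or jargon-heavy.",
    "The message makes exactly one clear ask of the reader.",
]

def llm_eval(draft: str, guideline: str) -> dict:
    raw = call_llm(JUDGE_TEMPLATE.format(guideline=guideline, draft=draft))
    verdict = json.loads(raw)  # expected shape: {"pass": bool, "reasoning": str}
    return {"guideline": guideline, **verdict}

def run_llm_evals(draft: str) -> list[dict]:
    # One focused judge call per guideline keeps failures attributable.
    return [llm_eval(draft, g) for g in GUIDELINES]
```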
The "reasoning" behind failures:
For example, when an eval indicates "reads too formally," the explanation might specify that "technical jargon like 'utilize' and 'implement' creates unnecessary distance from the reader," providing actionable feedback for both immediate fixes and long-term improvements.
When evals fail, our rewriting system automatically revises outputs:
For example, if content is flagged for having too much jargon, the rewriter adjusts it based on previous edit patterns that led to successful content, perhaps simplifying phrasing or using analogies. This creates an efficient, automated correction process.
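A minimal version of that loop might look like the sketch below: run the evals, feed only the failures back to a rewriting prompt, and re-check the result, with a bounded retry count so the safety net can't spin forever. The `run_evals` and `call_llm` callables are stand-ins shaped like the earlier sketches:

```python
from typing import Callable

def rewrite_once(draft: str, failures: list[dict], call_llm: Callable[[str], str]) -> str:
    """Ask the model to revise the draft, citing only the evals that failed."""
    feedback = "\n".join(f"- {f['guideline']}: {f['reasoning']}" for f in failures)
    prompt = (
        "Revise the draft below so it addresses every piece of feedback. "
        "Change as little as possible.\n\n"
        f"Feedback:\n{feedback}\n\nDraft:\n{draft}"
    )
    return call_llm(prompt)

def evaluate_and_rewrite(
    draft: str,
    run_evals: Callable[[str], list[dict]],  # shaped like run_llm_evals above
    call_llm: Callable[[str], str],          # same model-client placeholder as above
    max_rounds: int = 2,
) -> tuple[str, bool]:
    """Returns (final_text, passed). Anything still failing after the
    bounded retries gets escalated rather than silently shipped."""
    for attempt in range(max_rounds + 1):
        failures = [r for r in run_evals(draft) if not r["pass"]]
        if not failures:
            return draft, True
        if attempt < max_rounds:
            draft = rewrite_once(draft, failures, call_llm)
    return draft, False
```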
It's important to note that while our rewriting system is valuable, we view it as a safety net, not a crutch. Our philosophy is that:
This approach prevents the rewriter from masking fundamental issues in our generation prompts. By tracking what gets rewritten and why, we continuously refine our primary generation system, reducing the need for rewrites over time.
Fig 1: Our automated rewriting workflow showing how inputs from multiple sources feed into evaluation, rewriting, and verification.
Human expertise is invaluable for refining LLM outputs. Our human edit analysis system:
For instance, if our team's human edits consistently simplify language, we adjust our prompts to prioritize clarity upfront. This feedback loop ensures our pipelines evolve in line with real-world needs.
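A lightweight way to surface those patterns is to diff each LLM draft against its human-edited final and tally what keeps getting removed. This sketch uses Python's standard `difflib`; the idea that repeated removals of words like 'utilize' signal a prompt change is illustrative:

```python
import difflib
from collections import Counter

def edit_ratio(llm_draft: str, human_final: str) -> float:
    """1.0 means the human changed nothing; lower means heavier editing."""
    return difflib.SequenceMatcher(None, llm_draft, human_final).ratio()

def changed_words(llm_draft: str, human_final: str) -> tuple[list[str], list[str]]:
    """Words the human removed vs. added, from a word-level diff."""
    diff = list(difflib.ndiff(llm_draft.split(), human_final.split()))
    removed = [t[2:] for t in diff if t.startswith("- ")]
    added = [t[2:] for t in diff if t.startswith("+ ")]
    return removed, added

def recurring_removals(pairs: list[tuple[str, str]], top_n: int = 10) -> list[tuple[str, int]]:
    """Across many (draft, final) pairs, which words do humans keep deleting?
    A spike in removals of words like 'utilize' is a signal to change the
    generation prompt itself, not just keep rewriting downstream."""
    counter: Counter[str] = Counter()
    for draft, final in pairs:
        removed, _ = changed_words(draft, final)
        counter.update(word.lower().strip(".,") for word in removed)
    return counter.most_common(top_n)
```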
Our evaluation pipeline fundamentally aims to drive the diff between LLM-generated outputs and human-edited outputs to zero. This mirrors the philosophy behind Reinforcement Learning from Human Feedback (RLHF), but applied to our specific context:
This approach has systematically reduced the gap between LLM-generated and human-quality outputs, with measurable improvements in acceptance rates and decreasing edit volumes over time.
Eventually, we may use this data to refine a pre-trained model. However, this test-time prompt-engineering approach is extremely cost-effective and enables us to iterate and react within minutes.
Building this pipeline has revealed critical lessons about what truly moves the needle in LLM-powered systems.
When building complex evaluation systems, observability must be your first priority.
When we first started, many pipeline issues were only discovered through painful manual reviews because key stages lacked adequate tracking. To resolve this, we implemented:
These tools drastically simplified debugging and optimization. It's now easy for us to see at what point in a multi-step pipeline LLM outputs start deviating from desired results.
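As a sketch of the idea (not our actual tooling), a per-stage tracing decorator that tags every record with a shared `run_id` is enough to reconstruct a multi-call run after the fact:

```python
import functools
import json
import time
import uuid

def traced_stage(stage_name: str):
    """Decorator that records output previews, latency, and errors for one
    pipeline stage, keyed by a shared run_id so an entire multi-call run
    can be reconstructed afterwards."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, run_id: str, **kwargs):
            record = {"run_id": run_id, "stage": stage_name, "start": time.time()}
            try:
                result = fn(*args, **kwargs)
                record.update(status="ok", output_preview=str(result)[:200])
                return result
            except Exception as exc:
                record.update(status="error", error=repr(exc))
                raise
            finally:
                record["latency_s"] = round(time.time() - record.pop("start"), 3)
                print(json.dumps(record))  # in practice: ship to a log store or tracing backend
        return wrapper
    return decorator

@traced_stage("draft_generation")
def generate_draft(brief: str) -> str:
    return f"Draft based on: {brief}"  # placeholder for the real LLM call

if __name__ == "__main__":
    generate_draft("summer promo follow-up", run_id=str(uuid.uuid4()))
```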
We've learned that observability, prompt engineering, and evaluations must function as an integrated system with continuous feedback between components. Isolated measurements provide limited value—the real insights come from tracing the full lifecycle of each output and understanding the relationships between system components.
One approach we tried early on—and subsequently abandoned—was using numeric scoring systems for evaluations (e.g., "Rate this output's clarity from 1-10"). In practice, these scores proved problematic:
Instead, we now design all our evaluations as binary pass/fail tests with clear criteria and required explanations. For example, rather than "Rate jargon 1-10," we ask "Is this output jargon-heavy? (yes/no). Explain your reasoning with specific examples."
This approach yields more consistent, interpretable, and actionable results that directly inform our rewriting and improvement processes.
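In code terms, every eval in the system is held to the same output shape: a boolean verdict plus a required explanation. A sketch of that contract, with unexplained verdicts rejected outright, might look like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalVerdict:
    check: str
    passed: bool    # binary by design: no 1-10 scores to calibrate or average
    reasoning: str  # must cite specific evidence from the output

def parse_verdict(check: str, raw: dict) -> EvalVerdict:
    """Reject judge responses that skip the explanation; an unexplained
    pass/fail is treated as an eval error, not silently accepted."""
    if not isinstance(raw.get("pass"), bool):
        raise ValueError(f"{check}: judge did not return a boolean 'pass' field")
    reasoning = (raw.get("reasoning") or "").strip()
    if not reasoning:
        raise ValueError(f"{check}: judge returned no reasoning")
    return EvalVerdict(check, raw["pass"], reasoning)
```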
Incorporating contextual awareness, such as previous interactions with our customers, significantly improved evaluation accuracy and the relevance of LLM-generated content. Initially, evaluating outputs in isolation led to irrelevant results. By including prior context, we boosted coherence and ensured outputs were appropriate for the entire interaction.
The challenge wasn't in the prompting technique itself, but in designing systems that intelligently:
For example, when asking for information about a specific product's recall process, our system now identifies and provides examples of previous, similar queries, dramatically improving first-pass quality.
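One way to implement that retrieval step, sketched below with an `embed` stand-in for whatever embedding model you use, is to pull the k most similar past interactions and inline them as worked examples in the generation prompt:

```python
import math
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_similar_examples(
    query: str,
    history: list[dict],                  # e.g. {"query": ..., "final_response": ...}
    embed: Callable[[str], list[float]],  # stand-in for the embedding model
    k: int = 3,
) -> list[dict]:
    """Pick the k past interactions most similar to the current query.
    (For clarity this embeds history on the fly; a real system would cache vectors.)"""
    q_vec = embed(query)
    return sorted(history, key=lambda h: cosine(q_vec, embed(h["query"])), reverse=True)[:k]

def build_prompt(query: str, examples: list[dict]) -> str:
    shots = "\n\n".join(
        f"Previous query: {e['query']}\nResponse that worked: {e['final_response']}"
        for e in examples
    )
    return (
        f"{shots}\n\nCurrent query: {query}\n"
        "Draft a response consistent with the examples above."
    )
```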
We're committed to evolving our pipeline. Here are our strategic focus areas, which we are already working on:
Building on our emphasis on observability, we developed a tooling suite for our LLM pipelines, which involve an average of 8-10 interconnected LLM calls across 10 different prompts for first-pass generation alone. This tooling is essential because even minor prompt variations can compound, significantly affecting final outputs.
Our Prompt Engineering Studio serves as both an observability platform and evaluation environment that includes:
Rather than evaluating isolated prompts, our simulator tests entire previously-executed pipeline runs as a unified system. The simulator can be run with a subset of prompts or tools modified to behave differently than the original version, and tested against outputs that have previously been validated.
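Conceptually, a replay looks something like the sketch below: stages named in the override set use the modified prompt, and each downstream stage consumes the new upstream output, so the effect of a single change propagates through the whole run and can be compared against the previously validated outputs. The record format and `execute_stage` callable are assumptions for illustration:

```python
from typing import Callable

def replay_run(
    recorded_run: dict,                        # {"initial_input": ..., "stages": [{"name", "prompt", "validated_output"}, ...]}
    prompt_overrides: dict[str, str],          # stage name -> modified prompt
    execute_stage: Callable[[str, str], str],  # (prompt, stage_input) -> output
) -> dict[str, dict]:
    """Re-run a recorded pipeline end to end. Overridden stages use the modified
    prompt, and each downstream stage consumes the *new* upstream output, so the
    effect of one change propagates through the whole run and can be compared
    against the previously validated outputs stage by stage."""
    results: dict[str, dict] = {}
    carry = recorded_run["initial_input"]
    for stage in recorded_run["stages"]:
        prompt = prompt_overrides.get(stage["name"], stage["prompt"])
        carry = execute_stage(prompt, carry)
        results[stage["name"]] = {
            "new_output": carry,
            "validated_output": stage["validated_output"],
            "prompt_overridden": stage["name"] in prompt_overrides,
        }
    return results
```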
This holistic approach is crucial because it provides visibility into:
Fig 4: Simulation metrics dashboard showing cosine similarity scores and differences between human and LLM-generated outputs.
The simulator tracks key metrics including:
This level of detail allows us to quantify the impact of any change to the system, whether it's a prompt adjustment, temperature setting, model switch, or provider change.
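For instance, a similarity score between each stage's new output and its previously validated output can be rolled up into per-run summary numbers, where a drop after a prompt change flags a likely regression (again, `embed` is a stand-in for the embedding model):

```python
import math
import statistics
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_simulation(results: dict[str, dict], embed: Callable[[str], list[float]]) -> dict:
    """Compare each stage's new output against the previously validated output.
    A drop in the mean or minimum similarity after a change flags a likely
    regression that deserves a human look."""
    sims = {
        name: cosine(embed(r["new_output"]), embed(r["validated_output"]))
        for name, r in results.items()
    }
    return {
        "per_stage": sims,
        "mean_similarity": statistics.mean(sims.values()),
        "min_similarity": min(sims.values()),
    }
```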
Each simulation run automatically evaluates outputs against:
This approach provides comprehensive feedback on whether changes represent actual improvements or regressions, allowing us to make data-driven decisions with confidence.
We believe that implementing structured A/B testing for prompts can provide clear insight into which prompt-engineering choices actually work. By measuring evaluation pass rates and human satisfaction, we can rapidly iterate toward optimal outputs.
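As a sketch of what that measurement could look like, a two-proportion z-test over eval pass rates is enough to tell whether a prompt variant's improvement is more than noise (the counts below are made up):

```python
import math

def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """z-statistic for the difference in eval pass rates between prompt variants A and B."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se else 0.0

# Made-up counts: variant B passes evals 88% of the time vs. 81% for A.
z = two_proportion_z(pass_a=810, n_a=1000, pass_b=880, n_b=1000)
print(f"z = {z:.2f}")  # |z| > 1.96 is significant at the 5% level (two-sided)
```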
We are in the early days of automatic prompt improvement. While we've built a powerful human edit analysis system that provides insights, we believe there are opportunities to further automate this process.
Frameworks like DSPy offer a promising approach to programmatically optimize prompts, but they often sacrifice transparency and human readability in exchange for optimization. Our goal is to strike a balance—leveraging automation where it makes sense, while ensuring our prompts remain:
In our experiments, we use insights from our human edit analysis as input data for automated prompt refinement systems. This approach maintains the strengths of our current process while gradually increasing automation where it provides clear benefits.
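One shape this experiment takes, sketched here with illustrative names, is a meta-prompt that hands the recurring human-edit patterns to a model and asks for a minimally changed, still-readable revision of the prompt, which then has to win in simulation before it ships:

```python
from typing import Callable

def propose_prompt_revision(
    current_prompt: str,
    edit_insights: list[str],        # output of the human-edit analysis, e.g.
                                     # "Reviewers replace 'utilize' with 'use' in ~40% of drafts"
    call_llm: Callable[[str], str],  # same model-client placeholder as earlier sketches
) -> str:
    """Ask a model for a minimally changed, still-readable revision of the prompt
    that would make the recurring human edits unnecessary. The proposal is only a
    candidate: it still has to beat the current prompt in simulation and on eval
    pass rates before it ships."""
    insights = "\n".join(f"- {i}" for i in edit_insights)
    meta_prompt = (
        "You maintain a production prompt. Human reviewers keep making the edits "
        "summarized below. Propose a minimally changed revision of the prompt that "
        "would make those edits unnecessary. Keep the prompt readable.\n\n"
        f"Recurring human edits:\n{insights}\n\nCurrent prompt:\n{current_prompt}"
    )
    return call_llm(meta_prompt)
```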
Creating a robust evaluation pipeline is critical for safely deploying LLM-driven content. By prioritizing observability, effectively combining eval types, valuing contextual understanding, and harnessing human feedback, we ensure our LLM-generated outputs are consistently high-quality, accurate, and aligned with brand standards.
Our system, strengthened by ongoing iteration and thoughtful engineering, is poised for continuous evolution. As we look ahead, we're excited to push the boundaries of what LLMs can achieve, ensuring they meet exacting standards with every interaction.
LLM-based products are only one piece of the pie at Treater, but they're some of the most fun work we do, given the novelty and difficulty of the space. If you're interested in this or any of the other engineering problems we tackle, reach out to me at [email protected] or connect with me directly on LinkedIn.