Component-Level Evaluation for LLM Apps With One Python Decorator — No Code Refactoring Needed
Most LLM evaluations treat the entire application as a black box and test it only end to end, which makes it hard to pinpoint problems in internal components such as retrieval, tool calling, or the model itself. DeepEval, an open-source LLM evaluation framework, adds component-level tracing and evaluation with a single Python decorator, with no refactoring of your existing code. It ships dozens of evaluation metrics covering RAG, agents, multi-turn conversations, and multimodal applications, integrates with the mainstream LLM development frameworks, has more than 15k GitHub stars, and supports private deployment so evaluation data can stay on your own infrastructure.
Anyone who has deployed an LLM application to production has almost certainly encountered this problem: end-to-end testing shows poor performance, but you can’t tell if the issue is bad context pulled by the retriever, wrong parameters in a tool call, or hallucinations from the large model itself. Troubleshooting ends up being entirely guesswork.
Most existing LLM evaluation solutions treat the entire application as a black box, only judging the quality of the final output. It’s like getting only a total score on an exam without knowing which questions you got wrong — you can’t even find where to start optimizing.
This is the problem DeepEval sets out to solve. The open-source framework needs only one core Python decorator to enable component-level tracing across the whole application, in about three lines of code, and you do not have to refactor your existing business logic.

Usage is straightforward: add the `@observe` decorator to each component you want to track (retrieval functions, tool-calling logic, LLM generation functions, and so on), attach the relevant evaluation metrics to each component, and run your application. You get a per-component report that shows the score and any issues for every step.
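To make the idea concrete, here is a minimal, self-contained sketch of how a decorator can record and score each component without touching the function bodies. It is an illustration only and not DeepEval's implementation: `observe_sketch`, `Span`, `TRACE`, and the toy metrics are hypothetical names invented for this example, and the pipeline functions are stubs.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a tracing decorator; NOT DeepEval's implementation.
# A metric here is any callable (positional_args, output) -> score in [0, 1].
ScoreFn = Callable[[tuple, Any], float]


@dataclass
class Span:
    component: str   # name of the wrapped function
    inputs: tuple    # positional arguments of the call
    output: Any      # value the function returned
    seconds: float   # wall-clock duration of the call
    scores: Dict[str, float] = field(default_factory=dict)


TRACE: List[Span] = []  # spans are appended in order of completion


def observe_sketch(**metrics: ScoreFn):
    """Record and score every call of the wrapped function; its body is unchanged."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            scores = {name: score(args, output) for name, score in metrics.items()}
            TRACE.append(Span(fn.__name__, args, output, elapsed, scores))
            return output
        return wrapper
    return decorator


# Toy metrics, deliberately trivial so the example stays deterministic.
def non_empty(inputs: tuple, output: Any) -> float:
    return 1.0 if output else 0.0


def mentions_refund(inputs: tuple, output: Any) -> float:
    return 1.0 if "refund" in output.lower() else 0.0


@observe_sketch(non_empty=non_empty)
def retrieve(query: str) -> List[str]:
    return ["All users are eligible for a 30-day free full refund"]


@observe_sketch(mentions_refund=mentions_refund)
def generate(query: str, context: List[str]) -> str:
    return f"Based on our policy: {context[0]}"


@observe_sketch()  # traced, but no metric attached
def answer(query: str) -> str:
    return generate(query, retrieve(query))


if __name__ == "__main__":
    answer("What do I do if my shoes don't fit?")
    for span in TRACE:
        print(f"{span.component}: {span.scores}")
```

Because each component gets its own span and its own scores, a low score on `retrieve` can be told apart from a low score on `generate`, which is the distinction end-to-end testing cannot make.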
DeepEval's GitHub repository is at https://github.com/confident-ai/deepeval. It has more than 15k stars, is fully open source, and supports private deployment, so all of your data can stay on your own servers.
DeepEval positions itself as Pytest for LLM applications: a unit testing framework purpose-built for LLM apps. It incorporates recent evaluation research, provides dozens of out-of-the-box metrics, and covers most mainstream LLM application scenarios:
- General purpose: G-Eval for custom evaluation criteria, hallucination detection, bias detection, toxicity detection, JSON schema validation, and more
- RAG scenarios: answer relevance, factual consistency, context recall, context precision, and more
- Agent scenarios: task completion rate, tool calling accuracy, step efficiency, plan alignment, and more
- Multi-turn conversation scenarios: knowledge retention, conversation completeness, role consistency, and more
- Multimodal scenarios: text-to-image quality, image-text alignment, image usefulness, and more
- MCP scenarios: MCP service usage rate, task completion rate, and more
Every metric can be scored by an LLM judge of your choice or computed with an NLP model running locally, so you are not tied to any specific LLM provider.
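The judge-agnostic design can be pictured as a metric that accepts any scoring callable. The sketch below is a simplified illustration under that assumption, not DeepEval's API: `Judge`, `JudgedMetric`, and `token_overlap_judge` are hypothetical names, and the local judge uses plain word overlap rather than a real NLP model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a judge maps (criteria, actual, expected) to a score in [0, 1].
# An LLM-backed judge and a local model would both satisfy this same signature.
Judge = Callable[[str, str, str], float]


def token_overlap_judge(criteria: str, actual: str, expected: str) -> float:
    """Local, deterministic judge: Jaccard overlap of lowercase word sets."""
    actual_tokens = set(actual.lower().split())
    expected_tokens = set(expected.lower().split())
    union = actual_tokens | expected_tokens
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(actual_tokens & expected_tokens) / len(union)


@dataclass(frozen=True)
class JudgedMetric:
    name: str
    criteria: str
    judge: Judge
    threshold: float = 0.5

    def measure(self, actual: str, expected: str) -> float:
        score = self.judge(self.criteria, actual, expected)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge returned {score}, expected a value in [0, 1]")
        return score

    def passes(self, actual: str, expected: str) -> bool:
        return self.measure(actual, expected) >= self.threshold


if __name__ == "__main__":
    metric = JudgedMetric(
        name="Correctness",
        criteria="Judge whether the actual output matches the expected output",
        judge=token_overlap_judge,
    )
    print(metric.measure("30 day refund", "30 day full refund"))
```

Swapping providers then means passing a different `judge` callable; the metric definition, its criteria, and its threshold stay the same.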

In terms of compatibility, DeepEval integrates seamlessly with almost all mainstream LLM development frameworks, including OpenAI, Anthropic, LangChain, LangGraph, CrewAI, LlamaIndex, Pydantic AI, and more. No matter what tech stack your current project uses, you can integrate it quickly. It also supports integration into CI/CD pipelines, running evaluation automatically on every code commit to prevent performance regression.
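As a rough illustration of what a regression gate in a pipeline can look like, the sketch below compares the scores from the current run with a stored baseline and reports a non-zero exit code when any metric drops. Everything in it is an assumption made for the example: the function names, the `tolerance` parameter, and the idea that scores are available as simple name-to-score dictionaries. It does not describe how DeepEval itself stores or compares results.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the metrics whose score fell by more than `tolerance` or went missing."""
    regressions = []
    for name, old_score in sorted(baseline.items()):
        new_score = current.get(name)
        if new_score is None or new_score < old_score - tolerance:
            regressions.append(name)
    return regressions


def exit_code(baseline: Dict[str, float], current: Dict[str, float]) -> int:
    """Map the comparison to a process exit code: 0 means pass, 1 means regression."""
    failed = find_regressions(baseline, current)
    for name in failed:
        print(f"REGRESSION {name}: {baseline[name]} -> {current.get(name)}")
    return 1 if failed else 0


if __name__ == "__main__":
    # Illustrative numbers only; a real pipeline would load both from evaluation runs.
    baseline_scores = {"retrieval": 0.85, "generation": 0.90}
    current_scores = {"retrieval": 0.70, "generation": 0.91}
    print("exit code:", exit_code(baseline_scores, current_scores))
```

A CI step would pass the returned value to `sys.exit`, so a commit that degrades any component fails the build and names the component that regressed.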
Besides the locally running framework, there is a companion cloud platform, Confident AI, where you can centrally manage test datasets, inspect full traces, and generate shareable test reports. It also provides an MCP server that exposes evaluation capabilities directly in editors such as Cursor and Claude Code, so you can test, debug, and optimize without switching tools. If you prefer not to use cloud services, all features can run fully locally to meet compliance requirements.

Getting started is also very simple — just three steps:
1. Install: run `pip install -U deepeval` (requires Python 3.9 or later).
2. Write test cases: the structure is the same as a regular Pytest case. Define the input, the actual output, the expected output, and the evaluation metrics to apply. Example below:
```python
from deepeval import assert_test
from deepeval.metrics import GEval
# If your installed release has no SingleTurnParams, the enum may be named LLMTestCaseParams.
from deepeval.test_case import LLMTestCase, SingleTurnParams


def test_case():
    # G-Eval metric: an LLM judge scores the output against custom criteria.
    correctness_metric = GEval(
        name="Correctness",
        criteria="Judge whether the actual output matches the expected output",
        evaluation_params=[
            SingleTurnParams.ACTUAL_OUTPUT,
            SingleTurnParams.EXPECTED_OUTPUT,
        ],
        threshold=0.5,  # minimum score required to pass
    )
    test_case = LLMTestCase(
        input="What do I do if my shoes don't fit?",
        # Replace with the real output of your LLM application.
        actual_output="You can get a full free refund within 30 days",
        expected_output="We offer a 30-day free full refund policy",
        retrieval_context=["All users are eligible for a 30-day free full refund"],
    )
    # Fails the test if the metric scores below its threshold.
    assert_test(test_case, [correctness_metric])
```
3. Run tests: execute `deepeval test run` followed by the path to your test file, for example `deepeval test run test_chatbot.py` (the file name here is illustrative), to get your test results.
For an application that is already written, you only need to add the `@observe` decorator to the relevant functions, without modifying their internal logic. DeepEval then collects runtime data for each step and runs the component-level evaluation. If you do not want to integrate with Pytest, you can call the evaluation interface directly in a notebook environment, or use any individual metric on its own.
Published: 2026-05-23 00:35