
Anyone who has deployed an LLM application to production has almost certainly encountered this problem: end-to-end testing shows poor performance, but you can’t tell if the issue is bad context pulled by the retriever, wrong parameters in a tool call, or hallucinations from the large model itself. Troubleshooting ends up being entirely guesswork.

Most existing LLM evaluation solutions treat the entire application as a black box, only judging the quality of the final output. It’s like getting only a total score on an exam without knowing which questions you got wrong — you can’t even find where to start optimizing.

This is exactly the problem DeepEval solves. This open-source LLM evaluation framework centers on a single Python decorator: about three lines of code enable component-level tracking across the whole stack, with no refactoring of your existing business code.

![DeepEval code and evaluation interface demo](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHI6eGW5aoAARN0J%3Fformat%3Djpg%26name%3Dlarge)

Usage is straightforward: add the `@observe` decorator to every component you want to track (retrieval functions, tool-calling logic, LLM generation functions, and so on) and bind the relevant evaluation metrics to each one. After running your application, you get a visual, per-component report showing the score and any issues at every step.
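To make the idea concrete, here is a minimal, standard-library-only sketch of what a tracing decorator of this kind does conceptually. It is not DeepEval's implementation or API: the names `observe_sketch`, `SPANS`, and the two toy metrics are hypothetical, and a real metric would typically call an LLM judge rather than run a simple string check.

```python
import functools
from typing import Any, Callable

# Hypothetical in-memory trace: one record per observed component call.
SPANS: list[dict[str, Any]] = []

# A metric maps (call args, output) to a score in [0, 1].
Metric = Callable[[tuple, Any], float]


def observe_sketch(name: str, metrics: dict[str, Metric]):
    """Toy stand-in for a tracing decorator; NOT DeepEval's @observe."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            output = fn(*args)
            # Score this component in isolation and record the result.
            scores = {m: metric(args, output) for m, metric in metrics.items()}
            SPANS.append({"component": name, "output": output, "scores": scores})
            return output
        return wrapper
    return decorator


def has_context(args, output) -> float:
    # Toy retrieval metric: 1.0 if anything was retrieved.
    return 1.0 if output else 0.0


def mentions_refund(args, output) -> float:
    # Toy generation metric: 1.0 if the answer mentions a refund.
    return 1.0 if "refund" in output.lower() else 0.0


@observe_sketch("retriever", {"has_context": has_context})
def retrieve(query: str) -> list[str]:
    return ["All users are eligible for a 30-day free full refund"]


@observe_sketch("generator", {"mentions_refund": mentions_refund})
def generate(query: str, context: list[str]) -> str:
    return "You can return them within 30 days."  # deliberately omits "refund"


def answer(query: str) -> str:
    # The business logic is unchanged; only the decorators were added.
    return generate(query, retrieve(query))


if __name__ == "__main__":
    answer("What do I do if my shoes don't fit?")
    for span in SPANS:
        print(span["component"], span["scores"])
```

The per-component records show the retriever passing and the generator failing, a distinction that a single end-to-end score would hide.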

DeepEval's GitHub repository is at https://github.com/confident-ai/deepeval. It currently has over 15k stars, is fully open source, and supports private deployment, so all your data can stay on your own servers.


DeepEval positions itself as Pytest for LLM applications: a unit testing framework purpose-built for LLM apps. It incorporates recent evaluation research from academia, provides dozens of out-of-the-box evaluation metrics, and covers almost all mainstream LLM application scenarios:

- General purpose: G-Eval for custom evaluation criteria, hallucination detection, bias detection, toxicity detection, JSON schema validation, and more

- RAG scenarios: answer relevance, factual consistency, context recall, context precision, and more

- Agent scenarios: task completion rate, tool calling accuracy, step efficiency, plan alignment, and more

- Multi-turn conversation scenarios: knowledge retention, conversation completeness, role consistency, and more

- Multimodal scenarios: text-to-image quality, image-text alignment, image usefulness, and more

- MCP scenarios: MCP service usage rate, task completion rate, and more

All metrics can be scored by any large language model acting as a judge, or computed with a locally running NLP model, so you are not tied to any specific LLM provider.
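One way to picture this judge-agnostic design is a metric that accepts any scoring callable. The sketch below is illustrative only and uses hypothetical names (`ThresholdMetric`, `token_overlap_judge`); it does not reproduce DeepEval's classes. A simple token-overlap function stands in for a locally running model, and an LLM-backed callable with the same signature could be swapped in.

```python
from dataclasses import dataclass
from typing import Callable

# A judge maps (actual_output, expected_output) to a score in [0, 1].
Judge = Callable[[str, str], float]


def token_overlap_judge(actual: str, expected: str) -> float:
    """Local stand-in judge: Jaccard overlap of lowercase tokens."""
    a, e = set(actual.lower().split()), set(expected.lower().split())
    return len(a & e) / len(a | e) if a | e else 1.0


@dataclass
class ThresholdMetric:
    """Hypothetical metric: passes when the judge's score meets the threshold."""
    name: str
    judge: Judge
    threshold: float = 0.5

    def passes(self, actual: str, expected: str) -> bool:
        return self.judge(actual, expected) >= self.threshold
```

Because the metric depends only on the `Judge` signature, switching providers means passing a different callable rather than rewriting the metric.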

Figure: DeepEval running demo

In terms of compatibility, DeepEval integrates seamlessly with almost all mainstream LLM development frameworks, including OpenAI, Anthropic, LangChain, LangGraph, CrewAI, LlamaIndex, Pydantic AI, and more. No matter what tech stack your current project uses, you can integrate it quickly. It also supports integration into CI/CD pipelines, running evaluation automatically on every code commit to prevent performance regression.
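In a pipeline, regression protection boils down to failing the build when any metric drops below its threshold. The following is a minimal sketch of that gating logic, with hypothetical function and metric names; in practice the test runner computes the scores and sets the exit status itself.

```python
def find_regressions(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics whose score fell below the threshold."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum  # a missing score counts as a failure
    )


def exit_code(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Report regressions and return a process exit status for the CI job."""
    failed = find_regressions(scores, thresholds)
    for name in failed:
        print(f"REGRESSION: {name} = {scores.get(name)} < {thresholds[name]}")
    return 1 if failed else 0  # a non-zero status fails the job
```

A CI step could finish with `sys.exit(exit_code(scores, thresholds))`, so a commit that lowers any score below its threshold blocks the merge.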

In addition to the locally running framework, DeepEval comes with a companion cloud platform, Confident AI, that lets you centrally manage test datasets, view full-stack traces, and generate shareable test reports. It also provides an MCP server that lets you call evaluation capabilities directly from editors like Cursor and Claude Code, so you can complete the full test, debug, and optimize workflow without switching interfaces. For users who prefer not to use cloud services, all features can run fully locally to meet compliance requirements.

Figure: Confident AI MCP architecture diagram

Getting started is also very simple — just three steps:

1. Install: run `pip install -U deepeval`. Python 3.9 or later is supported.

2. Write test cases: they follow the same pattern as regular Pytest cases. Define the input, the actual output, the expected output, and the metrics to evaluate them with. For example:

```python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, SingleTurnParams


def test_case():
    # G-Eval metric: an LLM judge scores the actual output against the expected output.
    correctness_metric = GEval(
        name="Correctness",
        criteria="Judge whether the actual output matches the expected output",
        evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],
        threshold=0.5,
    )
    # In a real test, actual_output comes from your LLM application.
    test_case = LLMTestCase(
        input="What do I do if my shoes don't fit?",
        actual_output="You can get a full free refund within 30 days",
        expected_output="We offer a 30-day free full refund policy",
        retrieval_context=["All users are eligible for a 30-day free full refund"],
    )
    # Fails the test if the metric score falls below the threshold.
    assert_test(test_case, [correctness_metric])
```

3. Run tests: run `deepeval test run` against your test file (for example, `deepeval test run test_example.py`, where the filename is a placeholder) to get your test results

For an existing application, you only need to add the `@observe` decorator to the relevant functions, without modifying any of their internal logic. DeepEval then automatically collects runtime data for each step and performs component-level evaluation. If you prefer not to integrate with Pytest, you can also call the evaluation interface directly in a notebook environment, or use any individual evaluation metric on its own.

Wink Pings

Component-Level Evaluation for LLM Apps With One Python Decorator — No Code Refactoring Needed