Promptfoo: LLM Testing Tool

By Xavier Collantes

10/1/2025


A massive problem I found in working with LLMs is testing prompts. You can write a prompt, feed it to an LLM, and eyeball whether it works. But which model is best? Will you go to each model and test it by hand? What metrics should you use?
Promptfoo is an open-source CLI tool that makes this process systematic. If you're ready to graduate from "yep, looks good to me" to testing against a defined set of metrics, Promptfoo does it for you.

What Promptfoo Does

Promptfoo runs your prompts against test cases, compares outputs across different models, and scores results automatically. Think of it as unit testing, but for LLM prompts.

Testing LLMs is different from testing traditional software because LLMs are non-deterministic: outputs can vary between runs even with the same input, model, and temperature.
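Because exact string matching breaks easily under that variability, it helps to assert on properties of the output instead. A minimal sketch using the same assertion types that appear later in this article (the question, provider, and threshold values are illustrative):

YAML
# Minimal sketch: assert on properties of the output rather than exact strings,
# since wording can change between runs.
prompts:
  - "Answer the student's question: {{question}}"

providers:
  - openai:gpt-4.1-mini

tests:
  - vars:
      question: When is the application deadline?
    assert:
      # Case-insensitive substring check tolerates rephrasing.
      - type: icontains
        value: deadline
      # Cap on length instead of an exact match.
      - type: javascript
        value: output.length < 500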

Metrics

Promptfoo provides several metrics to help you judge how well each LLM performs with your prompts (a short sketch of how some of these show up as assertions follows the list).
  • Similarity
  • Precision
  • Recall
  • Cost (in terms of LLM tokens)
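Here is a hedged sketch of two of these expressed as assertions: `similar` scores embedding similarity against a reference answer, and `cost` caps the estimated spend per output. Both assertion types are in the Promptfoo docs, but treat the values below as placeholders, not recommendations.

YAML
assert:
  # Embedding similarity against a placeholder reference answer
  # (not an actual Gonzaga deadline).
  - type: similar
    value: Applications are due by February 1.
    threshold: 0.8

  # Fail if a single output costs more than this many dollars.
  - type: cost
    threshold: 0.002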

Installation

Bash
# Via npm.
npm install -g promptfoo

# Or via Homebrew on macOS.
brew install promptfoo

Basic Setup

Create a promptfooconfig.yaml in your project:

We'll be using an example chatbot for Gonzaga University's Admission Office tasked with answering questions about the university. Go Zags!

promptfooconfig.yaml

YAML
description: "Eval for a customer service chatbot for Gonzaga University's Admissions."

prompts:
  # You can use files.
  # - file://../prompts/service_prompt.yaml
  - |
    You are a customer service chatbot for Gonzaga University Admissions. You are given a potential student's question and you need to answer it.
    Refer to https://www.gonzaga.edu/undergraduate-admission/apply/how-to-apply

    {{question}}

providers:
  - google:gemini-2.5-flash-lite

  # The `.env` file will be loaded automatically.
  # But you can also specify the API key here.
  - id: google:gemini-2.5-flash
    config:
      apiKey: ${GOOGLE_API_KEY}

  - openai:gpt-4.1-mini

# Assertions applied to every test.
defaultTest:
  # For more information on assertions, see https://promptfoo.dev/docs/configuration/expected-outputs
  assert:
    # Ensure the assistant doesn't mention being an AI.
    - type: llm-rubric
      value: Do not mention that you are an AI or chat assistant

    # JavaScript as logic. Prefer shorter outputs using a scoring function.
    - type: javascript
      value: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));

# Configure evaluation options.
evaluateOptions:
  # Useful to not get rate limited by the provider API.
  delay: 1000

# Set up test cases. Each vars is a different test case.
tests:
  # Simple input.
  - vars:
      question: What is the deadline for submitting my application?

  - vars:
      question: What is life like on campus?

    # Per-test assertions.
    assert:
      # Output must contain the word "campus".
      - type: icontains
        value: campus
      - type: icontains
        value: activities

  - vars:
      question: What is the cost of attendance?
    assert:
      # For more information on model-graded evals, see https://promptfoo.dev/docs/configuration/expected-outputs/model-graded
      - type: llm-rubric
        value: ensure that the output is a number
Run it:
Bash
promptfoo eval

If you get errors with this command, make sure your shell is resolving the correct `promptfoo` binary. Check its location with `which promptfoo`.

Notice that the output shows a comparison across prompts and providers, with each result marked as pass, fail, or error. The matrix format makes it easier to scan the results.
You can also view the results in the web UI:
Bash
promptfoo view
The web view displays token usage and grading based on the assertions you set.

LLM Grading LLMs

For criteria that are hard to express as exact checks, the llm-rubric assertion asks another LLM to grade the output against a written rubric:

YAML
assert:
  - type: llm-rubric
    value: "Response is professional and under 100 words"
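Model-graded assertions use Promptfoo's default grader unless you say otherwise. A hedged sketch of pointing the grader at one of the providers configured earlier, via the options.provider override described in the model-graded docs linked in the config above (verify the exact key against your Promptfoo version):

YAML
defaultTest:
  # Which model grades llm-rubric assertions; without this, Promptfoo
  # falls back to its default grader. Key name per the model-graded docs.
  options:
    provider: google:gemini-2.5-flash
  assert:
    - type: llm-rubric
      value: "Response is professional and under 100 words"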

JavaScript Assertions for Custom Logic

For complex logic, you can use JavaScript:
YAML
assert:
  - type: javascript
    value: |
      output.length < 100 && output.includes('discount') && !output.includes('refund')
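A JavaScript assertion can also return more than a boolean. The sketch below returns an object with pass, score, and reason fields, which follows Promptfoo's grading-result convention; confirm the exact shape against the docs for your version:

YAML
assert:
  - type: javascript
    value: |
      // Multi-line body: compute partial credit and explain the grade.
      const short = output.length < 100;
      const onTopic = output.toLowerCase().includes('discount');
      return {
        pass: short && onTopic,
        score: (short ? 0.5 : 0) + (onTopic ? 0.5 : 0),
        reason: 'length=' + output.length + ', mentions discount=' + onTopic,
      };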

$$$: Comparing Models Cost-Effectively

Real-life use case: finding the cheapest model that meets quality standards.

YAML
providers:
  - openai:gpt-4 # $0.03/1k tokens (estimated at time of writing)
  - openai:gpt-3.5-turbo # $0.0015/1k tokens
  - anthropic:claude-3-haiku-20240307 # $0.00025/1k tokens
Run the eval, then check the results. If Claude Haiku passes 90% of tests and costs 100x less than GPT-4, you have data to make the switch.
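One way to make that comparison concrete is to hold every provider to the same quality bar while capping per-output spend. A hedged sketch (the rubric wording and threshold are placeholders; verify the cost assertion's units against the Promptfoo docs):

YAML
providers:
  - openai:gpt-4
  - openai:gpt-3.5-turbo
  - anthropic:claude-3-haiku-20240307

defaultTest:
  assert:
    # Quality bar every model must clear.
    - type: llm-rubric
      value: Answers the question accurately and professionally
    # Rough per-output spend cap in dollars (placeholder value).
    - type: cost
      threshold: 0.001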

Integration with CI/CD

Add Promptfoo to your test pipeline:

.github/workflows/test-prompts.yml

YAML
name: Test LLM Prompts

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
      - run: npm install -g promptfoo
      # Runs the eval; the job fails if any assertions fail.
      - run: promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Now every prompt change gets tested before merge.
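If you also want the results attached to the run, you can add --output to the eval step and upload the file as a build artifact. A hedged sketch of the two steps (the --output flag is a documented Promptfoo option; the artifact name is arbitrary):

YAML
      # Eval step from above, now writing results to a file.
      - run: promptfoo eval --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      # Attach the results file to the workflow run for later inspection.
      - uses: actions/upload-artifact@v4
        with:
          name: promptfoo-results
          path: results.json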

What I Learned

Version control your prompts: Keep prompts in separate files like prompts/admission_prompt.txt. This makes it easy to compare versions and roll back if needed (see the sketch after these tips).
Start with simple assertions: Begin with contains and not-contains. Add LLM grading later, once you have baseline tests passing. As always, adding LLMs adds cost and complexity to your system.
Test edge cases: Empty inputs, very long inputs, special characters. These always expose issues.
Run regularly: Unlike conventional unit tests for code, prompt performance can drift as models get updated.
Use cost metrics: Promptfoo can show estimated costs for each test run. This helps justify model choices to stakeholders.
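A hedged sketch tying the first and third tips together: the prompt lives in its own file (the file:// syntax is the same one commented out in the config above; the path is illustrative), and a few edge-case inputs ride along as ordinary test cases:

YAML
prompts:
  # Prompt kept under version control as its own file (illustrative path).
  - file://prompts/admission_prompt.txt

tests:
  # Edge case: empty input.
  - vars:
      question: ""

  # Edge case: very long input (use a genuinely long string in practice).
  - vars:
      question: "I have a question about admissions, and also about housing, and also about financial aid, and also about transfer credits, and also about campus visits."

  # Edge case: special characters and markup.
  - vars:
      question: "What about the café & the <b>dorms</b>? 🙂"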

Alternatives

LangSmith: More comprehensive, but it is built around the LangChain ecosystem and has a learning curve.
Manual testing: Works for MVPs but doesn't scale and is only as good as "hmm, looks good to me".
Custom scripts: Don't reinvent the wheel. Promptfoo does this well.
