By Xavier Collantes
10/1/2025
Testing LLMs is different from testing traditional software because LLM outputs are non-deterministic: the same input, model, and temperature can produce different outputs from one session to the next.
Create a `promptfooconfig.yaml` in your project. We'll be using an example chatbot for Gonzaga University's Admissions Office, tasked with answering questions about the university. Go Zags!
promptfooconfig.yaml
```yaml
description: "Eval for a customer service chatbot for Gonzaga University's Admissions."

prompts:
  # You can use files.
  # - file://../prompts/service_prompt.yaml
  - |
    You are a customer service chatbot for Gonzaga University Admissions. You are given a potential student's question and you need to answer it.
    Refer to https://www.gonzaga.edu/undergraduate-admission/apply/how-to-apply

    {{question}}

providers:
  - google:gemini-2.5-flash-lite

  # The `.env` file will be loaded automatically,
  # but you can also specify the API key here.
  - id: google:gemini-2.5-flash
    config:
      apiKey: ${GOOGLE_API_KEY}

  - openai:gpt-4.1-mini

# Assertions applied to every test.
defaultTest:
  # For more information on assertions, see https://promptfoo.dev/docs/configuration/expected-outputs
  assert:
    # Ensure the assistant doesn't mention being an AI.
    - type: llm-rubric
      value: Does not mention being an AI or chat assistant

    # JavaScript as logic: prefer shorter outputs using a scoring function
    # (scores 1.0 at 100 characters or fewer, tapering to 0.0 at 1000).
    - type: javascript
      value: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900))

# Configure evaluation options.
evaluateOptions:
  # Useful to avoid getting rate limited by the provider API.
  delay: 1000

# Set up test cases. Each `vars` entry is a different test case.
tests:
  # Simple input.
  - vars:
      question: What is the deadline for submitting my application?

  - vars:
      question: What is life like on campus?

    # Per-test assertions.
    assert:
      # Output must contain the word "campus" (case-insensitive).
      - type: icontains
        value: campus
      - type: icontains
        value: activities

  - vars:
      question: What is the cost of attendance?
    assert:
      # For more information on model-graded evals, see https://promptfoo.dev/docs/configuration/expected-outputs/model-graded
      - type: llm-rubric
        value: Ensure that the output contains a number
```
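With the config saved, run the eval from the project root. A minimal sketch, assuming `promptfoo` was installed globally with npm (the same setup the CI workflow below uses):

```bash
# Run every test case against every provider in promptfooconfig.yaml.
promptfoo eval

# Open the local web UI to browse and compare results side by side.
promptfoo view
```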
If you get errors running these commands, make sure your shell is picking up the correct `promptfoo` binary. Check with `which promptfoo`.
Assertions can be graded by a model. An `llm-rubric` assertion passes the output to a grader LLM along with your plain-English criteria:

```yaml
assert:
  - type: llm-rubric
    value: "Response is professional and under 100 words"
```
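Rubric grading calls a second model behind the scenes, adding latency and cost per test. If you want control over which model does the grading, promptfoo lets you override the grader; a sketch assuming the `options.provider` override described in the promptfoo docs:

```yaml
defaultTest:
  options:
    # Model used to grade llm-rubric assertions.
    # Assumption: swap in any provider your account has access to.
    provider: openai:gpt-4.1-mini
```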
A `javascript` assertion, by contrast, is deterministic and free to run:

```yaml
assert:
  - type: javascript
    value: |
      output.length < 100 && output.includes('discount') && !output.includes('refund')
```
Real-life use case: finding the cheapest model that meets quality standards.
```yaml
providers:
  - openai:gpt-4                        # $0.03/1k tokens (estimated at time of writing)
  - openai:gpt-3.5-turbo                # $0.0015/1k tokens
  - anthropic:claude-3-haiku-20240307   # $0.00025/1k tokens
```
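Because every provider runs the same test suite, the results table shows directly which cheaper model still passes. You can also enforce a spend ceiling per test; a sketch assuming promptfoo's built-in `cost` assertion:

```yaml
defaultTest:
  assert:
    # Fail a test if the inference cost of producing the output
    # exceeds this threshold (in USD).
    - type: cost
      threshold: 0.002
```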
Prompt evals also fit into CI. For example, run them on every pull request with GitHub Actions:

.github/workflows/test-prompts.yml
```yaml
name: Test LLM Prompts

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
      - run: npm install -g promptfoo
      # `promptfoo eval` exits with a non-zero code when any assertion
      # fails, which fails the pull request check.
      - run: promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
A few closing tips:

- Keep prompts in separate files like `prompts/admission_prompt.txt`. This makes it easy to compare versions and roll back if needed (see the sketch below).
- Start with simple, deterministic assertions like `contains` and `not-contains`. Add LLM grading later, once you have baseline tests passing; as always, using LLMs adds cost and complexity to your system.
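For the file-based prompt tip, the syntax is the same `file://` reference shown commented out in the main config; a minimal sketch:

```yaml
prompts:
  # Load the prompt from a versioned text file instead of inlining it.
  - file://prompts/admission_prompt.txt
```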