# eval

Pure Dart helpers for LLM evaluations, built on top of `package:test`.

`eval(...)` is the center of the package. It wraps normal Dart tests so you can run the same evaluation across multiple models and multiple runs while collecting pass/fail counts and score statistics.
## Installation

```yaml
dependencies:
  eval: ^0.0.2
```

## Core concepts

- `eval(...)` wraps normal `package:test` groups/tests. It runs one evaluation body for every service in `apiServices` and every run in `numberOfRunsPerLLM`.
- `expect(...)` is the sync assertion wrapper. It still behaves like a normal Dart test expectation, but inside `eval(...)` it also tracks pass/fail counts.
- `expectAsync(...)` is for `AsyncLlmMatcher` matchers such as LLM-as-judge and RAG metrics. It tracks pass/fail counts and records numeric scores, which power the final statistics output.
- `evalCompare(...)` is for comparing multiple prompt or generation variants.
If you only want ordinary Dart tests, you can still use `test(...)`. This package just gives you a better default wrapper for LLM evaluations.
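For example, a plain deterministic test runs exactly as it would under `package:test` (a minimal sketch using only re-exported APIs):

```dart
// An ordinary unit test: no services, no repeated runs, no LLM calls.
test('slug generation is deterministic', () {
  final slug = 'Hello World'.toLowerCase().replaceAll(' ', '-');
  expect(slug, equals('hello-world'));
});
```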
## Imports

Most users only need:

```dart
import 'package:eval/eval.dart';
```

`package:eval/eval.dart` re-exports most of `package:test/test.dart`, so common matchers such as `equals`, `contains`, `isA`, and `isNot` are already available.

If you also import `package:test/test.dart` directly, hide `expect` to avoid a name collision with eval's statistics-aware `expect(...)`:

```dart
import 'package:test/test.dart' hide expect;
```

## Quick start

```dart
import 'dart:io';
import 'package:eval/eval.dart';

Future<void> main() async {
  final apiKey = Platform.environment['ANTHROPIC_API_KEY'] ?? '';
  if (apiKey.isEmpty) {
    throw StateError('Set ANTHROPIC_API_KEY before running this eval.');
  }

  await eval(
    'answers geography questions',
    (apiService) async {
      final answer = await apiService.sendRequest(
        'Answer in one short sentence: What is the capital of France?',
      );

      expect(answer, containsIgnoreCase('paris'));
      expect(answer, sentenceCountBetween(1, 2));

      await expectAsync(
        answer,
        answersQuestion(
          'What is the capital of France?',
          apiService: apiService,
        ),
      );
      await expectAsync(
        answer,
        isNotToxic(apiService: apiService),
      );
    },
    apiServices: [
      ExampleClaudeService(
        defaultModel: ExampleClaudeModel.haiku45,
        apiKey: apiKey,
      ),
      ExampleClaudeService(
        defaultModel: ExampleClaudeModel.sonnet45,
        apiKey: apiKey,
      ),
    ],
    numberOfRunsPerLLM: 3,
    verbose: true,
  );
}
```

That gives you a normal test run plus eval-specific output:
- one grouped run per model
- repeated runs per model
- pass/fail tracking for every `expect(...)` and `expectAsync(...)`
- score summaries for async judge-based assertions
Use raw `test(...)` when you want a normal unit test. Use `eval(...)` when you want LLM evaluation behavior:
- fan out the same eval across multiple services/models
- repeat each eval multiple times
- keep using familiar matcher-style assertions
- collect aggregate statistics such as pass rates and judge score summaries
- print a final report at the end of the run
In other words: `eval(...)` is a wrapper around normal Dart tests, not a different testing model.
## Sync matchers

Use regular `expect(...)` for deterministic matchers. The sections below intentionally list every public sync matcher exported by the package.

Available string matchers:

- `containsIgnoreCase(...)`: substring check without case sensitivity
- `matchesPattern(...)`: regex or pattern match
- `containsAllWords(...)`: all words must appear as whole words
- `containsAnyOf(...)`: at least one candidate substring must appear
- `containsNoneOf(...)`: none of the candidate substrings may appear
- `wordCountBetween(...)`: inclusive word-count bounds
- `sentenceCountBetween(...)`: inclusive sentence-count bounds

```dart
expect(text, containsIgnoreCase('hello'));
expect(text, matchesPattern(r'\d{3}-\d{4}'));
expect(text, containsAllWords(['important', 'keywords']));
expect(text, containsAnyOf(['error', 'warning']));
expect(text, containsNoneOf(['forbidden']));
expect(text, wordCountBetween(50, 100));
expect(text, sentenceCountBetween(3, 5));
```

Available JSON matchers:

- `isValidJson`: any valid JSON value
- `isJsonObject`: JSON that decodes to an object / `Map`
- `isJsonArray`: JSON that decodes to an array / `List`
- `hasJsonKey(...)`: top-level key exists
- `hasJsonPath(...)`: dot-notation path exists, including array indices
- `hasJsonPathValue(...)`: dot-notation path equals a specific value

```dart
expect(jsonString, isValidJson);
expect(jsonString, isJsonObject);
expect('[{"id":1},{"id":2}]', isJsonArray);
expect(jsonString, hasJsonKey('status'));
expect(jsonString, hasJsonPath('user.address.city'));
expect(jsonString, hasJsonPathValue('status', 'active'));
```

Available JSON schema/data-shape matchers:

- `matchesSchema(...)`: validate against the supported JSON Schema subset
- `hasRequiredFields(...)`: require named fields with expected Dart types
- `jsonArrayLengthBetween(...)`: require an array path length within bounds
- `fieldOneOf(...)`: require a field/path value to be in an allowed set
- `fieldHasType(...)`: require a field/path value to have a specific type

```dart
expect(jsonString, matchesSchema({
  'type': 'object',
  'required': ['name', 'email'],
  'properties': {
    'name': {'type': 'string', 'minLength': 1},
    'email': {'type': 'string'},
    'age': {'type': 'integer', 'minimum': 0},
  },
}));

expect(jsonString, hasRequiredFields({
  'id': int,
  'name': String,
  'active': bool,
}));

expect(jsonString, jsonArrayLengthBetween('items', 1, 5));
expect(jsonString, fieldOneOf('status', ['pending', 'active', 'done']));
expect(jsonString, fieldHasType('age', int));
```

Available frontmatter/body matchers:
- `hasValidFrontmatter`: valid YAML frontmatter is present
- `hasFrontmatterTitle`: frontmatter contains a non-empty `title`
- `hasFrontmatterKey(...)`: key exists in frontmatter
- `hasFrontmatterValue(...)`: key has an exact value
- `frontmatterKeyMatches(...)`: key value satisfies another matcher
- `hasMarkdownBody`: non-empty body content exists after frontmatter
- `bodyContains(...)`: body contains a substring
- `bodyMatches(...)`: body satisfies another matcher

```dart
expect(markdown, hasValidFrontmatter);
expect(markdown, hasFrontmatterTitle);
expect(markdown, hasFrontmatterKey('title'));
expect(markdown, hasFrontmatterValue('draft', false));
expect(markdown, frontmatterKeyMatches('tags', contains('dart')));
expect(markdown, hasMarkdownBody);
expect(markdown, bodyContains('# Heading'));
expect(markdown, bodyMatches(contains('Introduction')));
```

Available frontmatter schema/type helpers:

- `frontmatterMatchesSchema(...)`: validate frontmatter against the supported JSON Schema subset
- `frontmatterHasRequiredFields(...)`: require frontmatter fields with expected Dart types
- `frontmatterArrayLengthBetween(...)`: require a frontmatter array length within bounds
- `frontmatterFieldOneOf(...)`: require a frontmatter field to be in an allowed set
- `frontmatterFieldHasType(...)`: require a frontmatter field to have a specific type

```dart
expect(markdown, frontmatterMatchesSchema({
  'type': 'object',
  'required': ['title'],
  'properties': {
    'title': {'type': 'string'},
    'tags': {'type': 'array', 'items': {'type': 'string'}},
  },
}));

expect(markdown, frontmatterHasRequiredFields({
  'title': String,
  'draft': bool,
}));

expect(markdown, frontmatterArrayLengthBetween('tags', 1, 5));
expect(markdown, frontmatterFieldOneOf('status', ['draft', 'published']));
expect(markdown, frontmatterFieldHasType('count', int));
```

Available distance/similarity matchers:
- `editDistanceLessThan(...)`: raw Levenshtein distance must be below a threshold
- `editDistanceRatio(...)`: normalized edit-distance ratio must be below a threshold
- `jaroWinklerSimilarity(...)`: Jaro-Winkler similarity must be at least a threshold

```dart
expect(text, editDistanceLessThan('expected', 3));
expect(text, editDistanceRatio('expected', 0.2));
expect(text, jaroWinklerSimilarity('expected', 0.9));
```

## LLM-as-judge matchers

Judge matchers return `AsyncLlmMatcher`, so they must be used with `await expectAsync(...)`. The list below intentionally names every public LLM-as-judge matcher.

Inside `eval(...)`, the simplest pattern is to pass the current run's `apiService` explicitly:

```dart
await expectAsync(
  answer,
  semanticallySimilarTo(
    'Expected meaning',
    threshold: 0.8,
    apiService: apiService,
  ),
);

await expectAsync(
  answer,
  isFaithfulTo(
    sourceDocument,
    threshold: 0.9,
    apiService: apiService,
  ),
);

await expectAsync(answer, isNotToxic(apiService: apiService));
await expectAsync(answer, isNotBiased(apiService: apiService));
```

Available judge matchers include:

- `semanticallySimilarTo(...)`: semantic similarity to a reference answer
- `answersQuestion(...)`: whether the answer addresses a question
- `isFaithfulTo(...)`: whether the answer stays grounded in a source/context
- `isNotToxic(...)`: toxicity score must stay below a threshold
- `isNotBiased(...)`: bias score must stay below a threshold
You can also set `llmMatcherService` globally if you want one default judge service for the whole file, but explicit `apiService:` wiring is usually clearer.
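A sketch of the global default, assuming `llmMatcherService` is a settable top-level variable exported by the package (the mechanism named above):

```dart
// Hypothetical wiring: judge matchers in this file fall back to this
// service whenever apiService: is not passed explicitly.
llmMatcherService = ExampleClaudeService(
  defaultModel: ExampleClaudeModel.sonnet45,
  apiKey: apiKey,
);

await expectAsync(answer, isNotToxic());
```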
## RAG matchers

The list below intentionally names every public RAG matcher.

Available RAG matchers:

- `contextPrecision(...)`: how much of the retrieved context is relevant
- `contextRecall(...)`: whether the retrieved context covers the needed facts
- `answerGroundedness(...)`: whether the answer is supported by the contexts
- `answerRelevancy(...)`: whether the answer addresses the query
- `answerCorrectness(...)`: whether the answer matches the ground truth
- `ragScore(...)`: weighted combined RAG score across multiple metrics

```dart
await expectAsync(
  answer,
  contextPrecision(
    contexts: retrievedDocs,
    query: 'What causes climate change?',
    threshold: 0.7,
    apiService: apiService,
  ),
);

await expectAsync(
  answer,
  contextRecall(
    contexts: retrievedDocs,
    groundTruth: 'The expected factual answer',
    threshold: 0.8,
    apiService: apiService,
  ),
);

await expectAsync(
  answer,
  answerGroundedness(
    contexts: retrievedDocs,
    threshold: 0.8,
    apiService: apiService,
  ),
);

await expectAsync(
  answer,
  answerRelevancy(
    query: 'What caused the outage?',
    threshold: 0.7,
    apiService: apiService,
  ),
);

await expectAsync(
  answer,
  answerCorrectness(
    groundTruth: 'The expected factual answer',
    threshold: 0.8,
    apiService: apiService,
  ),
);

await expectAsync(
  answer,
  ragScore(
    contexts: retrievedDocs,
    query: 'What caused the outage?',
    groundTruth: 'The expected factual answer',
    weights: {
      'groundedness': 2.0,
      'precision': 1.0,
      'recall': 1.0,
      'relevancy': 1.0,
    },
    threshold: 0.8,
    apiService: apiService,
  ),
);
```

Use `evaluateRag(...)` when you want the component scores and extra metadata,
not just a pass/fail assertion.
```dart
final result = await evaluateRag(
  answer: generatedAnswer,
  contexts: retrievedDocs,
  query: 'What is quantum computing?',
  groundTruth: 'Quantum computing uses quantum mechanics...',
  apiService: apiService,
);

print(result.score);
print(result.contextPrecision);
print(result.contextRecall);
print(result.answerGroundedness);
print(result.answerRelevancy);
print(result.relevantContextIndices);
print(result.unsupportedClaims);
print(result.reason);
```
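Because `evaluateRag(...)` returns plain data, the fields printed above also work with ordinary assertions (a small sketch; `greaterThanOrEqualTo` comes from the `package:test` re-exports):

```dart
// Gate on the combined score and require every claim to be supported.
expect(result.score, greaterThanOrEqualTo(0.8));
expect(result.unsupportedClaims, isEmpty);
```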
## Statistics

`eval(...)` always tracks pass/fail counts. Score statistics are available when you use `expectAsync(...)`, because async judge matchers produce numeric scores.
Typical verbose output looks like this:
```text
=== Eval Results ===

By Model:
  ExampleClaudeService claude-sonnet-4-5-20250929:
    Pass Rate: 83% (5/6)
    Score: mean=0.81, std=0.06, min=0.74, max=0.88
    Percentiles: p50=0.81, p90=0.87, p95=0.88

Overall:
  Pass Rate: 83% (10/12)
```
You can also work with the statistics types directly:

```dart
final scores = [0.85, 0.72, 0.91, 0.68, 0.79];

final stats = EvalStatistics.compute(
  scores,
  passed: scores.where((s) => s >= 0.7).length,
  failed: scores.where((s) => s < 0.7).length,
);
print(stats.format(verbose: true));

final aggregate = AggregateStatistics.compute(
  testRuns: {
    'summary with ClaudeService sonnet 1': (3, 1, 4),
    'summary with ClaudeService sonnet 2': (4, 0, 4),
  },
  testScores: {
    'summary with ClaudeService sonnet 1': [0.8, 0.9, 0.7, 0.6],
    'summary with ClaudeService sonnet 2': [0.85, 0.95, 0.9, 0.88],
  },
);
print(aggregate.format(verbose: true));
```

## evalCompare

`evalCompare(...)` is for comparing generation strategies. The `apiServices` parameter lists the models that generate outputs. The matchers are the judges. If the matchers need an LLM service, pass an explicit judge service when you construct them:

```dart
final judgeService = ExampleClaudeService(
  defaultModel: ExampleClaudeModel.sonnet45,
  apiKey: apiKey,
);

await evalCompare(
  'Compare prompt strategies',
  variants: {
    'concise': (service) => service.sendRequest('Be brief: $question'),
    'detailed': (service) =>
        service.sendRequest('Explain thoroughly: $question'),
  },
  apiServices: [claudeService, gptService],
  matchers: [
    answersQuestion(question, apiService: judgeService),
    isNotToxic(apiService: judgeService),
  ],
  numberOfRuns: 10,
  passThreshold: 0.7,
);
```

## Custom API services

To evaluate your own backend, extend `APICallService` and implement `apiCallImpl`:

```dart
import 'dart:convert';
import 'dart:typed_data';

import 'package:eval/eval.dart';
import 'package:http/http.dart' as http;

enum MyModel {
  small('my-model-small');

  final String modelId;
  const MyModel(this.modelId);
}

class MyLlmService extends APICallService<MyModel> {
  MyLlmService({required String apiKey})
      : super(
          baseUrl: 'https://api.example.com/v1/chat',
          apiKey: apiKey,
          defaultModel: MyModel.small,
          timeout: Duration.zero,
          stateful: false,
        );

  @override
  Future<String> apiCallImpl(
    String prompt,
    String? systemPrompt,
    MyModel modelName, {
    Uint8List? imageBytes,
    Uint8List? fileBytes,
  }) async {
    final response = await http.post(
      Uri.parse(baseUrl),
      headers: {
        'Authorization': 'Bearer $apiKey',
        'Content-Type': 'application/json',
      },
      body: jsonEncode({
        'model': modelName.modelId,
        'messages': [
          if (systemPrompt != null)
            {'role': 'system', 'content': systemPrompt},
          {'role': 'user', 'content': prompt},
        ],
      }),
    );

    final json = jsonDecode(response.body) as Map<String, dynamic>;
    return json['content'] as String;
  }
}
```
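Once defined, the service plugs into `eval(...)` like any other (a short sketch; `MY_API_KEY` is a hypothetical environment variable, and `dart:io` supplies `Platform`):

```dart
import 'dart:io';

Future<void> main() async {
  final service = MyLlmService(
    apiKey: Platform.environment['MY_API_KEY'] ?? '',
  );

  await eval(
    'custom service answers a greeting',
    (apiService) async {
      final answer = await apiService.sendRequest('Say hello.');
      expect(answer, containsIgnoreCase('hello'));
    },
    apiServices: [service],
    numberOfRunsPerLLM: 2,
  );
}
```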
## Development

```shell
dart analyze
dart test
dart pub publish --dry-run
```

## License

MIT