Scorers
Scorers are used to score the output of your LLM call.
Autoevals is a great library of scorers to get you started.
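For example, here's a minimal sketch (assuming you've installed the autoevals package) that plugs autoevals' Levenshtein scorer straight into an eval:

import { evalite } from "evalite";
import { Levenshtein } from "autoevals";

evalite("My Eval", {
  data: async () => {
    // expected is needed because Levenshtein compares output to expected
    return [{ input: "Hello", expected: "Hello World!" }];
  },
  task: async (input) => {
    return input + " World!";
  },
  // Levenshtein scores the edit distance between output and expected
  scorers: [Levenshtein],
});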
Inline Scorers
If you don’t need your scorer to be reusable, you can define it inline.
import { evalite } from "evalite";

evalite("My Eval", {
  data: async () => {
    return [{ input: "Hello" }];
  },
  task: async (input) => {
    return input + " World!";
  },
  scorers: [
    {
      name: "Contains Paris",
      description: "Checks if the output contains the word 'Paris'.",
      scorer: ({ output }) => {
        return output.includes("Paris") ? 1 : 0;
      },
    },
  ],
});
Creating Reusable Scorers
If you have a scorer you want to use across multiple files, you can use createScorer to create a reusable scorer.
import { createScorer } from "evalite";

const containsParis = createScorer<string, string>({
  name: "Contains Paris",
  description: "Checks if the output contains the word 'Paris'.",
  scorer: ({ output }) => {
    return output.includes("Paris") ? 1 : 0;
  },
});

evalite("My Eval", {
  data: async () => {
    return [{ input: "Hello" }];
  },
  task: async (input) => {
    return input + " World!";
  },
  scorers: [containsParis],
});
The name and description of the scorer will be displayed in the Evalite UI.
Score Properties
The scorer function receives three properties on the object it's passed:
import { createScorer } from "evalite";

const containsParis = createScorer<string, string>({
  name: "Contains Paris",
  description: "Checks if the output contains the word 'Paris'.",
  scorer: ({ input, output, expected }) => {
    // input comes from `data`
    // expected also comes from `data`
    // output is the output of `task`
    return output.includes("Paris") ? 1 : 0;
  },
});
These are typed using the three type arguments passed to createScorer:
import { createScorer } from "evalite";

const containsParis = createScorer<
  string, // Type of 'input'
  string, // Type of 'output'
  string // Type of 'expected'
>({
  name: "Contains Word",
  description: "Checks if the output contains the specified word.",
  scorer: ({ output, input, expected }) => {
    // output is typed as string!
    return output.includes(expected) ? 1 : 0;
  },
});
If expected is omitted, it will be inferred from the type of output.
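For example, here's a sketch of a hypothetical exactMatch scorer that passes only the first two type arguments, leaving expected to be inferred:

import { createScorer } from "evalite";

// Only 'input' and 'output' are passed as type arguments,
// so 'expected' is inferred as string (the output type).
const exactMatch = createScorer<string, string>({
  name: "Exact Match",
  description: "Checks if the output exactly matches the expected value.",
  scorer: ({ output, expected }) => {
    return output === expected ? 1 : 0;
  },
});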
Scorer Metadata
You can provide metadata along with your custom scorer:
import { createScorer } from "evalite";

const containsParis = createScorer<string, string>({
  name: "Contains Paris",
  description: "Checks if the output contains the word 'Paris'.",
  scorer: ({ output }) => {
    return {
      score: output.includes("Paris") ? 1 : 0,
      metadata: {
        // Can be anything!
      },
    };
  },
});
This will be visible along with the score in the Evalite UI.
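As a sketch, the metadata can record whatever helps you understand the score later (the fields below are just illustrative):

import { createScorer } from "evalite";

const containsParis = createScorer<string, string>({
  name: "Contains Paris",
  description: "Checks if the output contains the word 'Paris'.",
  scorer: ({ output }) => {
    const found = output.includes("Paris");
    return {
      score: found ? 1 : 0,
      metadata: {
        // Illustrative fields; metadata can be anything
        found,
        outputLength: output.length,
      },
    };
  },
});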
Creating LLM-As-A-Judge Scorers
Here is a brief guide on building your own LLM-as-a-judge scorer.
We’re looking to improve this with a first-class guide in the future.
import { openai } from "@ai-sdk/openai";
import { generateObject } from "ai";
import { createScorer } from "evalite";
import { z } from "zod";

/**
 * Factuality scorer using OpenAI's GPT-4o model.
 */
export const Factuality = createScorer<string, string, string>({
  name: "Factuality",
  scorer: async ({ input, expected, output }) => {
    return checkFactuality({
      question: input,
      groundTruth: expected!,
      submission: output,
    });
  },
});

/**
 * Checks the factuality of a submission, using
 * OpenAI's GPT-4o model.
 */
const checkFactuality = async (opts: {
  question: string;
  groundTruth: string;
  submission: string;
}) => {
  const { object } = await generateObject({
    model: openai("gpt-4o-2024-11-20"),
    /**
     * Prompt taken from autoevals:
     *
     * {@link https://github.com/braintrustdata/autoevals/blob/5aa20a0a9eb8fc9e07e9e5722ebf71c68d082f32/templates/factuality.yaml}
     */
    prompt: `
      You are comparing a submitted answer to an expert answer on a given question. Here is the data:
      [BEGIN DATA]
      ************
      [Question]: ${opts.question}
      ************
      [Expert]: ${opts.groundTruth}
      ************
      [Submission]: ${opts.submission}
      ************
      [END DATA]

      Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
      The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
      (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
      (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
      (C) The submitted answer contains all the same details as the expert answer.
      (D) There is a disagreement between the submitted answer and the expert answer.
      (E) The answers differ, but these differences don't matter from the perspective of factuality.
    `,
    schema: z.object({
      answer: z.enum(["A", "B", "C", "D", "E"]).describe("Your selection."),
      rationale: z
        .string()
        .describe("Why you chose this answer. Be very detailed."),
    }),
  });
  /**
   * LLMs are well documented as being poor at generating
   * consistent numeric scores, so we ask for a letter grade
   * and map it to a score ourselves.
   */
  const scores = {
    A: 0.4,
    B: 0.6,
    C: 1,
    D: 0,
    E: 1,
  };

  return {
    score: scores[object.answer],
    metadata: {
      rationale: object.rationale,
    },
  };
};
/**
 * Use the Factuality eval like so:
 *
 * 1. The input (in data()) is a question.
 * 2. The expected output is the ground truth answer to the question.
 * 3. The output is the text to be evaluated.
 */
evalite("Factuality", {
  data: async () => {
    return [
      {
        // The question
        input: "What is the capital of France?",
        // The expected answer
        expected: "Paris",
      },
    ];
  },
  task: async (input) => {
    // Technically correct, but a nightmare for non-LLM
    // scorers to evaluate.
    return (
      "The capital of France is a city that starts " +
      "with a letter P, and ends in 'aris'."
    );
  },
  scorers: [Factuality],
});