run_eval()¶
Run an evaluation on a specified environment.
| PARAMETER | DESCRIPTION |
|---|---|
| `environment_dir` | Path to the environment directory to evaluate. |
| `model` | Model used to evaluate tasks, in Inspect AI provider format (https://inspect.aisi.org.uk/providers.html): for example, `"openai/gpt-4.1-mini-2025-04-14"` for an OpenAI model, or `"together/openai/gpt-oss-20b"` for a Together-hosted model. Note that most open-source models are unlikely to solve these tasks. |
| `task_name` | Optional custom name for the task (defaults to `"{dir}{mode}{side_task}"`). |
| `max_tokens` | Maximum number of tokens that can be generated in the completion (default is model-specific). |
| `parallel_tool_calls` | Whether to enable parallel function calling during tool use (defaults to True for OpenAI and Groq models). When enabled, the model can make multiple tool calls in a single turn. OpenAI and Groq only. |
| `override_system_prompt` | Optional custom system prompt to use instead of the environment's default. If provided, this prompt is used for all samples in the evaluation, overriding the prompt from `environment.get_system_prompt()`. |
| `submit_description` | Description for the submit tool shown to the model. This tool has no parameters and only serves to end the agent's trajectory when it believes the task is complete. |
| `submit_return_message` | Message returned when the model calls submit (appears in the agent's message history). |
| `on_continue` | Message sent to the model when it stops calling tools but hasn't submitted. |
| `dummy_tools` | Optional list of dummy tools to add to the environment. Each tool should be a dict with 'name', 'description', and 'parameters' fields in OpenAI tool format (a sketch of one such dict appears after this table). These tools always return `dummy_tool_return_message` when called. Useful for testing token insertion or adding placeholder functions. |
| `dummy_tool_return_message` | Message returned when any dummy tool is called (default: "Acknowledged"). |
| `followup_question` | Optional question to ask after the model calls submit. If provided, the model is asked this question after submitting; the response is expected to contain a score. |
| `followup_repeats` | Number of times to ask the follow-up question independently (default: 1). Each query starts from the same conversation state (right after submit) to ensure independence between samples. All N scores are stored in `metadata["followup_scores"]` and all N responses in `metadata["followup_responses"]`, each as a list; only the last Q&A pair is added to the message history, although Inspect AI's log viewer transcript will log all N queries (expected behavior). For analysis, use `metadata["followup_scores"]` rather than parsing the transcript (see the sketch after the returns table). |
| `max_connections` | Maximum number of concurrent connections to the model provider. |
| `time_limit` | Limit on total running time for each sample (in seconds). |
| `message_limit` | Maximum number of messages in a conversation; this caps the total turns an agent can take. |
| `fail_on_error` | If True, fail on the first sample error; if False, continue (default: False). |
| `log_dir` | Directory for log files. |
| `log_format` | Format for writing log files. Values: eval, json. |
| `no_log_realtime` | Do not log events in realtime (affects live viewing of logs). |
| `display` | Display type. Values: full, conversation, rich, plain, log, none. |
| `**environment_kwargs` | Environment-specific options passed to environment methods. All parameters starting with 'environment_' are forwarded to your environment; a common example is `environment_samples` (see "Limiting the number of samples" below). |
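As referenced in the `dummy_tools` row, here is a minimal sketch of one such tool dict, following the flat 'name'/'description'/'parameters' shape described above. The tool name and schema are hypothetical, purely for illustration:

```python
# Hypothetical dummy tool in the shape described for dummy_tools:
# 'name', 'description', and 'parameters' (a JSON Schema object).
weather_stub = {
    "name": "get_weather",  # hypothetical tool name
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City to look up."},
        },
        "required": ["city"],
    },
}
```

Passed as `dummy_tools=[weather_stub]`, any call the model makes to this tool returns `dummy_tool_return_message` ("Acknowledged" by default).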
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | EvalLog object containing evaluation results from Inspect AI |
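Since `followup_repeats` stores its results in sample metadata, analysis should read those lists directly rather than parse the transcript. A minimal sketch, assuming the Inspect AI EvalLog return described above, where each sample exposes a `metadata` dict:

```python
# Hedged sketch: assumes `log` is the Inspect AI EvalLog returned by run_eval,
# with followup_question and followup_repeats=N set on the call.
for sample in log.samples:
    scores = sample.metadata["followup_scores"]        # all N scores, as a list
    responses = sample.metadata["followup_responses"]  # all N raw responses
    mean_score = sum(scores) / len(scores)
    print(f"sample {sample.id}: mean follow-up score {mean_score:.2f}")
```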
Limiting the number of samples
Limiting the number of samples is handled in the environment, not in the core engine. Most default templates accept the `environment_samples` parameter to control which samples to evaluate. For example:
- `environment_samples=5` - a random 5 samples
- `environment_samples="0-9"` - samples 0 through 9 (inclusive, 10 samples total)
- `environment_samples="0,5,10"` - the specific samples at indices 0, 5, and 10