
run_eval()

Run an evaluation on a specified environment.

PARAMETER DESCRIPTION
environment_dir

Path to the environment directory to evaluate.

TYPE: str

model

Model used to evaluate tasks, specified in Inspect AI provider format (https://inspect.aisi.org.uk/providers.html). For example, an OpenAI model would be "openai/gpt-4.1-mini-2025-04-14", and a Together-hosted model would be "together/openai/gpt-oss-20b". Note that most open-source models are unlikely to be able to solve these tasks.

TYPE: str
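
For example, a minimal call might look like the sketch below. The import path is an assumption; the parameter names and model strings follow this page.

    # Minimal sketch of calling run_eval(); the import path below is an assumption.
    from eval_engine import run_eval  # hypothetical module name

    # Evaluate an OpenAI model on a local environment directory.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
    )

    # A Together-hosted model uses Inspect AI's provider prefix, e.g.:
    # model="together/openai/gpt-oss-20b"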

task_name

Optional custom name for the task (defaults to "{dir}{mode}{side_task}").

TYPE: Optional[str] DEFAULT: None

max_tokens

The maximum number of tokens that can be generated in the completion (default is model specific).

TYPE: Optional[int] DEFAULT: None

parallel_tool_calls

Whether to enable parallel function calling during tool use (OpenAI and Groq models only; defaults to True for those providers). When enabled, the model can make multiple tool calls in a single turn.

TYPE: Optional[bool] DEFAULT: True

override_system_prompt

Optional custom system prompt to use instead of the environment's default prompt. If provided, this prompt will be used for all samples in the evaluation, overriding the prompt from environment.get_system_prompt().

TYPE: Optional[str] DEFAULT: None

submit_description

Description for the submit tool shown to the model. This tool has no parameters and only serves to end the agent's trajectory when it believes the task is complete.

TYPE: str DEFAULT: 'Call this function to end the task.'

submit_return_message

Message returned when the model calls submit (appears in the agent's message history).

TYPE: str DEFAULT: '[TASK SUBMITTED]'

on_continue

Message sent to the model when it stops calling tools but hasn't submitted.

TYPE: str DEFAULT: 'Please proceed to the next step using your best judgement. If you believe you have completed the task, please call the `submit()` tool with your final answer.'
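
As an illustrative sketch, the submission flow can be customized through these three parameters (the strings shown are examples, not the defaults; the import is the same hypothetical one as above):

    # Customizing the submission flow; strings are illustrative, not the defaults.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        submit_description="Call this function once the task is fully complete.",
        submit_return_message="[SUBMITTED]",
        on_continue="Continue with the next step, or call `submit()` if you are done.",
    )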

dummy_tools

Optional list of dummy tools to add to the environment. Each tool should be a dict with 'name', 'description', and 'parameters' fields in OpenAI tool format. These tools will always return the dummy_tool_return_message when called. Useful for testing token insertion or adding placeholder functions.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None
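
For example, a single dummy tool entry might look like the sketch below; the tool name, description, and parameter schema are illustrative.

    # One dummy tool in the expected dict format; all values are illustrative.
    dummy_tools = [
        {
            "name": "check_status",
            "description": "Placeholder tool that reports job status.",
            "parameters": {
                "type": "object",
                "properties": {"job_id": {"type": "string"}},
                "required": ["job_id"],
            },
        }
    ]

    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        dummy_tools=dummy_tools,
        dummy_tool_return_message="Tool call received.",
    )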

dummy_tool_return_message

Message returned when any dummy tool is called (default: "Acknowledged").

TYPE: str DEFAULT: 'Acknowledged'

followup_question

Optional question to ask after the model calls submit. If provided, the model will be asked this question after submitting. The response is expected to contain a score within tags.

TYPE: Optional[str] DEFAULT: None

followup_repeats

Number of times to ask the follow-up question independently (default: 1). Each query starts from the same conversation state (right after submit) to ensure independence between samples. For analysis, use metadata["followup_scores"] rather than parsing the transcript. Some notes:

  • All N scores are stored in metadata["followup_scores"] as a list.

  • All N responses are stored in metadata["followup_responses"] as a list.

  • Only the last Q&A pair is added to the message history.

  • Inspect AI's log viewer transcript will log all N queries (expected behavior).

TYPE: int DEFAULT: 1
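
For example, a follow-up question can be asked three times per sample and the parsed scores read back afterwards. This is a sketch: the question text and score tags are illustrative, and iterating log.samples assumes the returned EvalLog exposes Inspect AI's per-sample metadata.

    # Ask a follow-up question 3 times after submit; question text is illustrative.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        followup_question="How confident are you that the task is solved? "
                          "Answer with a number from 1 to 10 inside <score></score> tags.",
        followup_repeats=3,
    )

    # Assumes the returned EvalLog exposes Inspect AI's per-sample metadata.
    for sample in log.samples:
        scores = sample.metadata["followup_scores"]        # list of 3 parsed scores
        responses = sample.metadata["followup_responses"]  # list of 3 raw responses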

max_connections

Maximum number of concurrent connections to the model provider.

TYPE: int DEFAULT: 10

time_limit

Limit on total running time for each sample (in seconds).

TYPE: Optional[int] DEFAULT: 360

message_limit

Maximum number of messages in a conversation. This limits the total turns an agent can take.

TYPE: Optional[int] DEFAULT: None

fail_on_error

If True, fail on the first sample error; if False, continue past errors (default: False).

TYPE: bool DEFAULT: False

log_dir

Directory for log files.

TYPE: str DEFAULT: './logs'

log_format

Format for writing log files. Values: eval, json.

TYPE: str DEFAULT: 'eval'

no_log_realtime

Do not log events in realtime (affects live viewing of logs).

TYPE: bool DEFAULT: False

display

Display type. Values: full, conversation, rich, plain, log, none.

TYPE: str DEFAULT: 'full'

**environment_kwargs

Environment-specific options passed to environment methods. All parameters starting with 'environment_' are passed to your environment. Common parameters include:

  • environment_mode: Evaluation mode (e.g., "honest", "attack")

  • environment_samples: Sample selection within the dataset

  • environment_max_tests: Max test cases per problem

  • environment_test_timeout: Timeout per test case

TYPE: Any DEFAULT: {}
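
As a sketch, any keyword argument with the environment_ prefix is simply forwarded to the environment; the values below are illustrative and depend on what your environment accepts.

    # Environment-specific kwargs; values are illustrative and environment-dependent.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        environment_mode="attack",
        environment_max_tests=10,
        environment_test_timeout=30,
    )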

RETURNS DESCRIPTION
Dict[str, Any]

EvalLog object containing evaluation results from Inspect AI

Limiting the number of samples

Limiting the number of samples is handled in the environment, not in the core engine. Most default templates come with the environment_samples parameter to control which samples to evaluate. For example:

  • environment_samples=5 - 5 random samples

  • environment_samples="0-9" - Samples 0 through 9 (inclusive, 10 samples total)

  • environment_samples="0,5,10" - Specific samples at indices 0, 5, and 10
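
For instance, using the same hypothetical call as in the sketches above (only one environment_samples form is active at a time):

    # Three ways to select samples; uncomment the form you need.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        environment_samples="0-9",          # samples 0 through 9 (10 samples total)
        # environment_samples=5,            # or: 5 random samples
        # environment_samples="0,5,10",     # or: samples at indices 0, 5, and 10
    )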