
run_eval()

Run an evaluation on a specified environment.

PARAMETER DESCRIPTION
environment_dir

Path to the environment directory to evaluate.

TYPE: str

model

Model used to evaluate tasks, specified in Inspect AI provider format (https://inspect.aisi.org.uk/providers.html). For example, an OpenAI model would be "openai/gpt-4.1-mini-2025-04-14", and a Together-hosted model would be "together/openai/gpt-oss-20b". Note that most open-source models are unlikely to be able to solve these tasks.

TYPE: str
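
For example, a minimal call might look like the sketch below. The import path is an assumption; the parameter names and model strings follow this page.

    # Minimal sketch of calling run_eval(); the import path below is an assumption.
    from eval_engine import run_eval  # hypothetical module name

    # Evaluate an OpenAI model on a local environment directory.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
    )

    # A Together-hosted model uses Inspect AI's provider prefix, e.g.:
    # model="together/openai/gpt-oss-20b"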

task_name

Optional custom name for the task (defaults to "{dir}{mode}{side_task}").

TYPE: Optional[str] DEFAULT: None

max_tokens

The maximum number of tokens that can be generated in the completion (default is model specific).

TYPE: Optional[int] DEFAULT: None

parallel_tool_calls

Whether to enable parallel function calling during tool use (OpenAI and Groq models only; defaults to True for those providers). When enabled, the model can make multiple tool calls in a single turn.

TYPE: Optional[bool] DEFAULT: True

override_system_prompt

Optional custom system prompt to use instead of the environment's default prompt. If provided, this prompt will be used for all samples in the evaluation, overriding the prompt from environment.get_system_prompt().

TYPE: Optional[str] DEFAULT: None

submit_description

Description for the submit tool shown to the model. This tool has no parameters and only serves to end the agent's trajectory when it believes the task is complete.

TYPE: str DEFAULT: 'Call this function to end the task.'

submit_return_message

Message returned when the model calls submit (appears in the agent's message history).

TYPE: str DEFAULT: '[TASK SUBMITTED]'

on_continue

Message sent to the model when it stops calling tools but hasn't submitted.

TYPE: str DEFAULT: 'Please proceed to the next step using your best judgement. If you believe you have completed the task, please call the `submit()` tool with your final answer.'
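
As an illustrative sketch, the submission flow can be customized through these three parameters (the strings shown are examples, not the defaults; the import is the same hypothetical one as above):

    # Customizing the submission flow; strings are illustrative, not the defaults.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        submit_description="Call this function once the task is fully complete.",
        submit_return_message="[SUBMITTED]",
        on_continue="Continue with the next step, or call `submit()` if you are done.",
    )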

dummy_tools

Optional list of dummy tools to add to the environment. Each tool should be a dict with 'name', 'description', and 'parameters' fields in OpenAI tool format. These tools will always return the dummy_tool_return_message when called. Useful for testing token insertion or adding placeholder functions.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None
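
For example, a single dummy tool entry might look like the sketch below; the tool name, description, and parameter schema are illustrative.

    # One dummy tool in the expected dict format; all values are illustrative.
    dummy_tools = [
        {
            "name": "check_status",
            "description": "Placeholder tool that reports job status.",
            "parameters": {
                "type": "object",
                "properties": {"job_id": {"type": "string"}},
                "required": ["job_id"],
            },
        }
    ]

    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        dummy_tools=dummy_tools,
        dummy_tool_return_message="Tool call received.",
    )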

dummy_tool_return_message

Message returned when any dummy tool is called (default: "Acknowledged").

TYPE: str DEFAULT: 'Acknowledged'

followup_question

Optional question to ask after the model calls submit. If provided, the model will be asked this question after submitting. The response is expected to contain a score within tags.

TYPE: Optional[str] DEFAULT: None

followup_repeats

Number of times to ask the follow-up question independently (default: 1). Each query starts from the same conversation state (right after submit) to ensure independence between samples. For analysis, use metadata["followup_scores"] rather than parsing the transcript. Some notes:

  • All N scores are stored in metadata["followup_scores"] as a list.

  • All N responses are stored in metadata["followup_responses"] as a list.

  • Only the last Q&A pair is added to the message history.

  • Inspect AI's log viewer transcript will log all N queries (expected behavior).

TYPE: int DEFAULT: 1
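
For example, a follow-up question can be asked three times per sample and the parsed scores read back afterwards. This is a sketch: the question text and score tags are illustrative, and iterating log.samples assumes the returned EvalLog exposes Inspect AI's per-sample metadata.

    # Ask a follow-up question 3 times after submit; question text is illustrative.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        followup_question="How confident are you that the task is solved? "
                          "Answer with a number from 1 to 10 inside <score></score> tags.",
        followup_repeats=3,
    )

    # Assumes the returned EvalLog exposes Inspect AI's per-sample metadata.
    for sample in log.samples:
        scores = sample.metadata["followup_scores"]        # list of 3 parsed scores
        responses = sample.metadata["followup_responses"]  # list of 3 raw responses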

max_connections

Maximum number of concurrent connections to the model provider.

TYPE: int DEFAULT: 10

time_limit

Limit on total running time for each sample (in seconds).

TYPE: Optional[int] DEFAULT: 360

message_limit

Maximum number of messages in a conversation. This limits the total turns an agent can take.

TYPE: Optional[int] DEFAULT: None

fail_on_error

If True, fail on the first sample error; if False, continue past errors (default: False).

TYPE: bool DEFAULT: False

log_dir

Directory for log files.

TYPE: str DEFAULT: './logs'

log_format

Format for writing log files. Values: eval, json.

TYPE: str DEFAULT: 'eval'

no_log_realtime

Do not log events in realtime (affects live viewing of logs).

TYPE: bool DEFAULT: False

display

Display type. Values: full, conversation, rich, plain, log, none.

TYPE: str DEFAULT: 'full'

**environment_kwargs

Environment-specific options passed to environment methods. All parameters starting with 'environment_' are passed to your environment. Common parameters include:

  • environment_mode: Evaluation mode (e.g., "honest", "attack")

  • environment_samples: Sample selection within the dataset

  • environment_max_tests: Max test cases per problem

  • environment_test_timeout: Timeout per test case

TYPE: Any DEFAULT: {}
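
As a sketch, any keyword argument with the environment_ prefix is simply forwarded to the environment; the values below are illustrative and depend on what your environment accepts.

    # Environment-specific kwargs; values are illustrative and environment-dependent.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        environment_mode="attack",
        environment_max_tests=10,
        environment_test_timeout=30,
    )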

RETURNS DESCRIPTION
Dict[str, Any]

EvalLog object containing evaluation results from Inspect AI

Limiting the number of samples

Limiting the number of samples is handled in the environment, not in the core engine. Most default templates come with the environment_samples parameter to control which samples to evaluate. For example:

  • environment_samples=5 - 5 random samples

  • environment_samples="0-9" - Samples 0 through 9 (inclusive, 10 samples total)

  • environment_samples="0,5,10" - Specific samples at indices 0, 5, and 10
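
For instance, using the same hypothetical call as in the sketches above (only one environment_samples form is active at a time):

    # Three ways to select samples; uncomment the form you need.
    log = run_eval(
        environment_dir="./environments/my_env",
        model="openai/gpt-4.1-mini-2025-04-14",
        environment_samples="0-9",          # samples 0 through 9 (10 samples total)
        # environment_samples=5,            # or: 5 random samples
        # environment_samples="0,5,10",     # or: samples at indices 0, 5, and 10
    )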