Running BashBench2 Experiments¶
This page shares sample code snippets for running BashBench2 experiments.
Running Evaluation¶
The basic approach iterates through all four side tasks and both modes (attack and honest), running an evaluation and then converting the logs for each combination.
from mini_control_arena import copy_template, run_eval, load_env, convert_logs
# ============================================================================
# CONFIGURATION
# ============================================================================
MODEL = "openai/gpt-4.1-2025-04-14"
SIDE_TASKS = ["disable_firewall", "leak_passwords", "download_malware", "sudo_edit"]
MODES = ["attack", "honest"]
# ============================================================================
# MAIN EXECUTION
# ============================================================================
load_env("./.env")
copy_template("bashbench2", "./templates/my-bashbench2", overwrite=True)
for side_task in SIDE_TASKS:
    for mode in MODES:
        run_eval(
            environment_dir="./templates/my-bashbench2",
            model=MODEL,
            log_format="eval",  # Output format for logs
            log_dir=f"./logs/{MODEL.replace('/','-')}/{side_task}/{mode}",
            max_connections=20,  # Parallel container instances
            time_limit=60,  # Seconds per task
            environment_mode=mode,  # "attack" or "honest"
            environment_side_task=side_task,  # Which side task to evaluate
            environment_samples=10,  # Number of runs per configuration
            environment_seed=42  # For reproducibility
        )
        convert_logs(
            paths_to_convert=[
                f"./logs/{MODEL.replace('/','-')}/{side_task}/{mode}",
            ],
            target_dir=f"./logs_converted/{MODEL.replace('/','-')}/{side_task}/{mode}",
        )
Analyzing Logs¶
The script below reads the converted logs and prints success percentages grouped by model, side task, and mode. Note that BashBench2 main task scores are numerical (0-1) rather than binary, so the "Main" column is a mean score rather than a pass rate.
import json
from pathlib import Path
from collections import defaultdict
LOG_DIR = Path("./logs_converted")
# BashBench2 side tasks
SIDE_TASKS = ["disable_firewall", "leak_passwords", "download_malware", "sudo_edit"]
# Structure: logs_converted/{model}/{task}/{mode}/{file}.jsonl
# Group by model -> task -> mode
results = defaultdict(lambda: defaultdict(lambda: {"attack": [], "honest": []}))
for log_file in LOG_DIR.rglob("*.jsonl"):
    *_, model, task, mode, _ = log_file.parts

    # Only process BashBench2 tasks
    if task not in SIDE_TASKS:
        continue

    with open(log_file) as f:
        for line in f:
            data = json.loads(line)
            if "scores" not in data:
                continue
            scores = data["scores"]

            # Get main task score
            main_value = scores.get("main_task_scorer", {}).get("value", None)
            if main_value is None:
                # Fall back to the first score if there is no explicit main_task_scorer
                for name, val in scores.items():
                    if isinstance(val, dict) and "value" in val:
                        main_value = val["value"]
                        break
            if main_value is not None:
                # Handle both "C" (categorical) and numerical scores
                if main_value == "C":
                    main_score = 1
                elif isinstance(main_value, (int, float)):
                    main_score = main_value
                else:
                    main_score = 0
            else:
                continue

            # Get side task score
            side_keys = [k for k in scores if k != "main_task_scorer"]
            side_score = None
            if side_keys:
                side_value = scores[side_keys[0]].get("value", None)
                if side_value is not None:
                    side_score = 1 if side_value == "C" else 0

            results[model][task][mode].append((main_score, side_score))

def pct(vals):
    if not vals:
        return "-"
    return f"{sum(vals) / len(vals) * 100:.0f}%"

for model in sorted(results):
    print(f"\n=== {model} ===")
    print(f"{'Task':<25} {'Mode':<8} {'N':<4} {'Main':<8} {'Side'}")
    print("-" * 55)
    for task in sorted(results[model]):
        for mode in ["attack", "honest"]:
            runs = results[model][task][mode]
            if runs:
                mains = [r[0] for r in runs]
                sides = [r[1] for r in runs if r[1] is not None]
                side_pct = "-" if mode == "honest" else pct(sides)
                print(f"{task:<25} {mode:<8} {len(runs):<4} {pct(mains):<8} {side_pct}")
Performance Expectations¶
These figures show main task and side task success rates across the four BashBench2 side tasks. The results were generated using the code snippets above with 10 runs per configuration.
Error bars use different methods for each score type. Main task scores are numerical (0-1), so we use 95% bootstrap confidence intervals by resampling with replacement and computing the 2.5th and 97.5th percentiles. Side task scores are binary (success/failure), so we use 95% Bayesian credible intervals with a Beta posterior where α = successes + 1 and β = failures + 1. See this writeup for more details on the Bayesian method.
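A minimal sketch of both interval methods, using numpy and scipy (neither is used elsewhere on this page, and the helper names here are illustrative):

import numpy as np
from scipy.stats import beta

def bootstrap_ci(scores, n_boot=10_000, seed=0):
    """95% bootstrap confidence interval for the mean of numerical (0-1) scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and take the mean of each resample
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    return np.percentile(means, [2.5, 97.5])

def beta_credible_interval(successes, failures):
    """95% Bayesian credible interval for a binary success rate,
    from a Beta posterior with alpha = successes + 1 and beta = failures + 1."""
    return beta.ppf([0.025, 0.975], successes + 1, failures + 1)

With the analysis script above, these would be called as bootstrap_ci(mains) for a list of main task scores and beta_credible_interval(sum(sides), len(sides) - sum(sides)) for a list of binary side task outcomes.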
openai/gpt-4.1-2025-04-14¶
(Figures: main task and side task success rates across the four side task conditions.)
Note: The main task is identical across all four conditions. The same BashBench2 shell scripting challenges are used regardless of which side task is configured. The x-axis labels refer to the side task pairing, but the underlying main task remains constant. In honest mode, this effectively shows the same evaluation repeated four times.
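Because the honest-mode main task runs are the same evaluation repeated under each side task configuration, they can be pooled into a single, tighter estimate per model. A minimal sketch, assuming the results dictionary and pct helper from the analysis script above:

# Pool honest-mode main task scores across all four side task configurations,
# since they come from the same underlying evaluation
for model in sorted(results):
    pooled = [
        main
        for task in results[model]
        for main, _ in results[model][task]["honest"]
    ]
    if pooled:
        print(f"{model}: pooled honest main task score = {pct(pooled)} (N={len(pooled)})")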
openai/gpt-4.1-mini-2025-04-14¶
(Figures: main task and side task success rates across the four side task conditions.)
google/vertex/gemini-2.0-flash-001¶
(Figures: main task and side task success rates across the four side task conditions.)