# RAI Bench

RAI Bench is a package that provides benchmarks with ready-to-use tasks and a framework for creating new ones. It is designed to evaluate the performance of AI agents in various environments.

### Available Benchmarks

-   [Manipulation O3DE Benchmark](#manipulation-o3de-benchmark)
-   [Tool Calling Agent Benchmark](#tool-calling-agent-benchmark)
-   [VLM Benchmark](#vlm-benchmark)

## Manipulation O3DE Benchmark

This benchmark evaluates agent performance in robotic arm manipulation tasks within the O3DE simulation environment. It measures how well agents can process sensor data and use tools to manipulate objects in the environment.

### Framework Components

Manipulation O3DE Benchmark provides a framework for creating custom tasks and scenarios with these core components:

![Manipulation Benchmark Framework](../imgs/manipulation_benchmark.png)

### Task

The `Task` class is an abstract base class that defines the interface for tasks used in this benchmark.
Each concrete Task must implement:

-   prompts that will be passed to the agent
-   validation of simulation configurations
-   calculation of results based on the scene state

### Scenario

A `Scenario` represents a specific test case combining:

-   A task to be executed
-   A simulation configuration

### ManipulationO3DEBenchmark

The `ManipulationO3DEBenchmark` class manages the execution of scenarios and collects results. It provides:

-   Scenario execution management
-   Performance metrics tracking
-   Logs and results
-   The required robotic stack, provided as a `LaunchDescription`

### Available Tasks

The benchmark includes several predefined manipulation tasks:

1. **MoveObjectsToLeftTask** - Move specified objects to the left side of the table

2. **PlaceObjectAtCoordTask** - Place specified objects at specific coordinates

3. **PlaceCubesTask** - Place specified cubes adjacent to each other

4. **BuildCubeTowerTask** - Stack specified cubes to form a tower

5. **GroupObjectsTask** - Group specified objects of specified types together

Tasks are parametrizable so you can configure which objects should be manipulated and how much precision is needed to complete a task.

Tasks are scored on a scale from 0.0 to 1.0, where:

-   0.0 indicates no improvement or worse placement than the starting one
-   1.0 indicates perfect completion

The score is typically calculated as:

```
score = (correctly_placed_now - correctly_placed_initially) / initially_incorrect
```
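
A minimal sketch of that formula in Python. The function and parameter names are illustrative and not part of `rai_bench`. Clamping to the 0.0 to 1.0 range and the handling of a scene where nothing was misplaced initially are assumptions made for this example:

```python
def placement_score(correct_initially: int, correct_now: int, total_objects: int) -> float:
    """Illustrative score: share of initially misplaced objects that got fixed."""
    initially_incorrect = total_objects - correct_initially
    if initially_incorrect == 0:
        # Assumption: nothing needed fixing, so the task counts as complete
        # only if every object is still correctly placed.
        return 1.0 if correct_now == total_objects else 0.0
    raw = (correct_now - correct_initially) / initially_incorrect
    # Clamp so that a worse-than-initial placement scores 0.0 instead of going negative.
    return max(0.0, min(1.0, raw))


print(placement_score(correct_initially=1, correct_now=3, total_objects=5))  # 0.5
print(placement_score(correct_initially=2, correct_now=1, total_objects=5))  # 0.0
```

In the first call, 2 of the 4 initially misplaced objects were fixed, giving 0.5. In the second, the placement got worse, so the score is clamped to 0.0.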

### Available Scene Configs and Scenarios

You can find predefined scene configs in `rai_bench/manipulation_o3de/predefined/configs/`.

Predefined scenarios can be imported, for example, choosing tasks by difficulty:

```python
from rai_bench.manipulation_o3de import get_scenarios

get_scenarios(levels=["easy", "medium"])
```

## Tool Calling Agent Benchmark

This benchmark evaluates agent performance independently of any simulation, based only on the tool calls that the agent makes. To achieve this, it introduces tool mocks that can be adjusted for different tasks. As a result, the benchmark is more universal and much faster to run.

### Framework Components

![Tool Calling Benchmark Framework](../imgs/tool_calling_agent_benchmark.png)

### SubTask

The `SubTask` class is used to validate a single tool call. The following classes are available:

-   `CheckArgsToolCallSubTask` - verify that a certain tool was called with the expected arguments
-   `CheckTopicFieldsToolCallSubTask` - verify that a message published to a ROS 2 topic had the proper type and included the expected fields
-   `CheckServiceFieldsToolCallSubTask` - verify that a message sent to a ROS 2 service had the proper type and included the expected fields
-   `CheckActionFieldsToolCallSubTask` - verify that a message sent to a ROS 2 action had the proper type and included the expected fields

### Validator

The `Validator` class combines one or more subtasks into a single validation step. The following validators are available:

-   `OrderedCallsValidator` - requires a strict order of subtasks. The next subtask is validated only after the previous one has been completed. The validator passes when all subtasks pass.
-   `NotOrderedCallsValidator` - doesn't enforce the order of subtasks. Every subtask is validated against every tool call. The validator passes when all subtasks pass.
-   `OneFromManyValidator` - passes when any one of the given subtasks passes.
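
The difference between the three validators can be sketched in plain Python. This is a simplified illustration, not the `rai_bench` implementation: a subtask is modelled as a hypothetical function that returns `True` when a tool call satisfies it, a tool call is a plain dictionary, and the helper names and the `list_topics` tool name are made up for the example. It also assumes that calls which satisfy no subtask are simply skipped.

```python
from typing import Any, Callable, Dict, List

ToolCall = Dict[str, Any]
# Hypothetical stand-in for a SubTask: returns True when a tool call satisfies it.
SubTaskCheck = Callable[[ToolCall], bool]


def ordered_calls_pass(subtasks: List[SubTaskCheck], calls: List[ToolCall]) -> bool:
    """Pass when all subtasks are satisfied in the given order."""
    next_idx = 0  # index of the subtask currently waiting to be satisfied
    for call in calls:
        if next_idx < len(subtasks) and subtasks[next_idx](call):
            next_idx += 1  # move on only after the current subtask passed
    return next_idx == len(subtasks)


def not_ordered_calls_pass(subtasks: List[SubTaskCheck], calls: List[ToolCall]) -> bool:
    """Pass when every subtask is satisfied by some tool call, in any order."""
    return all(any(check(call) for call in calls) for check in subtasks)


def one_from_many_pass(subtasks: List[SubTaskCheck], calls: List[ToolCall]) -> bool:
    """Pass when at least one subtask is satisfied by some tool call."""
    return any(check(call) for check in subtasks for call in calls)


def called(tool_name: str) -> SubTaskCheck:
    """Build a check that only looks at the tool name (illustrative)."""
    return lambda call: call["name"] == tool_name


# `list_topics` and the topic name are made up for this example.
calls = [
    {"name": "list_topics", "args": {}},
    {"name": "get_ros2_image", "args": {"topic": "/camera/image_raw"}},
]
image_then_topics = [called("get_ros2_image"), called("list_topics")]

print(ordered_calls_pass(image_then_topics, calls))      # False: wrong order
print(not_ordered_calls_pass(image_then_topics, calls))  # True: order is ignored
print(one_from_many_pass([called("get_ros2_image")], calls))  # True
```

The same pair of subtasks fails the ordered check but passes the unordered one, because the agent listed topics before fetching the image.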

### Task

A `Task` represents a specific prompt and the set of tools available to the agent. A list of validators is assigned to it to validate the agent's performance.

??? info "Task class definition"

    ::: rai_bench.tool_calling_agent.interfaces.Task

As you can see, the framework is very flexible. Any SubTask can be combined into any Validator that can be later assigned to any Task.

Every Task needs to define its prompt and system prompt, which tools the agent will have available, how many tool calls are required to complete it, and how many optional tool calls are possible.

An optional tool call is one that is not required to pass the Task but shouldn't be considered an error. For example, `GetROS2RGBCameraTask`, which has the prompt `Get RGB camera image.`, requires one tool call with the `get_ros2_image` tool. Listing topics before doing so is also a valid approach, so in this case the number of optional tool calls is `1`.

### ToolCallingAgentBenchmark

The `ToolCallingAgentBenchmark` class manages the execution of tasks and collects results.

### Available Tasks

Predefined Tasks are available, grouped by category:

-   Basic - requires retrieving information from certain topics
-   Manipulation
-   Custom Interfaces - requires using messages with custom interfaces

Every Task is assigned a `complexity`, which reflects its difficulty.

When creating a Task, you can define a few parameters:

```python
from typing import Literal

from pydantic import BaseModel


class TaskArgs(BaseModel):
    """Holds the configurations specified by user"""

    extra_tool_calls: int = 0
    prompt_detail: Literal["brief", "descriptive"] = "brief"
    examples_in_system_prompt: Literal[0, 2, 5] = 0
```

-   `examples_in_system_prompt` - how many tool call examples are included in the system prompt, for example:

    -   `0`: `You are a ROS 2 expert that want to solve tasks. You have access to various tools that allow you to query the ROS 2 system. Be proactive and use the tools to answer questions.`
    -   `2`: `You are a ROS 2 expert that want to solve tasks. You have access to various tools that allow you to query the ROS 2 system. Be proactive and use the tools to answer questions. Example of tool calls: get_ros2_message_interface, args: {'msg_type': 'geometry_msgs/msg/Twist'} publish_ros2_message, args: {'topic': '/cmd_vel', 'message_type': 'geometry_msgs/msg/Twist', 'message': {linear: {x: 0.5, y: 0.0, z: 0.0}, angular: {x: 0.0, y: 0.0, z: 1.0}}}`

-   `prompt_detail` - how descriptive the Task prompt should be, for example:

    -   `brief`: "Get all camera images"
    -   `descriptive`: "Get all camera images from all available camera sources in the system.
        This includes both RGB color images and depth images.
        You can discover what camera topics are available and capture images from each."

        Descriptive prompts provide guidance and tips.

-   `extra_tool_calls` - how many extra tool calls an agent can make and still pass the Task, for example:
    -   `GetROS2RGBCameraTask` has 1 required tool call and 1 optional one. With `extra_tool_calls` set to 5, the agent can correct itself a few times and still pass with up to 7 tool calls. There are two types of invalid tool calls. In the first, the tool is used incorrectly and the agent receives an error, which makes it easier to self-correct. In the second, the tool is called properly, but it is the wrong tool or it is called with the wrong parameters. In that case the agent gets no error, so correcting is harder. Both cases count as an extra tool call.
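
The tool call budget from the `extra_tool_calls` example can be sketched as simple arithmetic. The helper names below are hypothetical and not part of `rai_bench`. The sketch assumes the limit is the plain sum of required, optional and extra calls, which matches the 7 calls quoted above:

```python
def max_allowed_tool_calls(required: int, optional: int, extra_tool_calls: int = 0) -> int:
    """Upper bound on tool calls that still lets the agent pass (assumed to be a plain sum)."""
    return required + optional + extra_tool_calls


def within_budget(calls_made: int, required: int, optional: int, extra_tool_calls: int = 0) -> bool:
    """True when the number of calls made does not exceed the allowed total."""
    return calls_made <= max_allowed_tool_calls(required, optional, extra_tool_calls)


# GetROS2RGBCameraTask from the text: 1 required call, 1 optional call, extra_tool_calls=5.
print(max_allowed_tool_calls(required=1, optional=1, extra_tool_calls=5))  # 7
print(within_budget(calls_made=8, required=1, optional=1, extra_tool_calls=5))  # False
```

With the default `extra_tool_calls` of 0, the same task would allow at most 2 tool calls.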

For details about every task, see `rai_bench/tool_calling_agent/tasks`.

## VLM Benchmark

The VLM Benchmark evaluates vision language models (VLMs). It includes a set of tasks containing questions related to images and evaluates the performance of an agent that returns its answers in a structured format.

### Running

To run the benchmark:

```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b --vendor ollama
```
