russo¶

Testing framework for LLM tool-call accuracy — audio & text

Documentation: https://mohit2152sharma.github.io/russo

Source Code: https://github.com/mohit2152sharma/russo

russo is a testing framework for verifying that LLM agents make the correct tool calls when given audio (or text) input. Think of it as pytest for voice AI tool-calling accuracy.

Why russo?¶

Voice AI agents powered by LLMs increasingly use tool calling (function calling) to take actions — booking flights, controlling smart homes, querying databases. But how do you verify the agent calls the right tool with the right arguments when it hears a spoken command?

russo solves this with a simple pipeline:

Text Prompt → Synthesizer (TTS) → Agent Under Test → Evaluator → Pass/Fail

Key Features¶

Provider-agnostic — works with Gemini, OpenAI, or any custom agent via structural typing (protocols)
Audio-first — synthesize text prompts to audio, send to your agent, evaluate tool calls
pytest integration — use markers, fixtures, and familiar test patterns
Built-in caching — skip TTS on repeated runs, saving time and money
Extensible — swap synthesizers, agents, evaluators, and parsers without inheritance

Quick Example¶

import russo
from russo.synthesizers import GoogleSynthesizer
from russo.adapters import GeminiLiveAgent
from russo.evaluators import ExactEvaluator

result = await russo.run(
    prompt="Book a flight from Berlin to Rome for tomorrow",
    synthesizer=GoogleSynthesizer(api_key="..."),
    agent=GeminiLiveAgent(api_key="...", tools=[...]),
    evaluator=ExactEvaluator(),
    expect=[
        russo.tool_call("book_flight", from_city="Berlin", to_city="Rome"),
    ],
)

assert result.passed

Or with pytest:

import pytest
import russo

@pytest.mark.russo(
    prompt="Book a flight from NYC to LA",
    expect=[russo.tool_call("book_flight", from_city="NYC", to_city="LA")],
)
async def test_book_flight(russo_result):
    russo.assert_tool_calls(russo_result)

Architecture¶

russo uses a protocol-based design. You never need to inherit from base classes — if your object has the right methods, it works:

graph LR
    A[Text Prompt] --> B[Synthesizer]
    B --> C[Audio]
    C --> D[Agent]
    D --> E[AgentResponse]
    E --> F[Evaluator]
    F --> G[EvalResult]

Protocol	Method	Purpose
`Synthesizer`	`async synthesize(text) → Audio`	Convert text to audio
`Agent`	`async run(audio) → AgentResponse`	Run the agent under test
`Evaluator`	`evaluate(expected, actual) → EvalResult`	Compare tool calls
`ResponseParser`	`parse(raw) → AgentResponse`	Normalize provider responses

Next Steps¶

Installation — install russo and optional dependencies
First Test — write your first tool-call test
Tutorial — deep dive into every component
API Reference — full API documentation