
QuackChat: From Recipes to Road Tests: Why Berkeley's New Way of Testing AI Changes Everything

QuackChat explores how Berkeley's Function Calling Leaderboard V3 transforms AI testing methodology. Key topics include:

  • Testing Philosophy: Why checking recipes isn't enough - we need to taste the cake
  • Evaluation Categories: Deep dive into 1,600 test cases across five distinct scenarios
  • Architecture Deep-Dive: How BFCL combines AST checking with executable verification
  • Real-World Examples: From fuel tanks to file systems - why state matters
  • Implementation Guide: Practical walkthrough of BFCL's testing pipeline

🎯 What You'll Learn Today

Hello Ducktypers! Today we're diving into Berkeley's approach to testing AI function calling capabilities. We'll cover:

  1. Why traditional response-based testing falls short
  2. How BFCL's state-based evaluation works in practice
  3. The five categories of test scenarios BFCL introduces
  4. A deep dive into their evaluation architecture
  5. Real-world examples showing why state matters

By the end of this issue, you'll understand why checking function syntax alone isn't enough, and how to think about testing AI systems in a more robust way.

Let me show you why this matters. We have this diagram below:

[Diagram: a Model Response feeds into an Evaluation Method that can follow one of two paths. The Traditional, Response-based path checks function names and parameters. The BFCL V3, State-based path (highlighted in green) tracks system state and verifies the final state of the File System, API States, and Database.]

Ducktypers, let me break it down for you. This diagram shows the fundamental difference between traditional evaluation and Berkeley's Function Calling Leaderboard approach.

At the top, we start with a Model Response - this could be anything from 'create a new file' to 'buy this stock'. Now, traditionally - and this is where it gets interesting - we'd follow the left path, what we call 'Response-based' evaluation.

In Response-based evaluation, we're basically doing a string comparison. Did the model call create_file() when we expected create_file()? Did it use exactly the parameters we expected? It's like grading a math test by checking if the student wrote exactly the same steps as the answer key.

But here's where BFCL V3 gets clever with State-based evaluation. Instead of checking if the model wrote the exact same steps, it checks if the final result is correct. Let me give you a real-world example:

# Response-based would mark these as different:
approach_1 = [
    "cd('/home')",
    "mkdir('test')"
]

approach_2 = [
    "pwd()",              # Check current directory
    "ls()",              # List contents first
    "cd('/home')",
    "mkdir('test')"
]

Both approaches end up with the same result - a new 'test' directory in '/home'. State-based evaluation would mark both as correct because it tracks what actually happened to the:

  • File System (did the directory get created?)
  • API States (did the authentication succeed?)
  • Database (was the record properly updated?)
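
In code, a state-based check only compares where we end up. Here's a minimal sketch under a toy assumption - I'm modelling the file system as a plain dictionary, which is not how BFCL's real execution environment works:

# Minimal sketch of a state-based check (toy dictionary state, not BFCL's real environment)
def state_based_check(final_state, expected_state):
    # Pass if the resulting system state matches, no matter which call sequence produced it
    return final_state == expected_state

expected_state = {"/home": ["test"]}           # '/home' should now contain 'test'
state_after_approach_1 = {"/home": ["test"]}   # cd + mkdir
state_after_approach_2 = {"/home": ["test"]}   # pwd + ls + cd + mkdir

print(state_based_check(state_after_approach_1, expected_state))  # True
print(state_based_check(state_after_approach_2, expected_state))  # True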

Think of it like cooking - Response-based evaluation checks if you followed the recipe exactly, while State-based evaluation just checks if your cake is delicious!

This is why we see the State-based path highlighted in green in our diagram - it's a much more practical way to evaluate AI models in real-world scenarios where there might be multiple valid approaches to solve the same problem.

Is this making sense to everyone? Drop a '🍰' in the comments if you prefer the cooking analogy, or a '📝' if the coding example helped more!

๐Ÿ” The Problem with Traditional Testing


Let's break down what the community behind Berkeley's Leaderboard observed. Here's an example:

# Traditional Evaluation (Too Rigid)
expected_calls = [
    "get_stock_info(symbol='NVDA')"
]

# What a Real Model Might Do (Actually Fine!)
actual_calls = [
    "search_company('NVIDIA')",  # First attempt
    "get_all_symbols()",         # Second attempt
    "get_stock_info('NVDA')"    # Finally succeeds
]

See the difference? The traditional approach would mark this as wrong, but the model actually did its job - it just took a more exploratory path to get there.
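
In code terms, the rigid comparison is literally just an equality check on the call strings. Here's a minimal sketch of that verdict - my own illustration, not BFCL's evaluation code:

# Response-based check on this exact transcript (my own sketch, not BFCL's code)
expected_calls = ["get_stock_info(symbol='NVDA')"]
actual_calls = [
    "search_company('NVIDIA')",   # exploratory first attempt
    "get_all_symbols()",          # second attempt
    "get_stock_info('NVDA')",     # finally the right call
]

# Exact string comparison against the expected sequence
print(expected_calls == actual_calls)  # False - marked wrong, even though the job got done

A state-based check like the one sketched earlier would instead ask whether the stock information actually got retrieved.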

📊 BFCL's New Evaluation Categories

Let's touch on the latest evaluation categories of BFCL. I've prepared the diagram below:

[Diagram: BFCL V3's test suite branches into Base Multi-Turn (200 base tests), Augmented Tests (800 complex tests), Missing Parameters (200 parameter tests), Missing Functions (200 function tests), and Long Context (200 context tests).]

And this is crucial, Ducktypers! Let me break down exactly what each of these test categories means, because this is where BFCL V3 gets really interesting.

Let me walk you through each category:

  1. Base Multi-Turn (200 Tests)

    # Example Base Test
    user_turn_1 = "Check my portfolio balance"
    assistant_turn_1 = get_portfolio_balance()
    user_turn_2 = "Buy 100 shares of that stock we discussed"
    assistant_turn_2 = place_order(shares=100)

    These are your foundation tests - straightforward conversations where all information is available. It's like ordering coffee: "I'd like a latte" → make_latte().

  2. Augmented Tests (800 Tests)

    # Example Augmented Test
    context = "User discussing NVIDIA stock performance"
    user_turn = "Buy some shares of that company"
    # Model needs to:
    # 1. Understand "that company" refers to NVIDIA
    # 2. Check current price
    # 3. Place order

    These are more complex scenarios requiring the model to connect dots across multiple turns.

  3. Missing Parameters (200 Tests)

    # Example Missing Parameter Test
    functions = [book_flight(origin, destination, date)]
    user_turn = "Book me a flight to New York"
    # Model should ask: "What's your departure city?"

    Here's where it gets tricky - the model must recognize when it needs more information and ask for it!

  4. Missing Functions (200 Tests)

    # Example Missing Function Test
    available_functions = [send_email(), check_balance()]
    user_turn = "Post this on Twitter"
    # Model should respond: "I don't have access to Twitter functions"

    The model needs to recognize when it simply can't do what's asked with the functions available.

  5. Long Context (200 Tests)

    # Example Long Context Test
    context = "Directory with 500+ files..."
    user_turn = "Find the Python file I edited yesterday"
    # Model must sift through lots of noise to find relevant info

    This tests how models handle information overload - can they still find the needle in the haystack?

If we look back at the diagram, we notice something: the distribution isn't even. That's intentional! The 800 Augmented Tests represent the most common real-world scenarios, while the specialized categories help catch specific edge cases.
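
To make these categories a bit more tangible, here's a hypothetical sketch of how a single test entry might be represented - the field names are my own illustration, not BFCL's actual dataset schema:

# Hypothetical test-case record (illustrative field names, not BFCL's actual schema)
test_case = {
    "category": "missing_parameters",  # one of the five categories above
    "available_functions": ["book_flight(origin, destination, date)"],
    "turns": [
        {
            "user": "Book me a flight to New York",
            # Expected behaviour: ask for the missing departure city
            "expected_behaviour": "request_missing_parameter",
            "missing": ["origin"],
        }
    ],
}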

Think of it like a driving test - the Base Tests are driving on an empty road, but the Augmented Tests throw in traffic, weather, and unexpected pedestrians. The specialized tests? Those are like checking if you know what to do when your GPS fails or you run out of gas!

Does this help explain why we need different types of tests? Has anyone here tried building test suites for their AI models? I'd love to hear your experiences in the comments!

🛠️ How BFCL Actually Works

Here's what a conceptual evaluation pipeline looks like:

class BFCLEvaluator:
    def evaluate_model(self, model, test_case):
        # 1. Initialize the API state from the test case configuration
        current_state = self.setup_api_state(test_case.initial_config)

        # 2. Walk through each turn of the conversation
        for turn in test_case.turns:
            # Get the model's response for this turn
            model_response = model.generate(
                question=turn.question,
                available_functions=turn.function_list,
                context=turn.context
            )

            # Execute the proposed function calls and carry the state forward
            current_state = self.execute_functions(
                model_response.function_calls,
                current_state=current_state
            )

            # Compare the resulting state with the ground-truth state for this turn
            if not self.compare_states(current_state, turn.expected_state):
                return False

        return True

But let me be very precise here, Ducktypers. What I showed above is a simplified conceptual implementation; the actual BFCL evaluation is straightforward to run - it is just a command-line tool. According to Berkeley's official documentation, the evaluation process involves:

  1. Installing the actual package:

    # This is from the official BFCL docs
    git clone https://github.com/ShishirPatil/gorilla.git
    cd gorilla/berkeley-function-call-leaderboard
    pip install -e .

  2. Running evaluations using their CLI tool:

    # Actual BFCL command from documentation
    bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY

The actual evaluation pipeline supports two types of models:

  • Function-Calling (FC) models that output structured JSON
  • Prompting models that output function call strings
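
The practical difference shows up in how the raw output has to be decoded before checking. Here's a minimal sketch of that idea - my own illustration of the decoding step, not BFCL's actual handler code:

import ast
import json

# Sketch of decoding the two output styles (illustration only, not BFCL's handlers)

# A function-calling (FC) model returns structured JSON describing the call
fc_output = '{"function": "get_stock_info", "parameters": {"symbol": "NVDA"}}'
fc_call = json.loads(fc_output)

# A prompting model returns a function-call string that has to be parsed
prompt_output = "get_stock_info(symbol='NVDA')"
call_node = ast.parse(prompt_output, mode="eval").body  # an ast.Call node
prompt_call = {
    "function": call_node.func.id,
    "parameters": {kw.arg: ast.literal_eval(kw.value) for kw in call_node.keywords},
}

print(fc_call == prompt_call)  # True - same call, two different surface forms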

For absolute clarity, let me show you an actual example from their documentation:

# Real example from BFCL docs

# For API-hosted models:
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1

# For locally-hosted models:
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend {vllm,sglang} --num-gpus 1

Rather than showing you a speculative implementation, I should point you to the official BFCL GitHub repository where you can see the actual implementation.

I want to really drive this point home. Let me draw out the BFCL architecture using a diagram that shows the flow:

[Diagram: BFCL evaluation architecture. Models (Gorilla, OpenAI, Anthropic, Mistral, and others) connect through a Handler Interface; the handler is initialized, takes the function calling data as input, and calls handler.inference(data) against the model's inference endpoint. The Evaluation Runner then decodes the model output along two paths: decode_ast produces AST output such as {func1: {param1: val1}}, and decode_executable produces executable output such as func1(param1=val1). These feed checker.ast_checker() and checker.executable_checker(), and both results roll up into the Leaderboard Statistics.]

Let me break this down step by step:

  1. Models Section (Left)

    • Each AI provider (Gorilla, OpenAI, etc.) needs its own specific handler
    • Some models support Function Calling (-FC), others need prompting
  2. Handler Section (Middle-Left)

    • The Handler initializes the connection to the model
    • Takes the evaluation data and formats it for the specific model
    • See how everything converges here?
  3. Runner Section (Middle)

    • This is where the actual evaluation happens
    • The model output gets decoded in two different ways:
      # AST Format (for structural checking)
      {"function": "create_file", "params": {"name": "test.txt"}}
      
      # Executable Format (for runtime checking)
      'create_file(name="test.txt")'
  4. Checker Section (Right)

    • AST Checker verifies the structure is correct
    • Executable Checker runs the functions and verifies results
    • Both feed into the final leaderboard statistics

See how everything converges at the Leaderboard Statistics? That's where BFCL combines both structural correctness AND execution success into a final score.

This is straight from their architecture - I've just made the flow more visible in our diagram. What's particularly clever is how they handle both Function Calling models and Prompt models through the same pipeline, just with different handlers.

Think about it like a driving test - the AST checker makes sure you know the rules of the road, while the Executable checker makes sure you can actually drive the car!
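
And to ground that analogy in code, here's a toy sketch of the two checkers - my own illustration, not the actual ast_checker or executable_checker implementations:

# Toy versions of the two checking paths (my own sketch, not BFCL's checker code)

def ast_check(decoded_call, expected):
    # Structural check: right function name and right parameter names?
    return (decoded_call["function"] == expected["function"]
            and set(decoded_call["params"]) == set(expected["params"]))

def executable_check(run_call, expected_state):
    # Runtime check: actually run the call against a toy environment and verify the result
    state = {"files": []}
    run_call(state)
    return state == expected_state

# AST path: compare the parsed structure
decoded = {"function": "create_file", "params": {"name": "test.txt"}}
print(ast_check(decoded, {"function": "create_file", "params": {"name": "test.txt"}}))  # True

# Executable path: run the call and compare the final state
print(executable_check(lambda s: s["files"].append("test.txt"), {"files": ["test.txt"]}))  # True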

🧪 Real Examples from BFCL

Let's look at a real failure case from their documentation:

# Initial Context
context = {
    "fuel_tank": 5,  # Current gallons
    "max_capacity": 50
}

# User Question
question = "Fill the fuel tank until we reach Rivermist. Save money, don't need full tank."

# Failed Model Response
model_response = {
    "function": "fillFuelTank",
    "parameters": {"fuel_amount": 50}
}

# Correct Ground Truth
ground_truth = [
    {"function": "displayCarStatus", "parameters": {"component": "fuel"}},
    {"function": "fillFuelTank", "parameters": {"fuel_amount": 44}}
]

This example shows why state-based evaluation matters - the model filled the tank completely instead of checking the current state first! Let me break it down step by step.

[Diagram: starting from an Initial State of 5 gallons in a 50-gallon tank, the Failed Approach jumps straight to fillFuelTank(fuel_amount=50) and overflows (5 + 50 > 50 max). The Correct Approach follows Step 1: Check Status (displayCarStatus), Step 2: Calculate Need (5 current + 44 needed = 49 total), and Step 3: Fill Tank (fillFuelTank(fuel_amount=44)), ending at 49 gallons and saving money.]

Let's analyze what's happening here:

  1. Initial State

    context = {
        "fuel_tank": 5,    # We're starting with 5 gallons
        "max_capacity": 50  # Tank can't hold more than 50
    }

    It's like checking your gas gauge - you need to know where you're starting!

  2. The Failed Approach

    model_response = {
        "function": "fillFuelTank",
        "parameters": {"fuel_amount": 50}
    }

    The model made THREE critical mistakes:

    • Didn't check current fuel level (5 gallons)
    • Tried to add 50 gallons to existing 5 (would overflow!)
    • Ignored the 'save money' requirement
  3. The Correct Approach

    ground_truth = [
        # Step 1: Check current status
        {"function": "displayCarStatus", "parameters": {"component": "fuel"}},
        # Step 2: Add only what's needed (44 gallons)
        {"function": "fillFuelTank", "parameters": {"fuel_amount": 44}}
    ]

    This solution:

    • Checks current state first
    • Calculates optimal amount (5 + 44 = 49 gallons, just under max)
    • Saves money by not filling completely

This is why state-based evaluation is so important! A traditional response-based checker might only verify if the function names and parameter types are correct. But state-based evaluation catches these logical errors:

  1. State Overflow (trying to add too much fuel)
  2. Efficiency Issues (wasting money on unnecessary fuel)
  3. Missing Prerequisites (not checking current status)

Think of it like cooking - you wouldn't just start adding ingredients without checking what's already in the pot, right?

Notice how the correct approach follows a logical flow:

  1. Check current state
  2. Calculate what's needed
  3. Take action

This is what we mean by state-based evaluation - it's not just about calling the right functions, but calling them in a way that makes sense given the current state of the system.
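
If we wrote that flow out in code, it might look like the sketch below - the plan_refuel helper and its field names are stand-ins I'm using for illustration, not the actual BFCL test functions:

# Sketch of the check -> calculate -> act flow (stand-in helper, not the actual BFCL API)
def plan_refuel(car_state, target_gallons):
    # Step 1: check the current state instead of assuming it
    current = car_state["fuel_tank"]
    max_capacity = car_state["max_capacity"]

    # Step 2: calculate only what is needed, never exceeding capacity
    needed = min(target_gallons, max_capacity) - current
    if needed <= 0:
        return None  # already enough fuel for the trip

    # Step 3: act with the computed amount
    return {"function": "fillFuelTank", "parameters": {"fuel_amount": needed}}

print(plan_refuel({"fuel_tank": 5, "max_capacity": 50}, target_gallons=49))
# {'function': 'fillFuelTank', 'parameters': {'fuel_amount': 44}}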

Questions about this example? Drop them in the comments below!

🎯 Key Takeaways

Let's recap what we've learned today, Ducktypers:

  1. Testing Evolution

    • Traditional response-based testing only checks syntax
    • State-based evaluation verifies actual outcomes
    • Multiple valid paths can lead to the same correct result
  2. BFCL's Innovation

    • 1,600 comprehensive test cases across five categories
    • Handles both function-calling and prompting models
    • Tests real-world scenarios including missing information and long contexts
  3. Practical Implementation

    • Two-path evaluation: AST checking and Executable verification
    • Unified pipeline for different model providers
    • Real API integration for authentic testing
  4. Why This Matters

    • More realistic evaluation of AI capabilities
    • Better alignment with real-world use cases
    • Catches logical errors traditional testing misses

Remember, Ducktypers: in the real world, it's not just about calling the right functions - it's about making the right decisions based on the current state of your system.

Next week, we'll explore how different models perform on these new test categories, but until then, keep questioning not just what your models output, but whether those outputs make sense in context.

This is Prof Rod, signing off from QuackChat!

Rod Rivera
