🎯 What You'll Learn Today
Hello Ducktypers! Today we're diving into Berkeley's approach to testing AI function calling capabilities. We'll cover:
- Why traditional response-based testing falls short
- How BFCL's state-based evaluation works in practice
- The five categories of test scenarios BFCL introduces
- A deep dive into their evaluation architecture
- Real-world examples showing why state matters
By the end of this issue, you'll understand why checking function syntax alone isn't enough, and how to think about testing AI systems in a more robust way.
Let me show you why this matters with the diagram below:
Ducktypers, let me break it down for you. This diagram shows the fundamental difference between traditional evaluation and Berkeley's Function Calling Leaderboard (BFCL) approach.
At the top, we start with a Model Response - this could be anything from 'create a new file' to 'buy this stock'. Now, traditionally - and this is where it gets interesting - we'd follow the left path, what we call 'Response-based' evaluation.
In Response-based evaluation, we're basically doing a string comparison. Did the model call create_file() when we expected create_file()? Did it use exactly the parameters we expected? It's like grading a math test by checking if the student wrote exactly the same steps as the answer key.
But here's where BFCL V3 gets clever with State-based evaluation. Instead of checking if the model wrote the exact same steps, it checks if the final result is correct. Let me give you a real-world example:
# Response-based would mark these as different:
approach_1 = [
"cd('/home')",
"mkdir('test')"
]
approach_2 = [
"pwd()", # Check current directory
"ls()", # List contents first
"cd('/home')",
"mkdir('test')"
]
Both approaches end up with the same result - a new 'test' directory in '/home'. State-based evaluation would mark both as correct because it tracks what actually happened to:
- File System (did the directory get created?)
- API States (did the authentication succeed?)
- Database (was the record properly updated?)
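To make that concrete, here's a minimal sketch of a state-based check for the file-system example above. This is my own illustration, not BFCL's code: we replay each call sequence against a tiny in-memory file system and compare only the final states.

# Minimal sketch of state-based checking (illustrative only, not BFCL's code).
def run_sequence(calls):
    state = {"cwd": "/", "dirs": {"/", "/home"}}  # starting file-system state
    for name, arg in calls:  # each call is a (function_name, argument) pair
        if name == "cd":
            state["cwd"] = arg
        elif name == "mkdir":
            state["dirs"].add(state["cwd"].rstrip("/") + "/" + arg)
        # read-only calls like pwd() or ls() don't change the state at all
    return state

approach_1 = [("cd", "/home"), ("mkdir", "test")]
approach_2 = [("pwd", None), ("ls", None), ("cd", "/home"), ("mkdir", "test")]

# State-based evaluation only asks: does '/home/test' exist at the end?
print(run_sequence(approach_1)["dirs"] == run_sequence(approach_2)["dirs"])  # True

Both traces produce the identical final state, so both pass - exactly the behavior the green path in the diagram describes.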
Think of it like cooking - Response-based evaluation checks if you followed the recipe exactly, while State-based evaluation just checks if your cake is delicious!
This is why we see the State-based path highlighted in green in our diagram - it's a much more practical way to evaluate AI models in real-world scenarios where there might be multiple valid approaches to solve the same problem.
Is this making sense to everyone? Drop a 🍰 in the comments if you prefer the cooking analogy, or a 💻 if the coding example helped more!
🔍 The Problem with Traditional Testing
Let's break down what the community behind Berkeley's Leaderboard observed. Here's an example:
# Traditional Evaluation (Too Rigid)
expected_calls = [
"get_stock_info(symbol='NVDA')"
]
# What a Real Model Might Do (Actually Fine!)
actual_calls = [
"search_company('NVIDIA')", # First attempt
"get_all_symbols()", # Second attempt
"get_stock_info('NVDA')" # Finally succeeds
]
See the difference? The traditional approach would mark this as wrong, but the model actually did its job - it just took a more exploratory path to get there.
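To see the gap in code, here's a hypothetical response-based checker next to a goal-oriented check. Both functions below are my own sketch, not the leaderboard's implementation.

# Illustrative sketch: why exact-match, response-based checking is too rigid.
expected_calls = ["get_stock_info(symbol='NVDA')"]

actual_calls = [
    "search_company('NVIDIA')",   # first, exploratory attempt
    "get_all_symbols()",          # second attempt
    "get_stock_info('NVDA')",     # the call that actually does the job
]

def response_based_check(expected, actual):
    # Marks the run wrong unless the call strings match exactly, in order.
    return expected == actual

def goal_reached(actual):
    # A state-style check only asks whether the stock info was ultimately fetched.
    return any(call.startswith("get_stock_info(") for call in actual)

print(response_based_check(expected_calls, actual_calls))  # False: marked wrong
print(goal_reached(actual_calls))                          # True: the job was done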
📊 BFCL's New Evaluation Categories
Let's walk through the latest evaluation categories of BFCL. I prepared the diagram below:
And this is crucial, Ducktypers! Let me break down exactly what each of these test categories means, because this is where BFCL V3 gets really interesting.
Let me walk you through each category:
- Base Multi-Turn (200 Tests)
  # Example Base Test
  user_turn_1 = "Check my portfolio balance"
  assistant_turn_1 = get_portfolio_balance()
  user_turn_2 = "Buy 100 shares of that stock we discussed"
  assistant_turn_2 = place_order(shares=100)
These are your foundation tests - straightforward conversations where all information is available. It's like ordering coffee: "I'd like a latte" → make_latte().
- Augmented Tests (800 Tests)
  # Example Augmented Test
  context = "User discussing NVIDIA stock performance"
  user_turn = "Buy some shares of that company"
  # Model needs to:
  # 1. Understand "that company" refers to NVIDIA
  # 2. Check current price
  # 3. Place order
These are more complex scenarios requiring the model to connect dots across multiple turns.
- Missing Parameters (200 Tests)
  # Example Missing Parameter Test
  functions = [book_flight(origin, destination, date)]
  user_turn = "Book me a flight to New York"
  # Model should ask: "What's your departure city?"
Here's where it gets tricky - the model must recognize when it needs more information and ask for it!
- Missing Functions (200 Tests)
  # Example Missing Function Test
  available_functions = [send_email(), check_balance()]
  user_turn = "Post this on Twitter"
  # Model should respond: "I don't have access to Twitter functions"
The model needs to recognize when it simply can't do what's asked with the functions available - I'll sketch both of these 'missing' checks in code right after this list.
- Long Context (200 Tests)
  # Example Long Context Test
  context = "Directory with 500+ files..."
  user_turn = "Find the Python file I edited yesterday"
  # Model must sift through lots of noise to find relevant info
This tests how models handle information overload - can they still find the needle in the haystack?
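As promised, here's a small sketch of the reasoning the Missing Parameters and Missing Functions categories are probing. The function specs and requests below are made up for illustration; the point is simply that a model must check what's available and what's provided before acting.

# Illustrative sketch of the Missing Parameters / Missing Functions checks.
# The function specs and requests here are hypothetical.
available_functions = {
    "book_flight": {"required": ["origin", "destination", "date"]},
    "send_email": {"required": ["to", "subject", "body"]},
}

def plan_call(function_name, provided_args):
    if function_name not in available_functions:
        # Missing Functions case: the right move is to refuse, not hallucinate.
        return f"I don't have access to a '{function_name}' function."
    missing = [p for p in available_functions[function_name]["required"]
               if p not in provided_args]
    if missing:
        # Missing Parameters case: the right move is a clarifying question.
        return "I need more information: " + ", ".join(missing) + "."
    return f"Calling {function_name} with {provided_args}"

# "Book me a flight to New York": destination known, origin and date missing
print(plan_call("book_flight", {"destination": "New York"}))
# "Post this on Twitter": no such function is available
print(plan_call("post_tweet", {"text": "Hello"}))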
If we look back at the diagram, we notice something: the distribution isn't even, and that's intentional! The 800 Augmented Tests represent the most common real-world scenarios, while the specialized categories help catch specific edge cases.
Think of it like a driving test - the Base Tests are driving on an empty road, but the Augmented Tests throw in traffic, weather, and unexpected pedestrians. The specialized tests? Those are like checking if you know what to do when your GPS fails or you run out of gas!
Does this help explain why we need different types of tests? Has anyone here tried building test suites for their AI models? I'd love to hear your experiences in the comments!
🛠️ How BFCL Actually Works
Here's what a conceptual evaluation pipeline looks like:
class BFCLEvaluator:
    def evaluate_model(self, model, test_case):
        # 1. Initialize the API state from the test case configuration
        state = self.setup_api_state(test_case.initial_config)

        # 2. Walk through each turn in the conversation
        for turn in test_case.turns:
            # Get the model's response for this turn
            model_response = model.generate(
                question=turn.question,
                available_functions=turn.function_list,
                context=turn.context
            )

            # Execute the proposed calls and carry the state forward turn by turn
            state = self.execute_functions(
                model_response.function_calls,
                current_state=state
            )

            # Compare the resulting state with the ground-truth state for this turn
            if not self.compare_states(state, turn.expected_state):
                return False

        return True
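For readers who like to see data shapes, here's a hypothetical sketch of what a multi-turn test case for a loop like this might look like. The field names mirror my pseudocode above, not BFCL's actual schema.

# Hypothetical data shapes for the conceptual evaluator above (not BFCL's format).
from dataclasses import dataclass, field

@dataclass
class Turn:
    question: str
    function_list: list
    context: dict
    expected_state: dict   # ground-truth state after this turn

@dataclass
class TestCase:
    initial_config: dict
    turns: list = field(default_factory=list)

test_case = TestCase(
    initial_config={"cwd": "/", "dirs": ["/", "/home"]},
    turns=[
        Turn(
            question="Create a 'test' directory under /home",
            function_list=["cd", "mkdir", "ls", "pwd"],
            context={},
            expected_state={"cwd": "/home", "dirs": ["/", "/home", "/home/test"]},
        )
    ],
)
print(test_case.turns[0].expected_state)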
But let me be very precise here, Ducktypers. What I showed above is a simplified conceptual sketch; the actual BFCL evaluation is much more straightforward to run because it is just a command-line tool. According to Berkeley's official documentation, the evaluation process involves:
- Installing the actual package:
# This is from the official BFCL docs
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .
- Running evaluations using their CLI tool:
# Actual BFCL command from documentation
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY
The actual evaluation pipeline supports two types of models:
- Function-Calling (FC) models that output structured JSON
- Prompting models that output function call strings
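The difference between those two model types is easiest to see side by side. Here's an illustrative sketch of the two output styles a handler has to normalize; the exact payload shape is simplified, not BFCL's literal format.

import json

# Function-Calling (FC) models return structured tool calls, roughly like this:
fc_style_output = {
    "name": "get_stock_info",
    "arguments": {"symbol": "NVDA"},
}

# Prompting models return the call as plain text that must be parsed:
prompt_style_output = "get_stock_info(symbol='NVDA')"

# A handler's job is to turn both into one internal representation.
print(json.dumps(fc_style_output))
print(prompt_style_output)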
For absolute clarity, let me show you an actual example from their documentation:
# Real example from BFCL docs
# For API-hosted models:
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1
# For locally-hosted models:
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend {vllm,sglang} --num-gpus 1
Rather than showing you a speculative implementation, I should point you to the official BFCL GitHub repository where you can see the actual implementation.
I want to really drive this point home. Let me draw out the BFCL architecture using a diagram that shows the flow:
Let me break this down step by step:
- Models Section (Left)
  - Each AI provider (Gorilla, OpenAI, etc.) needs their own specific handler
  - Some models support Function Calling (-FC), others need prompting
- Handler Section (Middle-Left)
  - The Handler initializes the connection to the model
  - Takes the evaluation data and formats it for the specific model
  - [Points to the flow] See how everything converges here?
- Runner Section (Middle)
  - This is where the actual evaluation happens
  - The model output gets decoded in two different ways:
    # AST Format (for structural checking)
    {"function": "create_file", "params": {"name": "test.txt"}}
    # Executable Format (for runtime checking)
    'create_file(name="test.txt")'
- Checker Section (Right)
  - AST Checker verifies the structure is correct
  - Executable Checker runs the functions and verifies results
  - Both feed into the final leaderboard statistics
See how everything converges at the Leaderboard Statistics? That's where BFCL combines both structural correctness AND execution success into a final score.
This is straight from their architecture - I've just made the flow more visible in our diagram. What's particularly clever is how they handle both Function Calling models and Prompt models through the same pipeline, just with different handlers.
Think about it like a driving test - the AST checker makes sure you know the rules of the road, while the Executable checker makes sure you can actually drive the car!
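To make the AST-versus-executable distinction concrete, here's a small sketch of the idea in plain Python. This is my illustration of the two decoding paths, not BFCL's checker code: the AST-style check inspects the call's structure without running anything, while the executable-style check actually runs it and looks at the resulting state.

import ast

call_string = 'create_file(name="test.txt")'

# --- AST-style check: is the structure right, without executing anything? ---
tree = ast.parse(call_string, mode="eval")
call = tree.body
assert isinstance(call, ast.Call)
function_name = call.func.id                                  # 'create_file'
params = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
print(function_name, params)                                  # create_file {'name': 'test.txt'}

# --- Executable-style check: run the call and verify its effect on state. ---
created_files = set()

def create_file(name):
    created_files.add(name)
    return True

eval(call_string)                    # executes create_file(name="test.txt")
print("test.txt" in created_files)   # True: the state actually changed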
🧪 Real Examples from BFCL
Let's look at a real failure case from their documentation:
# Initial Context
context = {
"fuel_tank": 5, # Current gallons
"max_capacity": 50
}
# User Question
question = "Fill the fuel tank until we reach Rivermist. Save money, don't need full tank."
# Failed Model Response
model_response = {
"function": "fillFuelTank",
"parameters": {"fuel_amount": 50}
}
# Correct Ground Truth
ground_truth = [
{"function": "displayCarStatus", "parameters": {"component": "fuel"}},
{"function": "fillFuelTank", "parameters": {"fuel_amount": 44}}
]
This example shows why state-based evaluation matters - the model filled the tank completely instead of checking the current state first! It's a perfect illustration, so let me break it down step by step and analyze what's happening here:
- Initial State
  context = {
      "fuel_tank": 5,      # We're starting with 5 gallons
      "max_capacity": 50   # Tank can't hold more than 50
  }
It's like checking your gas gauge - you need to know where you're starting!
- The Failed Approach
  model_response = {
      "function": "fillFuelTank",
      "parameters": {"fuel_amount": 50}
  }
The model made THREE critical mistakes:
- Didn't check current fuel level (5 gallons)
- Tried to add 50 gallons to existing 5 (would overflow!)
- Ignored the 'save money' requirement
- The Correct Approach
  ground_truth = [
      # Step 1: Check current status
      {"function": "displayCarStatus", "parameters": {"component": "fuel"}},
      # Step 2: Add only what's needed (44 gallons)
      {"function": "fillFuelTank", "parameters": {"fuel_amount": 44}}
  ]
This solution:
- Checks current state first
- Calculates optimal amount (5 + 44 = 49 gallons, just under max)
- Saves money by not filling completely
This is why state-based evaluation is so important! A traditional response-based checker might only verify if the function names and parameter types are correct. But state-based evaluation catches these logical errors:
- State Overflow (trying to add too much fuel)
- Efficiency Issues (wasting money on unnecessary fuel)
- Missing Prerequisites (not checking current status)
Think of it like cooking - you wouldn't just start adding ingredients without checking what's already in the pot, right?
Notice how the correct approach follows a logical flow:
- Check current state
- Calculate what's needed
- Take action
This is what we mean by state-based evaluation - it's not just about calling the right functions, but calling them in a way that makes sense given the current state of the system.
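To tie this together, here's a small runnable simulation of the fuel-tank state - my own sketch, not BFCL's harness - showing why a state check catches the failed response while the ground-truth sequence passes.

# Illustrative simulation of the fuel-tank example (not BFCL's actual harness).
def fill_fuel_tank(state, fuel_amount):
    if state["fuel_tank"] + fuel_amount > state["max_capacity"]:
        raise ValueError("Fuel overflow: tank capacity exceeded")
    state["fuel_tank"] += fuel_amount
    return state

# Failed model response: add 50 gallons on top of the existing 5 (overflow).
failed_state = {"fuel_tank": 5, "max_capacity": 50}
try:
    fill_fuel_tank(failed_state, 50)
except ValueError as error:
    print("Failed response caught by the state check:", error)

# Ground truth: check the status first, then add only 44 gallons (5 + 44 = 49).
correct_state = {"fuel_tank": 5, "max_capacity": 50}
fill_fuel_tank(correct_state, 44)
print("Final fuel level:", correct_state["fuel_tank"], "gallons")  # 49, under capacity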
Questions about this example? Drop them in the comments below!
🎯 Key Takeaways
Let's recap what we've learned today, Ducktypers:
- Testing Evolution
  - Traditional response-based testing only checks syntax
  - State-based evaluation verifies actual outcomes
  - Multiple valid paths can lead to the same correct result
- BFCL's Innovation
  - 1,600 comprehensive test cases across five categories
  - Handles both function-calling and prompting models
  - Tests real-world scenarios including missing information and long contexts
- Practical Implementation
  - Two-path evaluation: AST checking and Executable verification
  - Unified pipeline for different model providers
  - Real API integration for authentic testing
- Why This Matters
  - More realistic evaluation of AI capabilities
  - Better alignment with real-world use cases
  - Catches logical errors traditional testing misses
Remember, Ducktypers: in the real world, it's not just about calling the right functions - it's about making the right decisions based on the current state of your system.
Next week, we'll explore how different models perform on these new test categories, but until then, keep questioning not just what your models output, but whether those outputs make sense in context.
This is Prof Rod, signing off from QuackChat!