Tool Use in AI Agents: Connecting LLMs to APIs and Functions – The Complete Guide


Introduction

Imagine asking an AI assistant to book a flight, update your CRM, and send a follow-up email to a client—all in one conversation. A traditional language model, no matter how sophisticated, cannot do this. It exists in a frozen state, limited to its training data, unable to interact with the outside world. It can describe how to book a flight, but it cannot actually book it.

This is where tool use—also called function calling or tool calling—transforms everything. Tool use provides the critical I/O layer that breaks LLM isolation, allowing models to output structured instructions that external systems can execute. It bridges the gap between probabilistic reasoning and deterministic execution, turning passive chatbots into active agents that can access real-time data, modify system states, and complete complex workflows.

According to Skywork.ai’s 2026 guide, “Modern agent systems don’t just ‘chat.’ They plan, call tools, browse, and synthesize grounded outputs you can audit. Done well, they feel like reliable coworkers.”

In this comprehensive guide, you’ll learn:

  • What tool use is and why it’s the foundation of agentic AI
  • The anatomy of tool calls: from discovery to execution
  • How to design robust tool interfaces with strict schemas
  • The Model Context Protocol (MCP) and its role as the “USB-C for AI”
  • Security, sandboxing, and governance for production deployments
  • Step-by-step implementation with OpenAI, Anthropic, and Google Gemini
  • Real-world enterprise patterns and evaluation strategies

Let’s dive into the mechanics of giving AI agents the ability to act.


Part 1: What Is Tool Use in AI Agents?

Definition and Core Concept

Tool use (also called function calling or tool calling) is the mechanism that enables large language models to output structured data—typically JSON—that instructs an external system to perform an action, rather than generating free text. This capability bridges three critical gaps:

| Capability | Description | Example |
|---|---|---|
| Real-Time Data Access | Overcomes training cutoffs by fetching live information | Weather API, stock prices, database queries |
| Action Execution | Transforms the LLM from passive observer to active participant | Sending emails, updating CRMs, deploying code |
| Structured Interoperability | Forces probabilistic reasoning into deterministic, machine-readable formats | JSON schemas for legacy system integration |

Without tool use, LLMs are isolated reasoning engines. With tool use, they become active agents that can perceive, plan, and act in the digital world.

The Mental Model: From Text Generation to Structured Output

The shift from traditional prompting to tool use requires a mental model change. Instead of asking the model to know things, you ask it to look things up or do things.

*Figure 1: Traditional LLMs generate text for humans to act upon; tool-using agents generate structured instructions for systems to execute*
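To make the contrast concrete, here is a minimal sketch (the `get_weather` tool name and its arguments are hypothetical): a free-text reply leaves the work to a human, while a structured tool call is something a runtime can parse and execute.

```python
import json

# Free text: a human must read this and act on it.
free_text = "To check the weather, visit a forecast site and search for Paris."

# Structured output: a machine-readable instruction a runtime can execute.
# Tool name and arguments are illustrative, not a real API.
tool_call = {
    "name": "get_weather",
    "arguments": {"location": "Paris", "unit": "celsius"},
}

payload = json.dumps(tool_call)
print(payload)
```

The runtime can now dispatch on `name` and validate `arguments` against a schema, instead of scraping intent out of prose.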

A Taxonomy of Tool-Using Agents

Tool-using agents typically combine three patterns, often blended within one workflow:

| Pattern | Description | Best For |
|---|---|---|
| Function/Tool Calling | Model outputs structured JSON to invoke predefined functions | Single-step actions, API calls |
| ReAct (Reason + Act) | Interleaves observation, reasoning, and action in a loop | Multi-step, exploratory tasks |
| Plan-and-Execute | Creates a full plan upfront, then executes steps | Complex workflows, cost optimization |

Part 2: The Anatomy of Tool Calling – A 6-Step Agentic Loop

Early documentation described a simple 5-step loop. In modern production environments using dynamic discovery, this loop has evolved into a 6-step process.

*Figure 2: The 6-step agentic loop for production tool calling*

Step 0: Tool Discovery

Before the LLM can call a tool, the system must find the right tools from potentially thousands of options. Loading definitions for 50+ tools into the system prompt creates two problems: cost and latency (58 tools can consume ~55k tokens) and accuracy degradation (more options mean lower selection accuracy).

Solutions like Anthropic’s Tool Search address this by allowing the model to “search” for tools rather than having them all pre-loaded. The impact:

  • Token Reduction: Dynamic loading reduces token usage by 85% (from ~77k to ~8.7k)
  • Accuracy Improvement: Accuracy improved from 79.5% to 88.1% with extensive tool catalogs

Step 1: Tool Definition (JSON Schema)

Tools are defined using JSON schemas that act as deterministic contracts between the LLM and your system. A well-defined schema includes:

```json
{
  "name": "update-hotel",
  "description": "Updates an existing hotel booking with new dates",
  "parameters": {
    "type": "object",
    "additionalProperties": false,
    "required": ["booking_id", "checkin_date", "checkout_date"],
    "properties": {
      "booking_id": {"type": "string", "minLength": 1},
      "checkin_date": {"type": "string", "format": "date"},
      "checkout_date": {"type": "string", "format": "date"},
      "room_type": {"type": "string", "enum": ["standard", "deluxe", "suite"]}
    }
  },
  "timeouts": {"call_ms": 20000, "retries": 2}
}
```

Step 2: User Prompt

The user provides a natural language request that implies the need for external action.

Step 3: LLM Prediction

The model analyzes the prompt against the available tool definitions and outputs a structured JSON payload—the “tool call.”

Step 4: Execution (The Bottleneck)

This is the most complex step in production. The application code receives the JSON, handles authentication, executes the logic against the external API, and manages errors. As Composio’s 2026 guide notes, “Knowing which tool to call is trivial compared to the infrastructure required to call it successfully.”

Step 5: Final Response

The tool output feeds back to the LLM to generate the human-readable confirmation.
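The whole loop can be sketched in a few lines. Everything below is stubbed for illustration: `llm_predict` and `llm_respond` stand in for real model calls, and `execute` returns canned data rather than hitting an API.

```python
# Step 0-1: a tiny catalog of tool definitions (normally discovered dynamically).
TOOLS = {
    "get_weather": {
        "description": "Get current weather for a location",
        "parameters": {"location": "string"},
    }
}

def llm_predict(prompt, tools):
    # Step 3: stand-in for the model; a real LLM would emit this JSON itself.
    return {"name": "get_weather", "arguments": {"location": "Paris"}}

def execute(call):
    # Step 4: application code validates the call, then runs it.
    if call["name"] not in TOOLS:
        raise ValueError(f"Unknown tool: {call['name']}")
    return {"temp_c": 18, "conditions": "cloudy"}  # canned observation

def llm_respond(prompt, observation):
    # Step 5: the tool output feeds back for a human-readable answer.
    return f"It is {observation['temp_c']}°C and {observation['conditions']}."

prompt = "What's the weather in Paris?"    # Step 2: user prompt
call = llm_predict(prompt, TOOLS)          # Step 3: LLM prediction
observation = execute(call)                # Step 4: execution
answer = llm_respond(prompt, observation)  # Step 5: final response
print(answer)
```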


Part 3: Designing Robust Tool Interfaces

Tools as Contracts

In production systems, tools are contracts first, code second. Make the contract unambiguous, validate strictly, and fail safely.

Implementation Checklist:

| Requirement | Description |
|---|---|
| Strict Schemas | Use additionalProperties: false, enums, min/max bounds |
| Pre-Call Validation | Validate JSON against the schema before execution |
| Post-Call Verification | Verify output shape and sanity |
| Timeouts | Set explicit timeouts per tool (e.g., 20 seconds) |
| Retries | Exponential backoff for transient failures |
| Idempotency | Design calls so retries are safe (use idempotency keys) |
| Context Hygiene | Summarize long observations to avoid token bloat |
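As a sketch of the pre-call validation requirement, here is a minimal hand-rolled validator; a production system would more likely use a full JSON Schema library, and the helper and schema shown are illustrative.

```python
def validate_args(args, schema):
    """Reject a tool call before execution if it violates the contract."""
    errors = []
    required = schema.get("required", [])
    props = schema.get("properties", {})
    for key in required:
        if key not in args:
            errors.append(f"missing required field: {key}")
    for key, value in args.items():
        if key not in props:
            # Equivalent of additionalProperties: false
            errors.append(f"unexpected field: {key}")
            continue
        spec = props[key]
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: {value!r} not in {spec['enum']}")
    return errors

schema = {
    "required": ["booking_id"],
    "properties": {
        "booking_id": {"type": "string"},
        "room_type": {"type": "string", "enum": ["standard", "deluxe", "suite"]},
    },
}

print(validate_args({"booking_id": "B1", "room_type": "penthouse"}, schema))
print(validate_args({"booking_id": "B1", "room_type": "suite"}, schema))
```

An empty error list means the call may proceed; anything else should be returned to the model as a structured error rather than executed.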

Example: Tool Middleware with Pre/Post Processing

Google’s GenAI Toolbox demonstrates pre- and post-processing middleware for enforcing business rules and enriching responses:

```python
from datetime import datetime

# wrap_tool_call and ToolMessage are provided by the agent framework in use
# (e.g., LangChain's middleware utilities) and are assumed to be in scope.

# Pre-processing: business rule enforcement
@wrap_tool_call
async def enforce_business_rules(request, handler):
    tool_call = request.tool_call
    args = tool_call["args"]

    # Enforce max stay duration (14 days)
    if tool_call["name"] == "update-hotel":
        start = datetime.fromisoformat(args["checkin_date"])
        end = datetime.fromisoformat(args["checkout_date"])
        if (end - start).days > 14:
            return ToolMessage(
                content="Error: Maximum stay duration is 14 days.",
                tool_call_id=tool_call["id"]
            )

    return await handler(request)

# Post-processing: response enrichment
@wrap_tool_call
async def enrich_response(request, handler):
    result = await handler(request)

    if isinstance(result, ToolMessage) and "Error" not in result.content:
        # Add loyalty points to successful bookings
        result.content = f"Booking Confirmed! You earned 500 Loyalty Points.\n{result.content}"

    return result
```

Schema Design Principles

The convergence of Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) reveals five foundational principles for schema design:

| Principle | Description | Why It Matters |
|---|---|---|
| Semantic Completeness | Descriptions should explain what and why, not just syntax | Models need context to choose correct tools |
| Explicit Action Boundaries | Clearly define what the tool can and cannot do | Prevents misuse and overreach |
| Failure Mode Documentation | Describe expected failure conditions | Enables graceful recovery |
| Progressive Disclosure | Layer complexity; expose details only when needed | Manages token budgets |
| Inter-Tool Relationships | Declare dependencies between tools | Enables multi-step workflows |

Part 4: The Model Context Protocol (MCP) – The USB-C for AI Tools

What Is MCP?

Introduced by Anthropic in November 2024, the Model Context Protocol (MCP) is an open standard designed to solve the “N-to-M” integration problem. Traditionally, if an AI application wanted to connect to ten different tools, it had to build ten unique, bespoke integrations. MCP standardizes this communication, allowing any compliant host to interact with any compliant server through a standardized set of primitives.

*Figure 3: MCP standardizes integration, reducing N-to-M complexity to N-to-1*

MCP Architecture

MCP divides responsibilities into three distinct roles:

| Role | Description | Example |
|---|---|---|
| Host | The AI application initiating connections | Claude Desktop, IDE plugin |
| Client | Maintains a 1:1 connection with a server | MCP client library |
| Server | Provides tools, resources, and prompts | GitHub MCP server, filesystem MCP server |

What MCP Does and Doesn’t Solve

MCP provides a specification for communication. It excels at standardization but does not provide:

  • OAuth 2.0 lifecycle management for 10,000 users
  • Rate limit handling when APIs return 429
  • SOC 2 compliance logs for every action
  • Authentication token storage and refresh

These execution-layer concerns must be handled by the application or a dedicated execution platform.

Real-World MCP Implementation: Microsoft Graph

The @frustrated/ms-graph-mcp package demonstrates a production-ready MCP server for Microsoft Graph:

```bash
# Initialize with OAuth 2.0 PKCE flow
bunx @frustrated/ms-graph-mcp init

# Run MCP server
bunx @frustrated/ms-graph-mcp run

# Manage permissions
bunx @frustrated/ms-graph-mcp permissions

# Revoke access
bunx @frustrated/ms-graph-mcp revoke
```

Security features include:

  • Token cache stored with 0600 permissions (owner read/write only)
  • HTTPS-only communication
  • Input validation on all requests
  • Output sanitization before passing to AI agents

Part 5: Tool Calling Across Major Providers

While tool-calling principles are universal, implementation details vary across providers. Here’s a comparison based on 2026 best practices:

| Dimension | OpenAI | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| Tool Definition Format | JSON Schema | JSON Schema with input_schema | JSON Schema in functionDeclarations |
| Tool Discovery | Manual tool loading | Tool Search (dynamic discovery for 30+ tools) | Via Vertex AI Agent Builder |
| Structured Outputs | JSON Schema enforcement at the API level | output_config.format (cannot combine with citations) | JSON Schema via config; combinable with tools |
| Grounding / Citations | Tool calling for retrieval | Citations API (structured source linkage) | Google Search grounding with groundingMetadata |
| Prompt Caching | Stable prefix at the beginning | Exact prefix match; documented cache breakpoints | Context caching for long-context workloads |
| Reasoning Controls | reasoning_effort parameter | Extended thinking; effort settings | thinkingLevel / thinkingBudget |

OpenAI Tool Calling

OpenAI’s tool calling uses the tools parameter with JSON Schema definitions:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
    tool_choice="auto"
)
```

Anthropic Tool Calling

Anthropic’s Claude uses the tools parameter with a similar structure:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }],
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}]
)
```

Google Gemini Tool Calling

Gemini uses functionDeclarations within the tools parameter:

```python
import google.generativeai as genai

model = genai.GenerativeModel('gemini-2.0-flash-exp')

response = model.generate_content(
    "What's the weather in San Francisco?",
    tools=[{
        "function_declarations": [{
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }]
    }]
)
```

Part 6: The Execution Gap – From Discovery to Production

Why Tool Discovery Doesn’t Equal Production Readiness

A critical insight from production deployments: “Knowing which tool to call is trivial compared to the infrastructure required to call it successfully.” The execution layer is where most engineering teams encounter challenges.

The Three Hidden Challenges

1. Per-User Authentication at Scale

In a demo, you store an API key in a .env file. In production, you have thousands of users who each need to connect their own Salesforce, GitHub, or Gmail accounts.

| Requirement | Implementation |
|---|---|
| OAuth client | Handles redirects, state parameters |
| Token storage | Secure, encrypted storage with user isolation |
| Refresh logic | Automatic refresh before expiry |
| Scope management | Per-user permission scopes |

2. API Heterogeneity and Reliability

APIs are brittle. Each has different:

  • Rate limits (429 responses require exponential backoff)
  • Pagination (LLMs see first page only; execution layer must aggregate)
  • Error formats (success requires parsing varied responses)
  • Authentication (OAuth, API keys, JWTs, Basic Auth)
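The backoff requirement for 429 responses can be sketched as follows; `RateLimitError` and `flaky_api` are stand-ins for a real client raising on HTTP 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response."""

def call_with_backoff(fn, max_retries=4, base_delay=0.01):
    """Retry on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # exhausted retries; surface the error
            # Delay doubles each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Simulated flaky API: fails twice with 429, then succeeds.
attempts = {"n": 0}
def flaky_api():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return {"status": "ok"}

print(call_with_backoff(flaky_api))
```

In a real client, the delay would also honor the server's `Retry-After` header when present.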

3. Agent Governance

If your agent has access to delete_repo in GitHub, who can call it? MCP provides the capability but doesn’t enforce the policy. Production requires:

  • RBAC/ABAC: Role-based and attribute-based access control
  • Scope Validation: Read vs. write permissions per tool
  • Audit Trails: Log every tool call with risk labels
  • Human-in-the-Loop: Approvals for sensitive actions

Production Readiness Checklist

| Component | Requirement | Risk of Neglect |
|---|---|---|
| Auth Management | Per-user OAuth token refresh & storage | Agents fail mid-task from expired tokens |
| Observability | Log every tool call, input, output | Impossible to debug failures |
| Rate Limiting | Exponential backoff & retry logic | Entire IP blocked by the API provider |
| Output Normalization | Standardize JSON from varied APIs | LLM confused by unstructured responses |
| Permissions | Scope validation (read vs. write) | Agent accidentally deletes data |
| Idempotency | Safe retries with idempotency keys | Duplicate actions on retry |

Part 7: Building Tool-Using Agents with Frameworks

LangChain + Composio Integration

The Composio Tool Router provides a managed execution layer that handles authentication, rate limiting, and tool discovery:

```python
import asyncio
from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient
from composio_langchain import ComposioToolSet, App

async def main():
    # Initialize Composio client
    composio_toolset = ComposioToolSet()

    # Create Tool Router session for Safetyculture
    session = composio_toolset.create_tool_router_session(
        user_id="user-123",
        apps=[App.SAFETYCULTURE]
    )

    # Create MCP client with Tool Router URL
    mcp_client = MultiServerMCPClient({
        "safetyculture": {
            "transport": "sse",
            "url": session.mcp_url
        }
    })

    # Get tools from MCP server
    tools = await mcp_client.get_tools()

    # Create LangChain agent
    agent = create_agent(
        model="gpt-4o",
        tools=tools,
        system_prompt="You are a safety management assistant..."
    )

    # Run agent with conversation history
    response = await agent.ainvoke({
        "messages": [{"role": "user", "content": "Show inspections updated this week"}]
    })
    return response

if __name__ == "__main__":
    asyncio.run(main())
```

Google ADK with Interactions API

Google’s Agent Development Kit (ADK) now supports the Interactions API for stateful, multi-turn tool-using workflows:

```python
from google.adk.agents.llm_agent import Agent
from google.adk.models.google_llm import Gemini
from google.adk.tools.google_search_tool import GoogleSearchTool

# get_current_weather is a custom function tool assumed to be defined elsewhere.
root_agent = Agent(
    model=Gemini(
        model="gemini-2.5-flash",
        use_interactions_api=True,  # Enable Interactions API
    ),
    name="interactions_test_agent",
    tools=[
        GoogleSearchTool(bypass_multi_tools_limit=True),
        get_current_weather,
    ],
)
```

The Interactions API provides:

  • Unified Model & Agent Access: Same endpoint for models or built-in agents
  • Simplified State Management: Offload conversation history with previous_interaction_id
  • Background Execution: Support for long-running tasks
  • Native Thought Handling: Explicit modeling of reasoning chains

Open Responses Specification

OpenAI’s Open Responses specification standardizes agentic workflows across providers. Key concepts include:

| Concept | Description |
|---|---|
| Items | Atomic units: messages, function calls, reasoning traces |
| Reasoning Type | Exposes model thinking in a service-controlled format |
| Internal Tools | Executed in service infrastructure (retrieval, summarization) |
| External Tools | Executed in developer code; the service pauses for the response |

The specification has early support from Hugging Face, OpenRouter, Vercel, LM Studio, Ollama, and vLLM.


Part 8: Security and Governance

Defense in Depth for Tool-Using Agents

Treat all external content as untrusted and defend in layers:

| Layer | Control |
|---|---|
| Input | Delimit and sanitize; use allowlists |
| Tool Permissions | Least-privilege credentials; validate arguments |
| Execution | Isolated sandboxes (containers, gVisor, Firecracker, seccomp) |
| Output | Sanitize before passing back to the LLM |
| Audit | Log every call; enable human-in-the-loop for sensitive actions |

Authentication and Credential Management

Production systems require:

  • Per-User OAuth: Handle redirects, store refresh tokens securely
  • Automatic Refresh: Refresh tokens 5 minutes before expiry
  • Secrets Vault: Never store credentials in code or environment variables
  • Scope Isolation: Different tokens for different permission levels
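The automatic-refresh rule above can be sketched as a simple expiry check; the token dict shape and the stubbed refresh are assumptions for illustration:

```python
import time

REFRESH_MARGIN_S = 5 * 60  # refresh 5 minutes before expiry

def needs_refresh(expires_at, now=None, margin=REFRESH_MARGIN_S):
    """True if the access token should be refreshed proactively."""
    now = time.time() if now is None else now
    return expires_at - now <= margin

def get_access_token(token):
    """Return a valid token, refreshing just before expiry (refresh is stubbed)."""
    if needs_refresh(token["expires_at"]):
        # A real implementation would call the provider's token endpoint
        # with the stored refresh token, inside the secrets vault boundary.
        token = {**token, "access_token": "refreshed", "expires_at": time.time() + 3600}
    return token

stale = {"access_token": "old", "expires_at": time.time() + 60}  # expires in 1 minute
fresh = get_access_token(stale)
print(fresh["access_token"])
```

Refreshing on a margin rather than on failure means agents never start a long tool call with a token about to expire mid-task.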

OWASP AI Agent Security Guidance

Key recommendations from OWASP’s AI Agent Security Cheat Sheet:

  1. Validate all tool inputs before execution
  2. Implement rate limiting per tool and per user
  3. Use allowlists for tool availability
  4. Log everything for forensic analysis
  5. Red-team regularly to probe jailbreaks and exfiltration

Part 9: Evaluation and Observability

Metrics That Matter

Instrument agents like production services:

| Metric | Description | Target |
|---|---|---|
| Task Success Rate | End-to-end completion rate | >80% |
| Tool-Call Accuracy | Correct tool selection and parameters | >90% |
| Tool Success Rate | Successful API execution | >95% |
| Retrieval Faithfulness | Grounding in retrieved context | >85% |
| Latency | Time from prompt to response | <5 seconds |
| Cost per Task | Token consumption × model pricing | Depends on use case |

OpenTelemetry for Agent Observability

Create spans for each stage of the tool-calling loop:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.tool_calling")

with tracer.start_as_current_span("tool_discovery") as span:
    tools = discover_tools(query)
    span.set_attribute("tools_found", len(tools))

with tracer.start_as_current_span("llm_tool_call") as span:
    response = llm.generate(messages, tools)
    span.set_attribute("tool_chosen", response.tool_call.name)

with tracer.start_as_current_span("api_execution") as span:
    result = call_api(response.tool_call)
    span.set_attribute("api_status", result.status)
    span.set_attribute("api_latency_ms", result.latency)
```

Testing Frameworks

  • Scenario Suites: End-to-end traces with explicit pass/fail checks
  • LLM-as-Judge: For faithfulness evaluation when ground truth is scarce
  • Code-Based Checks: For reproducibility on deterministic tasks
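A code-based check over a recorded trace might look like this; the trace shape and the `check_tool_call` helper are illustrative, not a specific framework's format:

```python
# A recorded trace from a scenario suite (shape is illustrative).
trace = {
    "tool_calls": [
        {"name": "get_weather", "arguments": {"location": "Paris", "unit": "celsius"}}
    ],
    "final_answer": "It is 18°C and cloudy in Paris.",
}

def check_tool_call(trace, expected_name, required_args):
    """Deterministic pass/fail: right tool was called with the right arguments."""
    calls = [c for c in trace["tool_calls"] if c["name"] == expected_name]
    if not calls:
        return False
    args = calls[0]["arguments"]
    return all(args.get(k) == v for k, v in required_args.items())

passed = check_tool_call(trace, "get_weather", {"location": "Paris"})
print("PASS" if passed else "FAIL")
```

Checks like this are cheap and reproducible, so they complement (rather than replace) LLM-as-judge evaluation of the free-text answer.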

Part 10: MHTECHIN’s Expertise in Tool-Using Agents

At MHTECHIN, we specialize in building production-grade AI agents with robust tool-calling capabilities. Our expertise spans:

  • Custom Tool Integration: Connecting agents to enterprise APIs, databases, and legacy systems
  • MCP Server Development: Building standardized interfaces for your internal tools
  • Authentication & Governance: OAuth flows, token management, and permission systems
  • Evaluation & Observability: Instrumentation, metrics, and testing frameworks

MHTECHIN’s solutions leverage state-of-the-art frameworks including LangChain, AutoGen, and custom MCP implementations to deliver agents that don’t just chat—they act.


Conclusion

Tool use is the defining capability that transforms LLMs from conversational interfaces into autonomous agents. By providing structured I/O, real-time data access, and action execution, tool calling bridges the gap between probabilistic reasoning and deterministic action.

Key Takeaways:

  • Tool calling is the I/O layer that enables agents to interact with external systems
  • The 6-step agentic loop covers tool discovery, tool definition, the user prompt, LLM prediction, execution, and the final response
  • Strict schemas with pre- and post-validation ensure reliable tool contracts
  • MCP standardizes integration but doesn’t solve authentication or governance
  • The execution gap—auth, rate limits, pagination—is where production complexity lives
  • Security requires defense in depth: least privilege, sandboxing, audit trails
  • Evaluation and observability are essential for production reliability

As the 2026 ecosystem matures, with standards like MCP and Open Responses reducing fragmentation, the barrier to building capable, secure tool-using agents continues to fall. The organizations that succeed will be those that invest not just in model capabilities, but in the execution infrastructure—authentication, observability, governance—that makes tool calling reliable at scale.


Frequently Asked Questions (FAQ)

Q1: What is tool calling in AI agents?

Tool calling (or function calling) is the mechanism that allows LLMs to output structured data—typically JSON—that instructs an external system to perform an action, rather than generating free text.

Q2: How does tool calling differ from prompting?

Prompts ask the model to generate text; tool calling asks the model to output structured instructions that systems can execute. Tool calling enables real-time data access and action execution.

Q3: What is the Model Context Protocol (MCP)?

MCP is an open standard introduced by Anthropic in 2024 that standardizes how AI applications connect to tools and services—the “USB-C for AI.”

Q4: What are the key steps in a tool-calling loop?

The modern 6-step loop includes: Tool Discovery, Tool Definition, User Prompt, LLM Prediction, Execution, and Final Response.

Q5: What security considerations exist for tool-using agents?

Production requirements include per-user OAuth, least-privilege credentials, sandboxed execution, input sanitization, rate limiting, and immutable audit trails.

Q6: How do I choose between OpenAI, Anthropic, and Gemini for tool calling?

OpenAI offers robust JSON Schema enforcement; Anthropic provides Tool Search for large tool catalogs; Gemini integrates tightly with Google Search grounding. Selection depends on your tool catalog size and grounding needs.

Q7: What is the “execution gap”?

The gap between tool discovery (knowing which tool to call) and production execution (handling auth, rate limits, pagination, and governance). This is where most engineering complexity lies.

Q8: How do I evaluate tool-calling performance?

Track task success rate, tool-call accuracy, tool success rate, retrieval faithfulness, latency, and cost. Use OpenTelemetry for distributed tracing.


By Vaishnavi Patil
