MODULE 1 - CHAPTER 5 ⏱️ 35 minutes 📖 2,400 words

Async Programming for AI APIs

Build high-performance AI applications with asyncio


Modern AI applications are often I/O-bound, meaning they spend most of their time waiting for network requests to complete. This is especially true when your application needs to call multiple LLM APIs, databases, or other web services to fulfill a single user request.

Asynchronous programming allows your application to handle many of these I/O-bound tasks concurrently, dramatically improving performance and responsiveness. This lesson introduces the fundamentals of asyncio, Python's native library for async programming, and aiohttp, the go-to library for async HTTP requests.

1. The "Why": Sync vs. Async

Imagine making five API calls that each take 1 second to complete.

  • Synchronous (Sync) Code: Makes each call one after another, waiting for each to finish before starting the next. Total time = 5 seconds (the sum of all calls).

```
Call 1 (1s) -> Call 2 (1s) -> Call 3 (1s) -> Call 4 (1s) -> Call 5 (1s) = 5s
```

  • Asynchronous (Async) Code: Starts all calls at roughly the same time and waits for them to complete concurrently. Total time = ~1 second (the time of the longest single call).

```
Call 1 (start) Call 2 (start) Call 3 (start) Call 4 (start) Call 5 (start) ... (all wait for 1s) ... (all finish) = ~1s
```

For AI agents that might need to call a search API, a calculator tool, and an LLM in parallel, this difference is massive.
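You can see this difference with a minimal sketch that uses `asyncio.sleep` to stand in for a 1-second API call (the function names here are illustrative, not from any library):

```python
import asyncio
import time

async def fake_api_call(i: int) -> int:
    # Stand-in for a network request that takes ~1 second.
    await asyncio.sleep(1)
    return i

async def sequential() -> None:
    # Awaiting each call before starting the next: total ~5s.
    for i in range(5):
        await fake_api_call(i)

async def concurrent() -> None:
    # Starting all five at once with gather: total ~1s.
    await asyncio.gather(*(fake_api_call(i) for i in range(5)))

start = time.time()
asyncio.run(sequential())
seq = time.time() - start

start = time.time()
asyncio.run(concurrent())
con = time.time() - start

print(f"Sequential: {seq:.1f}s, Concurrent: {con:.1f}s")
```

On a typical machine this prints roughly `Sequential: 5.0s, Concurrent: 1.0s`, matching the timelines above.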

2. Core Concepts: async, await, and asyncio
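The three building blocks fit together like this: `async def` defines a coroutine, `await` pauses it until a result is ready, and `asyncio.run` starts the event loop that drives everything. A minimal sketch (the `greet` function is just an illustration):

```python
import asyncio

# async def defines a coroutine function. Calling it does NOT run the
# body; it returns a coroutine object that the event loop can schedule.
async def greet(name: str) -> str:
    # await suspends this coroutine until the awaited operation
    # finishes, letting the event loop run other tasks in the meantime.
    await asyncio.sleep(0.1)
    return f"Hello, {name}!"

# asyncio.run() creates the event loop, runs the coroutine to
# completion, and returns its result.
result = asyncio.run(greet("world"))
print(result)  # Hello, world!
```

Note that `await` is only legal inside an `async def` function; at the top level of a script you hand your entry coroutine to `asyncio.run`.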

3. aiohttp: The Async Alternative to requests

You can't use the standard requests library in async code because it's blocking. The most popular alternative is aiohttp. It's designed from the ground up to work with asyncio.

The key is the aiohttp.ClientSession, which allows you to manage a pool of connections for making efficient, concurrent requests.
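In its simplest form, a session wraps one or more requests; both the response and its `.json()` body are awaited. A minimal sketch, using the same httpbin.org test API as the full example below:

```python
import asyncio

import aiohttp

async def fetch_json(url: str) -> dict:
    # One ClientSession per application (or per batch of requests):
    # it maintains a connection pool so repeated requests reuse sockets.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # .json() reads the body asynchronously, so it must be awaited.
            return await response.json()

data = asyncio.run(fetch_json("https://httpbin.org/get"))
print(data["url"])
```

Creating a new session per request works but throws away the connection pool; the pattern in the next section shares a single session across all calls.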

4. The Pattern: Concurrent API Calls with asyncio.gather

The most common pattern for making multiple concurrent API calls is:

1. Create a single aiohttp.ClientSession.
2. Define an async function that takes the session and a URL and performs a single API call.
3. Create a list of "tasks" by calling your async function for each URL you want to fetch.
4. Use asyncio.gather(*tasks) to run all the tasks concurrently.
5. Use asyncio.run(main_coroutine) to start the entire process.

Example: Calling Multiple LLM Endpoints Concurrently

Let's write a full, runnable example. We'll hit a public test API, `httpbin.org`, to simulate calling different services.

```python
# example-04-async-api-calls.py
import asyncio
import time

import aiohttp

# We'll use a test API that can simulate delays.
URLS = [
    "https://httpbin.org/delay/2",     # Simulates a 2-second response
    "https://httpbin.org/delay/1",     # Simulates a 1-second response
    "https://httpbin.org/get",         # A fast response
    "https://httpbin.org/status/404",  # An error response
]

async def fetch_url(session: aiohttp.ClientSession, url: str) -> dict:
    """A coroutine to fetch a single URL."""
    print(f"Starting fetch for {url}...")
    try:
        timeout = aiohttp.ClientTimeout(total=10)
        async with session.get(url, timeout=timeout) as response:
            # This raises ClientResponseError for 4xx/5xx status codes.
            response.raise_for_status()
            print(f"Finished fetch for {url} with status {response.status}")
            # .json() is also a coroutine, so it must be awaited.
            return await response.json()
    except aiohttp.ClientResponseError as e:
        print(f"Error fetching {url}: {e.status} {e.message}")
        return {"error": True, "status": e.status, "url": url}
    except aiohttp.ClientError as e:
        # Connection-level errors (DNS failure, refused connection, etc.)
        # don't carry an HTTP status.
        print(f"Error fetching {url}: {e}")
        return {"error": True, "status": "ClientError", "url": url}
    except asyncio.TimeoutError:
        print(f"Timeout error fetching {url}")
        return {"error": True, "status": "Timeout", "url": url}

async def main():
    """The main coroutine to orchestrate all the concurrent calls."""
    start_time = time.time()
    # Create a single session to be reused for all requests.
    async with aiohttp.ClientSession() as session:
        # Create a list of coroutines to run. These don't start yet.
        tasks = [fetch_url(session, url) for url in URLS]
        # asyncio.gather runs all the tasks concurrently.
        # return_exceptions=True ensures that one failed task doesn't stop the others.
        results = await asyncio.gather(*tasks, return_exceptions=True)

    end_time = time.time()
    print("\n--- All tasks complete ---")
    for result in results:
        if isinstance(result, dict) and result.get("error"):
            print(f"Failed Task: URL {result.get('url')}, Status: {result.get('status')}")
        elif isinstance(result, dict):
            # Process successful results.
            print(f"Successful Task: URL {result.get('url')}")
        else:
            # An exception that escaped fetch_url, captured by gather.
            print(f"Unexpected exception: {result!r}")

    print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")
    print("(Note: This is close to the longest delay, not the sum of all delays!)")

if __name__ == "__main__":
    # This is the entry point that starts the asyncio event loop.
    asyncio.run(main())
```

When you run this script, you'll see that the total time is just over 2 seconds, proving that the calls ran concurrently rather than one after another. This is the power of async programming and is a fundamental technique for building high-performance AI applications.