Lesson 5: Async Programming for AI APIs
Modern AI applications are often I/O-bound, meaning they spend most of their time waiting for network requests to complete. This is especially true when your application needs to call multiple LLM APIs, databases, or other web services to fulfill a single user request.
This lesson covers asyncio, Python's native library for async programming, and aiohttp, the go-to library for async HTTP requests.
1. The "Why": Sync vs. Async
Imagine making five API calls that each take 1 second to complete.
- Synchronous (Sync) Code: Executes one call at a time. Total time = 5 seconds.

```
Call 1 (1s) -> Call 2 (1s) -> Call 3 (1s) -> Call 4 (1s) -> Call 5 (1s) = 5s
```
- Asynchronous (Async) Code: Starts all calls at roughly the same time and waits for them to complete concurrently. Total time = ~1 second (the time of the longest single call).

```
Call 1 (start)
Call 2 (start)
Call 3 (start)
Call 4 (start)
Call 5 (start)
... (all wait for 1s) ...
(all finish) = ~1s
```
For AI agents that might need to call a search API, a calculator tool, and an LLM in parallel, this difference is massive.
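The difference is easy to demonstrate without any network at all. The sketch below (with made-up names like fake_api_call) uses asyncio.sleep as a stand-in for a slow API call: five 0.2-second "calls" started together finish in roughly 0.2 seconds, not 1 second.

```python
import asyncio
import time

async def fake_api_call(name: str, delay: float) -> str:
    """Stand-in for a slow network call (hypothetical, for illustration)."""
    await asyncio.sleep(delay)  # yields control so the other calls can run
    return f"{name} done"

async def main() -> float:
    start = time.perf_counter()
    # Five "calls" run concurrently; total time tracks the longest, not the sum.
    results = await asyncio.gather(
        *(fake_api_call(f"call-{i}", 0.2) for i in range(5))
    )
    elapsed = time.perf_counter() - start
    print(results)
    print(f"elapsed: {elapsed:.2f}s")  # roughly 0.2s, not 1.0s
    return elapsed

if __name__ == "__main__":
    asyncio.run(main())
```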
2. Core Concepts: async, await, and asyncio
- async def: Defines a coroutine function, a special kind of function that can be paused and resumed.
- await: Used inside an async def function to call another coroutine. It essentially says, "Pause this function here, let other tasks run, and resume me when this result is ready."
- asyncio: Python's built-in library that manages the event loop, which schedules and runs all the async tasks.
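Put together, the three pieces look like this (a minimal sketch; greet is a hypothetical coroutine, not part of any library):

```python
import asyncio

async def greet(name: str) -> str:
    """A coroutine: 'async def' makes this function pausable and resumable."""
    # 'await' pauses greet() here; the event loop may run other tasks meanwhile.
    await asyncio.sleep(0.1)
    return f"Hello, {name}!"

async def main() -> list[str]:
    # Awaiting coroutines one after another runs them sequentially.
    first = await greet("Ada")
    second = await greet("Alan")
    return [first, second]

# asyncio.run() starts the event loop and runs main() to completion.
print(asyncio.run(main()))  # → ['Hello, Ada!', 'Hello, Alan!']
```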
3. aiohttp: The Async Alternative to requests
You can't use the standard requests library in async code because it's blocking. The most popular alternative is aiohttp, which is designed from the ground up to work with asyncio.
The key is the aiohttp.ClientSession, which lets you manage a pool of connections for making efficient, concurrent requests.
4. The Pattern: Concurrent API Calls with asyncio.gather
The most common pattern for making multiple concurrent API calls is:
1. Create a single aiohttp.ClientSession.
2. Define an async function that takes the session and a URL and performs a single API call.
3. Create a list of "tasks" by calling your async function for each URL you want to fetch.
4. Use asyncio.gather(*tasks) to run all the tasks concurrently.
5. Use asyncio.run(main_coroutine) to start the entire process.
Example: Calling Multiple LLM Endpoints Concurrently
Let's write a full, runnable example. We'll hit a public test API, `httpbin.org`, to simulate calling different services.
```python
# example-04-async-api-calls.py
import asyncio
import time

import aiohttp

# We'll use a test API that can simulate delays.
URLS = [
    "https://httpbin.org/delay/2",     # Simulates a 2-second response
    "https://httpbin.org/delay/1",     # Simulates a 1-second response
    "https://httpbin.org/get",         # A fast response
    "https://httpbin.org/status/404",  # An error response
]

async def fetch_url(session: aiohttp.ClientSession, url: str) -> dict:
    """A coroutine to fetch a single URL."""
    print(f"Starting fetch for {url}...")
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            # This raises a ClientResponseError for 4xx/5xx status codes.
            response.raise_for_status()
            print(f"Finished fetch for {url} with status {response.status}")
            # .json() is also a coroutine, so it must be awaited.
            return await response.json()
    except aiohttp.ClientResponseError as e:
        # Raised by raise_for_status(); this subclass has .status and .message.
        print(f"Error fetching {url}: {e.status} {e.message}")
        return {"error": True, "status": e.status, "url": url}
    except aiohttp.ClientError as e:
        # Connection-level errors (DNS failure, refused connection, etc.).
        print(f"Error fetching {url}: {e}")
        return {"error": True, "status": str(e), "url": url}
    except asyncio.TimeoutError:
        print(f"Timeout error fetching {url}")
        return {"error": True, "status": "Timeout", "url": url}

async def main():
    """The main coroutine to orchestrate all the concurrent calls."""
    start_time = time.time()

    # Create a single session to be reused for all requests.
    async with aiohttp.ClientSession() as session:
        # Create a list of coroutines. Nothing runs until they are awaited.
        tasks = [fetch_url(session, url) for url in URLS]

        # asyncio.gather runs all the tasks concurrently.
        # return_exceptions=True ensures one failed task doesn't stop the others.
        results = await asyncio.gather(*tasks, return_exceptions=True)

    end_time = time.time()
    print("\n--- All tasks complete ---")
    for result in results:
        if isinstance(result, dict) and result.get("error"):
            print(f"Failed Task: URL {result.get('url')}, Status: {result.get('status')}")
        else:
            # Process successful results (httpbin echoes the URL back in its JSON).
            print(f"Successful Task: URL {result.get('url')}")

    print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")
    print("(Note: This is close to the longest delay, not the sum of all delays!)")

if __name__ == "__main__":
    # This is the entry point that starts the asyncio event loop.
    asyncio.run(main())
```
When you run this script, you'll see that the total time is just over 2 seconds, proving that the calls ran concurrently rather than one after another. This is the power of async programming and a fundamental technique for building high-performance AI applications.