Vector Databases for AI: A Production-Ready Tutorial

Store and query embeddings efficiently

1. Brief Overview

Vector databases are a specialized type of database designed to store, manage, and query high-dimensional data, known as vector embeddings, which are at the core of many modern AI applications. Unlike traditional relational databases, which are optimized for structured data and exact matches, vector databases excel at similarity search. They let developers find the "nearest neighbors" of a given query vector, that is, the stored items whose embeddings are closest to it and therefore most semantically similar. This capability is crucial for building applications that understand context and relationships within data, rather than just matching keywords.
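
To make "nearest neighbors" concrete, here is a minimal sketch of brute-force similarity search in plain Python. The three-dimensional vectors and document names are made up for illustration (real embeddings have hundreds or thousands of dimensions), and production vector databases typically use approximate indexes rather than this exhaustive scan.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, top_k=2):
    """Score every stored vector against the query and return the top_k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" (hypothetical values, chosen by hand).
toy_vectors = {
    "sky": [0.9, 0.1, 0.0],
    "ocean": [0.7, 0.3, 0.1],
    "student": [0.0, 0.2, 0.9],
}

results = nearest_neighbors([1.0, 0.0, 0.0], toy_vectors, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.4f}")
```

The query vector points along the first axis, so "sky" and "ocean" rank above "student", whose vector is orthogonal to it.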

The importance of vector databases has surged with the rise of large language models (LLMs) and other deep learning models that generate embeddings. These embeddings are numerical representations of unstructured data like text, images, or audio. For instance, an LLM can convert a sentence into a vector that captures its meaning. By storing these vectors in a specialized database, you can perform powerful searches like finding articles similar to a given one, recommending products based on user preferences, or providing relevant context to an LLM for more accurate responses (a technique known as Retrieval-Augmented Generation or RAG).

This tutorial is for software engineers, data scientists, and AI practitioners who want to build applications that leverage semantic search and similarity. Whether you're developing a recommendation engine, a question-answering system, or a multimodal search application, understanding and using vector databases is becoming an essential skill. We will explore two popular vector databases that are available as managed services, Pinecone and Weaviate, with practical, hands-on examples to get you started.

2. Key Concepts

Before diving into the practical examples, let's clarify some fundamental concepts:

3. Practical Code Examples

We'll walk through setting up and using both Pinecone and Weaviate. For both examples, we'll use OpenAI's API to generate embeddings.

3.1. Pinecone Example

Pinecone is a popular, fully managed vector database.

Step 1: Get API Keys

  1. Pinecone: Sign up for a free account at pinecone.io. In the console, create a project; the script below creates the index if it does not already exist. Then go to "API Keys" to get your API key and, for pod-based indexes, your environment name.
  2. OpenAI: You'll need an OpenAI API key to generate embeddings. Get one from the OpenAI Platform.

Step 2: Installation

Install the necessary Python libraries:


pip install pinecone-client openai

Step 3: Complete Code

Create a Python file named pinecone_example.py:


import os
from pinecone import Pinecone, ServerlessSpec
import openai

# --- Configuration ---
# It's a best practice to use environment variables for API keys.
# You can set them in your terminal like this:
# export PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
# export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
# Optional (serverless index location; the defaults below are assumptions):
# export PINECONE_CLOUD="aws"
# export PINECONE_REGION="us-east-1"

PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
PINECONE_CLOUD = os.environ.get("PINECONE_CLOUD", "aws")
PINECONE_REGION = os.environ.get("PINECONE_REGION", "us-east-1")

INDEX_NAME = "my-first-index"
VECTOR_DIMENSION = 1536  # For OpenAI's "text-embedding-ada-002" model

# --- Initialize Clients ---
# The v3+ Pinecone client takes only the API key here; the index location
# is passed via `spec` when the index is created.
pinecone = Pinecone(api_key=PINECONE_API_KEY)
openai.api_key = OPENAI_API_KEY

# --- Create Index (if it doesn't exist) ---
if INDEX_NAME not in pinecone.list_indexes().names():
    print(f"Creating index: {INDEX_NAME}")
    pinecone.create_index(
        name=INDEX_NAME,
        dimension=VECTOR_DIMENSION,
        metric="cosine",
        # For a pod-based index, use PodSpec(environment=...) instead.
        spec=ServerlessSpec(cloud=PINECONE_CLOUD, region=PINECONE_REGION)
    )
    print("Index created successfully.")
else:
    print(f"Index {INDEX_NAME} already exists.")

# --- Connect to the Index ---
index = pinecone.Index(INDEX_NAME)
print(f"Connected to index: {INDEX_NAME}")
print(f"Index stats: {index.describe_index_stats()}")

# --- Generate Embeddings and Upsert Data ---
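def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newlines with spaces before embedding.
    text = text.replace("\n", " ")
    # openai>=1.0 replaced openai.Embedding.create with openai.embeddings.create.
    response = openai.embeddings.create(input=[text], model=model)
    return response.data[0].embedding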

data_to_upsert = [
    {"id": "doc1", "text": "The sky is blue."},
    {"id": "doc2", "text": "The sun is bright."},
    {"id": "doc3", "text": "The ocean is vast and blue."},
    {"id": "doc4", "text": "He is a bright student."}
]

print("\nUpserting data...")
for item in data_to_upsert:
    embedding = get_embedding(item["text"])
    index.upsert(
        vectors=[
            {
                "id": item["id"],
                "values": embedding,
                "metadata": {"text": item["text"]}
            }
        ]
    )
print("Data upserted successfully.")
print(f"Index stats after upsert: {index.describe_index_stats()}")


# --- Query the Index ---
query_text = "What color is the sky?"
query_embedding = get_embedding(query_text)

print(f"\nQuerying for: '{query_text}'")
query_results = index.query(
    vector=query_embedding,
    top_k=2,
    include_metadata=True
)

print("\nQuery Results:")
for result in query_results['matches']:
    print(f"  - ID: {result['id']}, Score: {result['score']:.4f}, Text: {result['metadata']['text']}")

# --- Clean up (optional) ---
# print(f"\nDeleting index: {INDEX_NAME}")
# pinecone.delete_index(INDEX_NAME)
# print("Index deleted.")

Step 4: Expected Output (first run; similarity scores are illustrative and will vary)


Creating index: my-first-index
Index created successfully.
Connected to index: my-first-index
Index stats: {'dimension': 1536, 'index_fullness': 0.0, 'namespaces': {}, 'total_vector_count': 0}

Upserting data...
Data upserted successfully.
Index stats after upsert: {'dimension': 1536, 'index_fullness': 0.0, 'namespaces': {'': {'vector_count': 4}}, 'total_vector_count': 4}

Querying for: 'What color is the sky?'

Query Results:
  - ID: doc1, Score: 0.9234, Text: The sky is blue.
  - ID: doc3, Score: 0.8571, Text: The ocean is vast and blue.
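
The script above sends one upsert request per document, which is fine for four records but slow at scale. A minimal batching sketch follows. The `chunked` helper, the placeholder records, and the batch size of 100 are illustrative assumptions rather than Pinecone requirements; the commented call shows where `index.upsert` from the script would go.

```python
from itertools import islice

def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Placeholder records shaped like the ones in the upsert loop (hypothetical data).
# Real vectors must have the same dimension as the index.
records = [
    {"id": f"doc{i}", "values": [0.0] * 4, "metadata": {"text": f"Document {i}"}}
    for i in range(250)
]

batch_sizes = []
for batch in chunked(records, batch_size=100):
    # With a live index, each batch would be a single request:
    # index.upsert(vectors=batch)
    batch_sizes.append(len(batch))

print(batch_sizes)  # prints [100, 100, 50]
```

Embedding requests can be batched in the same way, since the embeddings endpoint accepts a list of inputs.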

3.2. Weaviate Example

Weaviate is a powerful open-source vector database that can be self-hosted or used as a managed service. We'll use the managed Weaviate Cloud Service (WCS) for this example.

Step 1: Get API Keys and Cluster URL

  1. Weaviate: Sign up for a free account at Weaviate Cloud Console. Create a new cluster. From the cluster's "Details" page, you'll find your Cluster URL and API Key.
  2. OpenAI: You'll need an OpenAI API key.

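Step 2: Installation

The code below uses the v3 Python client API (weaviate.Client), so pin the client below version 4:


pip install "weaviate-client<4" openai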

Step 3: Complete Code

Create a Python file named weaviate_example.py:


import os
import weaviate
import openai

# --- Configuration ---
# export WEAVIATE_CLUSTER_URL="YOUR_WEAVIATE_CLUSTER_URL"
# export WEAVIATE_API_KEY="YOUR_WEAVIATE_API_KEY"
# export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"

WEAVIATE_CLUSTER_URL = os.environ.get("WEAVIATE_CLUSTER_URL")
WEAVIATE_API_KEY = os.environ.get("WEAVIATE_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

CLASS_NAME = "MyDocument"

# --- Initialize Clients ---
client = weaviate.Client(
    url=WEAVIATE_CLUSTER_URL,
    auth_client_secret=weaviate.AuthApiKey(api_key=WEAVIATE_API_KEY),
    additional_headers={
        "X-OpenAI-Api-Key": OPENAI_API_KEY
    }
)
openai.api_key = OPENAI_API_KEY

# --- Create Schema (Class Definition) ---
if not client.schema.exists(CLASS_NAME):
    print(f"Creating class: {CLASS_NAME}")
    class_obj = {
        "class": CLASS_NAME,
        "vectorizer": "text2vec-openai",
        "moduleConfig": {
            "text2vec-openai": {
                "model": "ada",
                "type": "text"
            }
        },
        "properties": [
            {
                "name": "text",
                "dataType": ["text"]
            }
        ]
    }
    client.schema.create_class(class_obj)
    print("Class created successfully.")
else:
    print(f"Class {CLASS_NAME} already exists.")

# --- Batch Import Data ---
data_to_import = [
    {"text": "The sky is blue."},
    {"text": "The sun is bright."},
    {"text": "The ocean is vast and blue."},
    {"text": "He is a bright student."}
]

print("\nImporting data...")
with client.batch as batch:
    for item in data_to_import:
        batch.add_data_object(
            data_object=item,
            class_name=CLASS_NAME
        )
print("Data imported successfully.")

# --- Query the Data (Near Text Search) ---
query_text = "What color is the sky?"

print(f"\nQuerying for: '{query_text}'")
response = (
    client.query
    .get(CLASS_NAME, ["text"])
    .with_near_text({"concepts": [query_text]})
    .with_limit(2)
    .do()
)

print("\nQuery Results:")
for item in response['data']['Get'][CLASS_NAME]:
    print(f"  - Text: {item['text']}")

# --- Clean up (optional) ---
# print(f"\nDeleting class: {CLASS_NAME}")
# client.schema.delete_class(CLASS_NAME)
# print("Class deleted.")

Step 4: Expected Output


Creating class: MyDocument
Class created successfully.

Importing data...
Data imported successfully.

Querying for: 'What color is the sky?'

Query Results:
  - Text: The sky is blue.
  - Text: The ocean is vast and blue.
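
The query loop in the script indexes straight into response['data']['Get'], which raises a KeyError if the request failed. A hedged sketch of a more defensive reader follows. It assumes the v3 client returns GraphQL-style dictionaries that carry an "errors" key on failure, and `extract_texts` is a hypothetical helper name, not part of the Weaviate client.

```python
def extract_texts(response, class_name):
    """Return the 'text' property of each hit, or raise if the query failed."""
    if "errors" in response:
        raise RuntimeError(f"Weaviate query failed: {response['errors']}")
    hits = response.get("data", {}).get("Get", {}).get(class_name) or []
    return [hit["text"] for hit in hits]

# Sample dictionaries shaped like the response in the script (hypothetical values).
ok_response = {"data": {"Get": {"MyDocument": [{"text": "The sky is blue."}]}}}
failed_response = {"errors": [{"message": "example error"}]}

print(extract_texts(ok_response, "MyDocument"))

try:
    extract_texts(failed_response, "MyDocument")
except RuntimeError as exc:
    print(exc)
```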

4. Best Practices

  1. Choose the Right Vector Dimension: The dimension of your index must match the output of your embedding model; a mismatch causes upserts and queries to fail. For example, OpenAI's text-embedding-ada-002 produces 1536-dimensional vectors.
  2. Normalize Your Vectors: If your index uses the dot-product metric, normalize your vectors to unit length so that scores behave like cosine similarity. With the cosine metric itself, vector length is already ignored, so normalization does not change the ranking. Many embedding models already output normalized vectors.
  3. Batch Your Inserts: When inserting a large amount of data, always use batching. Inserting vectors one by one is much slower due to network overhead. Both Pinecone and Weaviate provide batching mechanisms.
  4. Use Metadata Filtering: For production applications, you'll almost always need to filter by metadata. Design your data schema with this in mind. For example, store user IDs, product categories, or timestamps as metadata.
  5. Monitor Your Index: Keep an eye on the size and performance of your index. As your data grows, you may need to scale up your resources. Managed services like Pinecone and Weaviate provide dashboards for monitoring.
  6. Experiment with top_k: The top_k parameter determines how many results to return. A larger top_k means more work per query and larger responses. For applications like RAG, you might only need a small top_k (e.g., 3-5). For recommendation engines, you might want a larger top_k.
  7. Separate Indexes for Different Data Types: If you're working with different types of data (e.g., text and images), it's often a good idea to store them in separate indexes, as their embeddings will have different characteristics.
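
As a sketch of the metadata-filtering advice, the helper below assembles keyword arguments for Pinecone's `index.query`, using its MongoDB-style filter operators (`$eq`, `$gte`). The `category` and `year` fields and the `build_filtered_query` name are hypothetical; substitute whatever metadata your records actually carry.

```python
def build_filtered_query(vector, top_k=3, category=None, min_year=None):
    """Build kwargs for index.query, adding a metadata filter when requested."""
    kwargs = {"vector": vector, "top_k": top_k, "include_metadata": True}
    conditions = {}
    if category is not None:
        conditions["category"] = {"$eq": category}
    if min_year is not None:
        conditions["year"] = {"$gte": min_year}
    if conditions:
        # Multiple fields in one filter are combined with a logical AND.
        kwargs["filter"] = conditions
    return kwargs

query_kwargs = build_filtered_query(
    [0.1, 0.2, 0.3], top_k=5, category="science", min_year=2020
)
print(query_kwargs["filter"])

# With a live index: results = index.query(**query_kwargs)
```

Building the arguments separately from the call keeps the filter logic easy to unit-test without a network connection.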

5. Common Pitfalls to Avoid

  1. Mismatched Vector Dimensions
     - Error Message (Pinecone): Vector dimension 768 does not match the dimension of the index 1536
     - Fix: Ensure the dimension parameter when creating your index matches the output dimension of your embedding model.
  2. API Key Errors
     - Error Message (OpenAI): openai.AuthenticationError (openai.error.AuthenticationError in pre-1.0 releases of the library): Incorrect API key provided
     - Fix: Double-check that your API keys are correct and that you've set them as environment variables or passed them correctly to the client. Make sure there are no typos or extra spaces.
  3. Index Not Found
     - Error Message (Pinecone): pinecone.exceptions.NotFoundException: Index my-first-index not found
     - Fix: Make sure you've created the index before trying to connect to it. The provided code examples include a check to create the index if it doesn't exist.
  4. Data Type Mismatch in Metadata
     - Error Message: This can vary, but you might see errors related to filtering or data serialization.
     - Fix: Ensure the data types of your metadata match what the database expects (e.g., strings, numbers, booleans).
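
Two of the pitfalls above (dimension mismatches and unsupported metadata types) can be caught locally before any request is sent. The check below is a minimal sketch: `validate_record` is a hypothetical helper, and the allowed metadata types (strings, numbers, booleans, and lists of strings) are an assumption based on Pinecone's metadata rules, so confirm them against the documentation for your database.

```python
def validate_record(record, expected_dimension):
    """Raise ValueError if a record has the wrong dimension or a bad metadata type."""
    values = record["values"]
    if len(values) != expected_dimension:
        raise ValueError(
            f"Vector dimension {len(values)} does not match "
            f"index dimension {expected_dimension}"
        )
    for key, value in record.get("metadata", {}).items():
        is_scalar = isinstance(value, (str, int, float, bool))
        is_string_list = isinstance(value, list) and all(
            isinstance(v, str) for v in value
        )
        if not (is_scalar or is_string_list):
            raise ValueError(
                f"Unsupported metadata type for '{key}': {type(value).__name__}"
            )

good = {"id": "doc1", "values": [0.1] * 1536, "metadata": {"text": "The sky is blue."}}
validate_record(good, expected_dimension=1536)  # passes silently

bad = {"id": "doc2", "values": [0.1] * 768, "metadata": {}}
try:
    validate_record(bad, expected_dimension=1536)
except ValueError as exc:
    print(exc)
```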

6. Next Steps and Additional Resources