Embeddings & RAG
pgvector, similarity search, retrieval-augmented generation.
Your customer support docs are 800 pages. Nobody is reading them. You want a chatbot that answers questions using those docs, citing the right paragraphs. You cannot just paste 800 pages into every prompt (your wallet would file for divorce). The trick is retrieval-augmented generation (RAG): fetch only the relevant snippets, then ask the model. And to fetch by meaning instead of by keyword, you need embeddings.
What is an embedding?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Typically 768, 1536, or 3072 floating-point numbers. You generate it by sending text to an embedding model (OpenAI's text-embedding-3-small, Cohere's embed-v4, Voyage's voyage-3, and friends).
The magic property: texts with similar meanings produce vectors that point in similar directions. "The cat sat on the mat" and "A feline rested on the rug" end up close together in vector space. "PostgreSQL replication lag" ends up far away from both.
import { embed } from "ai";
const { embedding } = await embed({
  model: "openai/text-embedding-3-small",
  value: "The cat sat on the mat.",
});
console.log(embedding.length); // 1536
console.log(embedding.slice(0, 4));
// e.g. [0.0123, -0.0481, 0.0192, 0.0067] (illustrative values)
Cosine similarity: comparing two vectors
To compare two embeddings, you measure the cosine of the angle between them. 1.0 means identical direction (very similar). 0.0 means perpendicular (unrelated). -1.0 means opposite. With text embeddings, most real-world scores sit somewhere between roughly 0.2 and 0.9, though the exact range depends on the model.
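Written out, the quantity being measured is the dot product of the two vectors divided by the product of their lengths. The symbols a, b, and θ are notation introduced here for illustration, and the worked numbers use hand-picked 2-D vectors rather than real embeddings.

```latex
\[
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
\]

% Worked example with a = (1, 0) and b = (1, 1):
\[
\cos\theta
  = \frac{1\cdot 1 + 0\cdot 1}{\sqrt{1^2 + 0^2}\,\sqrt{1^2 + 1^2}}
  = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

Because the lengths are divided out, only direction matters: scaling either vector by a positive factor leaves the score unchanged.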
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
// "What time does the store close?" vs corpus of FAQ entries
// Higher score = better match.
The RAG pipeline in five steps
Every RAG system, from a 10-line script to a million-dollar SaaS, is some variation of this:
- Chunk: split documents into small pieces (300 to 800 tokens each). Too big and the search returns vague matches; too small and you lose context.
- Embed: run each chunk through an embedding model. You get one vector per chunk.
- Store: save the vector + the original text + any metadata (source URL, section, date) into a vector database.
- Retrieve: at query time, embed the user's question, then find the top-K most similar chunks.
- Augment: paste those chunks into the prompt as context, then ask the LLM to answer using only that context.
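The five steps above can be sketched end to end in memory. This is a minimal illustration, not a production design: `toyEmbed` is a hypothetical stand-in that counts vocabulary words (it captures word overlap, not meaning), and `chunkByWords`, `buildStore`, `retrieve`, and `buildPrompt` are names invented here. A real system would call an embedding model and a vector database instead.

```typescript
type Doc = { source: string; content: string };
type StoredChunk = { source: string; content: string; embedding: number[] };
type Store = { vocab: string[]; chunks: StoredChunk[] };

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// 1. Chunk: fixed-size word windows (real chunkers usually count tokens).
function chunkByWords(text: string, maxWords: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// 2. Embed (toy): one dimension per vocabulary word, value = occurrence count.
function toyEmbed(text: string, vocab: string[]): number[] {
  const tokens = tokenize(text);
  return vocab.map((word) => tokens.filter((t) => t === word).length);
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom; // zero vector: treat as unrelated
}

// 3. Store: keep vector + text + metadata together (here, a plain array).
function buildStore(docs: Doc[], maxWords: number): Store {
  const pieces: Doc[] = [];
  for (const doc of docs) {
    for (const content of chunkByWords(doc.content, maxWords)) {
      pieces.push({ source: doc.source, content });
    }
  }
  const vocabSet = new Set<string>();
  for (const piece of pieces) {
    for (const token of tokenize(piece.content)) vocabSet.add(token);
  }
  const vocab = Array.from(vocabSet);
  const chunks = pieces.map((p) => ({ ...p, embedding: toyEmbed(p.content, vocab) }));
  return { vocab, chunks };
}

// 4. Retrieve: embed the question, rank chunks by similarity, keep the top K.
function retrieve(question: string, store: Store, k: number): StoredChunk[] {
  const q = toyEmbed(question, store.vocab);
  return store.chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(q, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}

// 5. Augment: number the chunks so the model can cite them as [n].
function buildPrompt(question: string, hits: StoredChunk[]): string {
  const context = hits.map((h, i) => `[${i + 1}] ${h.content}`).join("\n\n");
  return `Context:\n${context}\n\nQuestion: ${question}`;
}

const store = buildStore(
  [
    { source: "billing.md", content: "To cancel your subscription open Settings and choose Cancel plan." },
    { source: "hours.md", content: "The store closes at 9 pm on weekdays and 6 pm on weekends." },
  ],
  50,
);
const question = "What time does the store close?";
const hits = retrieve(question, store, 1);
console.log(hits[0].source); // hours.md
console.log(buildPrompt(question, hits));
```

The toy embedding only matches shared words ("the", "store"), which is exactly the limitation real embeddings remove: a question phrased as "when do you shut?" would score zero here.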
pgvector: Postgres as your vector database
You do not need a dedicated vector DB to start. pgvector is a Postgres extension that adds a vector column type and similarity operators. If you already have Postgres, you have a vector database. No new infra.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE docs (
  id bigserial PRIMARY KEY,
  source text NOT NULL,
  content text NOT NULL,
  embedding vector(1536) NOT NULL
);
-- HNSW index for fast nearest-neighbor search
CREATE INDEX docs_embedding_idx
  ON docs USING hnsw (embedding vector_cosine_ops);
The query you run at retrieval time:
-- <=> is the cosine distance operator (1 - cosine similarity).
-- Lower distance means more similar.
SELECT id, source, content, embedding <=> $1 AS distance
FROM docs
ORDER BY embedding <=> $1
LIMIT 5;
A minimal RAG flow in TypeScript
import { embed, generateText } from "ai";
import { sql } from "./db"; // your postgres client
// 1. INDEX TIME, one-off
async function indexDoc(source: string, content: string) {
  const chunks = chunk(content, 500); // your own chunker
  // The loop variable is named `piece` so it does not shadow chunk()
  for (const piece of chunks) {
    const { embedding } = await embed({
      model: "openai/text-embedding-3-small",
      value: piece,
    });
    // pgvector accepts the "[x,y,...]" text form, which JSON.stringify produces
    await sql`
      INSERT INTO docs (source, content, embedding)
      VALUES (${source}, ${piece}, ${JSON.stringify(embedding)}::vector)
    `;
  }
}
// 2. QUERY TIME, every user question
async function ask(question: string) {
  // Use the same embedding model as at index time so vectors are comparable
  const { embedding } = await embed({
    model: "openai/text-embedding-3-small",
    value: question,
  });
  // Nearest chunks first: <=> is cosine distance, so ascending order
  const hits = await sql`
    SELECT content, source
    FROM docs
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT 4
  `;
  // Number each chunk so the model can cite it as [n]
  const context = hits
    .map((h, i) => "[" + (i + 1) + "] " + h.content)
    .join("\n\n");
  const { text } = await generateText({
    model: "anthropic/claude-opus-4-5",
    system:
      "Answer using only the provided context. Cite sources with [n].",
    prompt: "Context:\n" + context + "\n\nQuestion: " + question,
  });
  return { text, sources: hits.map((h) => h.source) };
}
Vector search vs full-text search
Vector search shines at semantic queries: "how do I cancel my plan" finds "subscription termination". Full-text search (Postgres' built-in tsvector, Elasticsearch, Meili) shines at exact terms: SKUs, product codes, names, error messages.
- Use vector when phrasing varies a lot and synonyms matter (support docs, knowledge bases, recommendations).
- Use full-text when users search by exact tokens (legal documents, code symbols, product catalogs).
- Use hybrid when you want both. Production RAG systems often combine vector and keyword results, then apply a reranker on top.
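One common way to combine the two result lists is reciprocal rank fusion (RRF), which merges ranks rather than raw scores, so the two systems' score scales never need to be reconciled. The sketch below is an illustration under assumptions, not necessarily the fusion the text has in mind: the function name, the document IDs, and the constant k = 60 (a conventional default) are all chosen here, and a reranker would still run on the fused list afterward.

```typescript
type FusedHit = { id: string; score: number };

// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
// to every document it contains, where rank starts at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical result lists, best match first.
const vectorHits = ["doc-cancel", "doc-refund", "doc-pricing"];
const keywordHits = ["doc-sku-4471", "doc-cancel", "doc-refund"];

const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused.map((hit) => hit.id));
// [ 'doc-cancel', 'doc-refund', 'doc-sku-4471', 'doc-pricing' ]
```

Documents that appear in both lists rise to the top, while a strong hit from only one list (the exact SKU match) still outranks a weak hit from only the other.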
Recap
- An embedding is a vector that captures meaning. Similar texts produce similar vectors.
- Cosine similarity measures how aligned two vectors are. Higher = more semantically related.
- RAG = chunk + embed + store + retrieve + augment. You feed retrieved snippets to the LLM as context.
- pgvector turns Postgres into a vector database with one extension and an HNSW index.
- Use vector search for semantic queries, full-text for exact terms, and hybrid when you can afford the complexity.