Embeddings & RAG

pgvector, similarity search, retrieval-augmented generation.

Your customer support docs are 800 pages. Nobody is reading them. You want a chatbot that answers questions using those docs, citing the right paragraphs. You cannot just paste 800 pages into every prompt (your wallet would file for divorce). The trick is retrieval-augmented generation (RAG): fetch only the relevant snippets, then ask the model. And to fetch by meaning instead of by keyword, you need embeddings.

What is an embedding?

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Typically 768, 1536, or 3072 floating-point numbers. You generate it by sending text to an embedding model (OpenAI's text-embedding-3-small, Cohere's embed-v4, Voyage's voyage-3, and friends).

The magic property: texts with similar meanings produce vectors that point in similar directions. "The cat sat on the mat" and "A feline rested on the rug" end up close together in vector space. "PostgreSQL replication lag" ends up far away from both.

import { embed } from "ai";

const { embedding } = await embed({
  model: "openai/text-embedding-3-small",
  value: "The cat sat on the mat.",
});

console.log(embedding.length); // 1536
console.log(embedding.slice(0, 4));
// [0.0123, -0.0481, 0.0192, 0.0067, ...]

Embeddings are not interpretable

Do not try to read individual numbers. Dimension 472 does not mean "catness." The semantics live in the geometry of the whole vector, not in any one axis. Treat each vector as a black box.

Cosine similarity: comparing two vectors

To compare two embeddings you measure the cosine of the angle between them. 1.0 means identical direction (very similar). 0.0 means perpendicular (unrelated). -1.0 means opposite. In practice, scores for natural-language text mostly fall somewhere between 0.2 and 0.9, though the exact range depends on the embedding model.

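function cosineSimilarity(a: number[], b: number[]): number {
  // Vectors from different models (or dimensions) are not comparable.
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; // dot product
    na += a[i] * a[i];  // squared length of a
    nb += b[i] * b[i];  // squared length of b
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  // A zero vector has no direction; return 0 instead of NaN.
  return denom === 0 ? 0 : dot / denom;
}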

// "What time does the store close?" vs corpus of FAQ entries
// Higher score = better match.
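
To make the "query vs corpus" idea concrete, here is a minimal sketch that ranks a tiny FAQ by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and `rankBySimilarity` and the FAQ entries are hypothetical names used only for illustration.

```typescript
// Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
type Entry = { text: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  // A zero vector has no direction; treat it as unrelated instead of returning NaN.
  return denom === 0 ? 0 : dot / denom;
}

// Score every entry against the query and return the best matches first.
function rankBySimilarity(query: number[], entries: Entry[], topK: number) {
  return entries
    .map((entry) => ({
      text: entry.text,
      score: cosineSimilarity(query, entry.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const faq: Entry[] = [
  { text: "Opening hours", embedding: [0.9, 0.1, 0.0] },
  { text: "Return policy", embedding: [0.1, 0.9, 0.1] },
  { text: "Shipping costs", embedding: [0.0, 0.2, 0.9] },
];

// Pretend this is the embedding of "What time does the store close?"
const queryEmbedding = [0.8, 0.2, 0.1];

const ranked = rankBySimilarity(queryEmbedding, faq, 2);
console.log(ranked.map((r) => r.text));
// [ 'Opening hours', 'Return policy' ]
```

With real embeddings the loop is the same; only the vectors get longer. A vector database does this ranking for you, with an index so it does not have to score every row.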

The RAG pipeline in five steps

Every RAG system, from a 10-line script to a million-dollar SaaS, is some variation of this:

Chunk: split documents into small pieces (300 to 800 tokens each). Too big and the search returns vague matches; too small and you lose context.
Embed: run each chunk through an embedding model. You get one vector per chunk.
Store: save the vector + the original text + any metadata (source URL, section, date) into a vector database.
Retrieve: at query time, embed the user's question, then find the top-K most similar chunks.
Augment: paste those chunks into the prompt as context, then ask the LLM to answer using only that context.
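
The chunk step is the one you usually have to write yourself. Below is a minimal sketch of a chunker, under two loud assumptions: it counts whitespace-separated words as a rough stand-in for tokens, and it repeats a few words between neighbouring chunks (an overlap, which the steps above do not require). `chunkByWords` is a hypothetical name, not a library function.

```typescript
// Split text into chunks of at most `maxWords` words, with `overlap` words
// repeated between neighbours so a sentence cut at a boundary keeps some context.
// Assumption: words stand in for tokens; a real tokenizer would count differently.
function chunkByWords(text: string, maxWords = 500, overlap = 50): string[] {
  if (maxWords <= 0 || overlap < 0 || overlap >= maxWords) {
    throw new Error("Require maxWords > 0 and 0 <= overlap < maxWords");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = maxWords - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
    // Stop once this chunk reached the end, so no trailing chunk is pure overlap.
    if (start + maxWords >= words.length) break;
  }
  return chunks;
}

// Twelve words, chunks of five, two words shared between neighbours.
const sample = Array.from({ length: 12 }, (_, i) => "w" + i).join(" ");
const sampleChunks = chunkByWords(sample, 5, 2);
console.log(sampleChunks);
// [ 'w0 w1 w2 w3 w4', 'w3 w4 w5 w6 w7', 'w6 w7 w8 w9 w10', 'w9 w10 w11' ]
```

Splitting on headings or paragraphs first and only then by length tends to give cleaner chunks than a blind word count, but this is enough to get a pipeline running.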

Always cite

The whole point of RAG is grounding. Make the model output a citation (chunk ID, URL, page number) so users can verify the source. Models still hallucinate; citations let humans catch them.
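
Citations are only useful if they point at something you actually retrieved. A minimal sketch of a post-check, assuming the model cites with `[n]` markers and that `n` matches the numbering you used in the prompt; `checkCitations` and the `Source` shape are hypothetical.

```typescript
type Source = { id: number; url: string };

// Pull the [n] markers out of a model answer and split them into markers
// that map to a retrieved chunk and markers that do not.
function checkCitations(answer: string, sources: Source[]) {
  const pattern = /\[(\d+)\]/g;
  const cited: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    if (cited.indexOf(n) === -1) cited.push(n); // keep each marker once
  }

  const valid: Source[] = [];
  const unknown: number[] = [];
  for (const n of cited) {
    const source = sources.find((s) => s.id === n);
    if (source) valid.push(source);
    else unknown.push(n);
  }
  // `uncited` flags answers that made claims without citing anything.
  return { valid, unknown, uncited: cited.length === 0 };
}

const retrieved: Source[] = [
  { id: 1, url: "https://example.com/docs/cancel" },
  { id: 2, url: "https://example.com/docs/billing" },
];

const check = checkCitations(
  "You can cancel any time [1]. Refunds take five days [3].",
  retrieved,
);
console.log(check.valid.map((s) => s.url)); // [ 'https://example.com/docs/cancel' ]
console.log(check.unknown); // [ 3 ]
```

An unknown marker like `[3]` here is a cheap signal that the answer drifted from the context and deserves a retry or a warning in the UI.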

pgvector: Postgres as your vector database

You do not need a dedicated vector DB to start. pgvector is a Postgres extension that adds a vector column type and similarity operators. If you already run Postgres, you are one extension away from a vector database. No new infra.

schema.sql

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
  id          bigserial PRIMARY KEY,
  source      text NOT NULL,
  content     text NOT NULL,
  embedding   vector(1536) NOT NULL
);

-- HNSW index for fast nearest-neighbor search
CREATE INDEX docs_embedding_idx
  ON docs USING hnsw (embedding vector_cosine_ops);

The query you run at retrieval time:

-- <=> is the cosine distance operator (1 - cosine similarity).
-- Lower distance means more similar.
SELECT id, source, content, embedding <=> $1 AS distance
FROM docs
ORDER BY embedding <=> $1
LIMIT 5;
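
The `$1` parameter in the query above has to reach Postgres as a vector, and the `distance` column comes back as a distance, not a similarity. A minimal sketch of two small helpers, assuming you pass the vector in pgvector's text form (`[x,y,z]`) and cast it to `vector` in the query; `toVectorLiteral` and `distanceToSimilarity` are hypothetical names, not part of pgvector.

```typescript
// Format an embedding as pgvector text input, e.g. "[0.5,-0.25,1]".
// Checking the dimension up front gives a clearer error than a failed INSERT
// against a vector(1536) column.
function toVectorLiteral(embedding: number[], expectedDims: number): string {
  if (embedding.length !== expectedDims) {
    throw new Error(
      "Expected " + expectedDims + " dimensions, got " + embedding.length,
    );
  }
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("Embedding contains NaN or Infinity");
  }
  return "[" + embedding.join(",") + "]";
}

// <=> returns cosine distance (1 - cosine similarity), so invert it
// when you want to show or threshold on similarity.
function distanceToSimilarity(distance: number): number {
  return 1 - distance;
}

console.log(toVectorLiteral([0.5, -0.25, 1], 3)); // "[0.5,-0.25,1]"
console.log(distanceToSimilarity(0.25)); // 0.75
```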

A minimal RAG flow in TypeScript

rag.ts

import { embed, generateText } from "ai";
import { sql } from "./db"; // your postgres client

// 1. INDEX TIME, one-off
async function indexDoc(source: string, content: string) {
  const chunks = chunk(content, 500); // your own chunker
  for (const piece of chunks) {
    const { embedding } = await embed({
      model: "openai/text-embedding-3-small",
      value: piece,
    });
    // pgvector accepts the '[1,2,3]' text form; the explicit cast makes the type unambiguous.
    await sql`
      INSERT INTO docs (source, content, embedding)
      VALUES (${source}, ${piece}, ${JSON.stringify(embedding)}::vector)
    `;
  }
}

// 2. QUERY TIME, every user question
async function ask(question: string) {
  const { embedding } = await embed({
    model: "openai/text-embedding-3-small",
    value: question,
  });

  const hits = await sql`
    SELECT content, source
    FROM docs
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT 4
  `;

  const context = hits
    .map((h, i) => "[" + (i + 1) + "] " + h.content)
    .join("\n\n");

  const { text } = await generateText({
    model: "anthropic/claude-opus-4-5",
    system:
      "Answer using only the provided context. Cite sources with [n].",
    prompt: "Context:\n" + context + "\n\nQuestion: " + question,
  });

  return { text, sources: hits.map((h) => h.source) };
}

Vector search vs full-text search

Vector search shines at semantic queries: "how do I cancel my plan" finds "subscription termination". Full-text search (Postgres' built-in tsvector, Elasticsearch, Meilisearch) shines at exact terms: SKUs, product codes, names, error messages.

Use vector when phrasing varies a lot and synonyms matter (support docs, knowledge bases, recommendations).
Use full-text when users search by exact tokens (legal documents, code symbols, product catalogs).
Use hybrid when you want both. Many production RAG systems combine vector and keyword scores, often with a reranker on top.
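
How you combine the two result lists is up to you. One common option is reciprocal rank fusion (RRF), which only looks at each document's position in each list, so you never have to put cosine distances and keyword scores on the same scale. The sketch below is one way to do it, not the method any particular system uses; the document IDs are made up and `k = 60` is a conventional default rather than a tuned value.

```typescript
// Reciprocal rank fusion: every list contributes 1 / (k + rank) to each ID it contains.
// IDs that rank well in several lists float to the top.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const scores: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores[id] = (scores[id] ?? 0) + 1 / (k + rank);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical results for the same query from the two search methods.
const vectorHits = ["doc-cancel", "doc-billing", "doc-refund"];
const keywordHits = ["doc-refund", "doc-cancel", "doc-sku-123"];

const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused.map((f) => f.id));
// [ 'doc-cancel', 'doc-refund', 'doc-billing', 'doc-sku-123' ]
```

The two documents that appear in both lists end up ahead of the ones that appear in only one, which is the behaviour you usually want from hybrid search.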

RAG is not a magic bullet

Bad chunks in, bad answers out. The hard work in RAG is not the embedding step. It is chunking, deduplication, source quality, reranking, and prompt engineering around citations. Plan for several rounds of evaluation.
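
A simple way to start evaluating retrieval is to write down a handful of questions, note which chunk should answer each one, and measure how often that chunk shows up in the top K. The sketch below assumes you have already run retrieval and recorded the returned chunk IDs; `hitRateAtK`, the `EvalCase` shape, and the sample data are all hypothetical.

```typescript
type EvalCase = {
  question: string;
  expectedId: string;     // the chunk that should answer the question
  retrievedIds: string[]; // what retrieval returned, best match first
};

// Fraction of questions whose expected chunk appears in the top-k results.
function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(
    (c) => c.retrievedIds.slice(0, k).indexOf(c.expectedId) !== -1,
  ).length;
  return hits / cases.length;
}

const cases: EvalCase[] = [
  { question: "How do I cancel?", expectedId: "c1", retrievedIds: ["c1", "c7", "c3", "c9"] },
  { question: "When am I billed?", expectedId: "c2", retrievedIds: ["c5", "c2", "c8", "c1"] },
  { question: "Can I get a refund?", expectedId: "c3", retrievedIds: ["c4", "c6", "c1", "c3"] },
  { question: "Do you ship abroad?", expectedId: "c4", retrievedIds: ["c9", "c8", "c7", "c6"] },
];

console.log(hitRateAtK(cases, 1)); // 0.25
console.log(hitRateAtK(cases, 2)); // 0.5
console.log(hitRateAtK(cases, 4)); // 0.75
```

Re-run the same cases after every change to chunking or indexing so you can tell whether the change helped.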

Recap

An embedding is a vector that captures meaning. Similar texts produce similar vectors.
Cosine similarity measures how aligned two vectors are. Higher = more semantically related.
RAG = chunk + embed + store + retrieve + augment. You feed retrieved snippets to the LLM as context.
pgvector turns Postgres into a vector database with one extension and an HNSW index.
Use vector search for semantic queries, full-text for exact terms, and hybrid when you can afford the complexity.