Computing Embeddings¶
The embeddings module generates semantic vector representations of papers and computes 2D projections, clusters, and a FAISS index for similarity search.
How It Works¶
The embedder:
- Extracts text from paper abstracts, titles, and keywords
- Generates embeddings using your chosen backend:
  - OpenAI `text-embedding-3-small` (default)
  - HuggingFace `BAAI/bge-small-en-v1.5`
  - Local fastembed (no API key)
- Computes 2D projections for visualization:
  - UMAP (recommended)
  - t-SNE
  - PCA
- Clusters papers using k-means
- Builds FAISS index for fast similarity search
- Exports complete dataset with all computed features
Basic Usage¶
Embed Papers¶
Computes embeddings and saves to papers_final.json.
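A minimal invocation, using the same input and output filenames as the examples further down this page:

```shell
# Embed all papers and write the enriched dataset
papertrail embed papers.json -o papers_final.json
```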
Choose Backend¶
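The exact flag for selecting a backend isn't shown elsewhere on this page; the sketch below assumes a `--backend` option mirroring the Python API's `backend=` parameter:

```shell
# Hypothetical --backend flag; values mirror the backend table below
papertrail embed papers.json -o papers_final.json --backend openai
papertrail embed papers.json -o papers_final.json --backend huggingface
papertrail embed papers.json -o papers_final.json --backend local
```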
Dry Run¶
Preview embeddings without saving:
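Assuming the conventional `--dry-run` flag name:

```shell
# Preview what would be embedded without writing any output
papertrail embed papers.json --dry-run
```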
Verbose Output¶
See detailed progress:
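Assuming the conventional `--verbose` flag name (the Python API exposes a matching `verbose=True` parameter):

```shell
papertrail embed papers.json -o papers_final.json --verbose
```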
Embedding Backends¶
Choose based on your needs:
| Feature | OpenAI | HuggingFace | Local |
|---|---|---|---|
| Model | `text-embedding-3-small` | `BAAI/bge-small-en-v1.5` | `BAAI/bge-small-en-v1.5` |
| Dimensions | 1536 | 384 | 384 |
| Quality | Excellent | Very Good | Very Good |
| Speed | Very Fast | Fast | Medium |
| Cost | $0.02/million tokens | Free | Free |
| API Key | `OPENAI_API_KEY` | `HF_TOKEN` | None |
| Requires Network | Yes | Yes | No |
OpenAI¶
Best overall quality and speed. Requires OPENAI_API_KEY.
Pros:
- Highest quality embeddings
- Very fast
- The 1536-dimensional space captures fine-grained semantics

Cons:
- Costs $0.02 per million tokens (roughly $1 for 10,000 papers)
- Requires an API key and an internet connection
- Subject to rate limits

Cost estimate:
- 100 papers: ~$0.01
- 1,000 papers: ~$0.10
- 10,000 papers: ~$1.00
HuggingFace Inference API¶
Good quality and free, but you'll need `HF_TOKEN` for higher rate limits.
Pros:
- Free (after creating a HuggingFace account)
- Good quality
- Can be used without an API key (at a limited rate)

Cons:
- Slightly lower quality than OpenAI
- Requires an internet connection
- Rate limits can be restrictive on the free tier
Local ONNX Models¶
No API key required. Best for privacy and offline use.
Pros:
- No API keys required
- Completely offline
- Best for sensitive data
- Full control over models

Cons:
- Slower (10-30 seconds for 100 papers)
- Uses more CPU/memory
- First run downloads a ~100 MB model
Output Format¶
Embedded papers include vectors and projections:
```json
{
  "papers": [
    {
      "doi": "10.1038/nature12373",
      "title": "Deep residual learning for image recognition",
      "abstract": "...",
      "embedding": [0.123, -0.456, 0.789, ...],
      "cluster": 3,
      "umap_x": 2.34,
      "umap_y": -1.56,
      "tsne_x": 45.2,
      "tsne_y": 32.1,
      "pca_x": 1.23,
      "pca_y": 0.45,
      "channel": "papers-ai",
      "timestamp": 1234567890
    }
  ],
  "metadata": {
    "embedding_backend": "openai",
    "embedding_model": "text-embedding-3-small",
    "embedding_dimensions": 1536,
    "n_clusters": 5,
    "clustering_method": "kmeans"
  }
}
```
New Fields¶
| Field | Type | Description |
|---|---|---|
| `embedding` | array | Semantic vector (1536 for OpenAI, 384 for others) |
| `cluster` | integer | Cluster assignment (0-based) |
| `umap_x`, `umap_y` | float | 2D UMAP coordinates |
| `tsne_x`, `tsne_y` | float | 2D t-SNE coordinates |
| `pca_x`, `pca_y` | float | 2D PCA coordinates |
Advanced Options¶
Number of Clusters¶
Control k-means clustering:
```shell
# Use 10 clusters instead of the default 5
papertrail embed papers.json -o papers_final.json --n-clusters 10
```
Projection Methods¶
Choose which 2D projections to compute (all by default):
```shell
# Only compute UMAP (faster)
papertrail embed papers.json -o papers_final.json --projections umap

# Multiple projections
papertrail embed papers.json -o papers_final.json --projections umap,tsne
```

Available: `umap`, `tsne`, `pca`
Batch Size¶
Control embedding batch size:
```shell
# Larger batches are faster but use more memory
papertrail embed papers.json -o papers_final.json --batch-size 128
```
Delay Between Requests¶
Add delays for rate-limited APIs:
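The delay flag isn't shown elsewhere on this page; the sketch below assumes a `--delay` option taking seconds between API requests:

```shell
# Hypothetical --delay flag: wait 1 second between embedding requests
papertrail embed papers.json -o papers_final.json --delay 1.0
```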
Text Fields¶
Control which fields are used for embeddings:
```shell
# Use abstract only (the default uses abstract + title)
papertrail embed papers.json -o papers_final.json --text-fields abstract

# Use multiple fields
papertrail embed papers.json -o papers_final.json --text-fields abstract,title,keywords
```
FAISS Index Options¶
Control FAISS index creation:
```shell
# Use a different distance metric (ip = inner product)
papertrail embed papers.json -o papers_final.json --faiss-metric ip

# Save the FAISS index separately
papertrail embed papers.json -o papers_final.json --faiss-path ./faiss_index/
```
Using Embeddings¶
Similarity Search¶
```python
from papertrail.embeddings import VectorStore
import numpy as np

# Load FAISS index
store = VectorStore()
store.load("faiss_index/")

# Search by text
results = store.search_text("transformer attention mechanisms", top_k=5)
for r in results:
    print(f"[{r['score']:.3f}] {r['title']}")

# Search by vector
query_vector = np.array([...])  # your embedding
results = store.search_vector(query_vector, top_k=5)
```
Analyze Clusters¶
```python
import json
from collections import Counter

with open("papers_final.json") as f:
    papers = json.load(f)["papers"]

# Papers per cluster
clusters = Counter(p["cluster"] for p in papers)
print(f"Clusters: {sorted(clusters.items())}")

# Papers in cluster 0
cluster_0 = [p for p in papers if p["cluster"] == 0]
print(f"Cluster 0: {len(cluster_0)} papers")
for p in cluster_0[:3]:
    print(f"  - {p['title']}")
```
Visualize Embeddings¶
```python
import json
import matplotlib.pyplot as plt

with open("papers_final.json") as f:
    papers = json.load(f)["papers"]

# Extract coordinates
x = [p["umap_x"] for p in papers]
y = [p["umap_y"] for p in papers]
clusters = [p["cluster"] for p in papers]

# Plot
plt.figure(figsize=(12, 8))
scatter = plt.scatter(x, y, c=clusters, cmap="tab10", alpha=0.6)
plt.colorbar(scatter, label="Cluster")
plt.xlabel("UMAP X")
plt.ylabel("UMAP Y")
plt.title("Paper Embeddings")
plt.tight_layout()
plt.savefig("embeddings_plot.png")
```
Python API¶
Use embeddings programmatically:
```python
from papertrail.embeddings import Embedder

# Create embedder
embedder = Embedder(backend="openai", verbose=True)

# Embed papers
papers = embedder.embed(
    papers,
    n_clusters=5,
    projections=["umap", "tsne"],
)

# Get embeddings
for paper in papers:
    vector = paper["embedding"]
    print(f"{paper['title']}: {len(vector)} dimensions")

# Build FAISS index
faiss_index = embedder.build_faiss_index(papers)
faiss_index.save("faiss_index/")
```
Tips & Tricks¶
Embedding Strategies¶
For semantic similarity: Use abstracts + titles with OpenAI for best results.
For speed: Use local backend with just abstracts.
For research: Experiment with multiple backends and compare cluster assignments.
Handling Large Datasets¶
For 10,000+ papers:
1. Use local backend for speed
2. Increase batch size to 256
3. Skip some projections: --projections umap
4. Run on GPU-enabled machine if available
Re-embedding Specific Papers¶
If you've updated some papers, re-embed them:
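One way to do this with the commands shown on this page; `papers_updated.json` is an assumed filename for the subset you changed, and the merge step is left as a sketch:

```shell
# Keep the updated papers in their own file and embed just that subset
papertrail embed papers_updated.json -o updated_final.json
# Then merge updated_final.json back into papers_final.json,
# e.g. with a short script keyed on each paper's DOI.
```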
Combining Embeddings¶
If you have multiple embedding sets, combine them:
```python
import json
import numpy as np

# Load both
with open("openai_final.json") as f:
    openai_papers = json.load(f)["papers"]
with open("local_final.json") as f:
    local_papers = json.load(f)["papers"]

# Combine embeddings (assumes both files hold the same papers in the
# same order). OpenAI vectors are 1536-d and local vectors are 384-d,
# so they cannot be averaged directly; concatenating the normalized
# vectors is a simple way to combine embeddings of different dimensions.
combined = []
for op, lp in zip(openai_papers, local_papers):
    op_embed = np.array(op["embedding"])
    lp_embed = np.array(lp["embedding"])
    # Normalize each vector to unit length
    op_embed = op_embed / np.linalg.norm(op_embed)
    lp_embed = lp_embed / np.linalg.norm(lp_embed)
    # Concatenate into one combined vector
    op["embedding"] = np.concatenate([op_embed, lp_embed]).tolist()
    combined.append(op)

# Save
with open("combined_final.json", "w") as f:
    json.dump({"papers": combined}, f)
```
Troubleshooting¶
Error: OPENAI_API_KEY not found¶
Set your API key:
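For example, in your shell (the key value is a placeholder):

```shell
export OPENAI_API_KEY="sk-your-key-here"
```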
Or use a different backend:
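Assuming a `--backend` flag (the exact flag name is not shown elsewhere on this page), the local backend needs no key:

```shell
# Hypothetical --backend flag; the local backend runs fully offline
papertrail embed papers.json -o papers_final.json --backend local
```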
Error: Out of memory¶
Reduce batch size:
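Using the `--batch-size` option shown earlier on this page:

```shell
# Smaller batches trade speed for lower memory use
papertrail embed papers.json -o papers_final.json --batch-size 32
```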
Or skip some projections:
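Using the `--projections` option shown earlier on this page:

```shell
# Compute only UMAP and skip the memory-hungry t-SNE projection
papertrail embed papers.json -o papers_final.json --projections umap
```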
Error: Rate limit exceeded¶
Add delays:
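Assuming a `--delay` option taking seconds between requests (the flag is not shown elsewhere on this page):

```shell
# Hypothetical --delay flag: back off between API calls
papertrail embed papers.json -o papers_final.json --delay 2.0
```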
Embeddings seem low quality¶
Check:
- You're using the right backend (OpenAI is best)
- Papers have abstracts (required for good embeddings)
- Text is in English (the default models are trained primarily on English text)
Try a different backend for comparison.
Next Steps¶
- Building the Dashboard — Visualize embeddings
- Searching Papers — Use FAISS for semantic search
- API Reference: Embeddings — Detailed Python API