
Embeddings API

The embeddings module computes vector representations of paper text using multiple backends. It also provides a FAISS-backed vector store for similarity search.

Quick Example

```python
from papertrail.embeddings import embed_texts, VectorStore

# Embed paper texts (auto-detects best backend)
texts = ["Attention is all you need", "Deep learning for genomics"]
embeddings = embed_texts(texts)
# → np.ndarray of shape (2, dim)

# Force a specific backend
embeddings = embed_texts(texts, backend="tfidf")        # no API keys needed
embeddings = embed_texts(texts, backend="openai")       # requires OPENAI_API_KEY
embeddings = embed_texts(texts, backend="huggingface")  # requires HF_TOKEN

# Semantic search with FAISS
store = VectorStore()
store.build(embeddings, paper_ids=[0, 1], metadata={0: {"title": "..."}, 1: {"title": "..."}})
results = store.search_text("transformer architectures", top_k=5)
```

Backend Priority

Auto-detection checks for available backends in this order:

| Priority | Backend | Env Variable | Model | Dimensions |
|---|---|---|---|---|
| 1 | openai | OPENAI_API_KEY | text-embedding-3-small | 1536 |
| 2 | huggingface | HF_TOKEN | BAAI/bge-small-en-v1.5 | 384 |
| 3 | local | (fastembed installed) | BAAI/bge-small-en-v1.5 | 384 |
| 4 | tfidf | (always available) | tfidf-svd-128 | 128 |
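The priority order above amounts to a chain of environment-variable and import checks. A minimal sketch of the idea (illustrative only — the helper name `detect_backend` is hypothetical, not papertrail's actual code):

```python
import importlib.util
import os

def detect_backend(env=os.environ) -> str:
    """Pick the highest-priority embedding backend that is usable."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("HF_TOKEN"):
        return "huggingface"
    # "local" only needs the fastembed package, no credentials
    if importlib.util.find_spec("fastembed") is not None:
        return "local"
    return "tfidf"  # pure scikit-learn fallback, always available
```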

Functions

embed_texts

```python
embed_texts(texts: list[str], backend: Backend | None = None, model: str | None = None) -> ndarray
```

Embed a list of texts using the specified (or auto-detected) backend.

Parameters:

| Name | Type | Default | Description |
|---|---|---|---|
| texts | list[str] | (required) | Texts to embed. Each string is typically "{title} {abstract}". |
| backend | "openai" \| "huggingface" \| "local" \| "tfidf" \| None | None | Embedding backend. Auto-detected if None. |
| model | str \| None | None | Model name override. Uses the backend's default if None. |

Returns: np.ndarray of shape (len(texts), dim), where dim depends on the backend/model.


VectorStore

```python
VectorStore(dimension: int | None = None)
```

FAISS-backed vector store for paper embeddings.

Stores embeddings alongside paper metadata for fast similarity search and retrieval.

Examples:

```python
>>> store = VectorStore(dimension=384)
>>> store.build(embeddings, paper_ids)
>>> results = store.search_text("deep learning genomics", top_k=5)
```

build

```python
build(embeddings: ndarray, paper_ids: list[int], metadata: dict[int, dict] | None = None) -> None
```

Build the FAISS index from embeddings.

| Parameter | Type | Description |
|---|---|---|
| embeddings | np.ndarray | Matrix of shape (n, dim). Automatically L2-normalized for cosine similarity. |
| paper_ids | list[int] | Integer ID for each paper (used as lookup keys). |
| metadata | dict[int, dict] \| None | Optional mapping of paper_id → {title, url, ...} merged into search results. |

Returns: None. Builds the FAISS index in place.

search

```python
search(query_embedding: ndarray, top_k: int = 10) -> list[dict]
```

Search for the top_k most similar papers.

| Parameter | Type | Description |
|---|---|---|
| query_embedding | np.ndarray | Query vector of shape (dim,) or (1, dim). |
| top_k | int | Number of nearest neighbors to return. |

Returns: list[dict]; each dict has paper_id (int), score (float, cosine similarity), plus any fields from metadata.

search_text

```python
search_text(query: str, top_k: int = 10, backend: Backend | None = None, model: str | None = None) -> list[dict]
```

Convenience method that embeds the query string, then searches the index.

| Parameter | Type | Description |
|---|---|---|
| query | str | Natural-language search query. |
| top_k | int | Number of results. |
| backend | Backend \| None | Embedding backend (must match the one used to build the index). |
| model | str \| None | Model name override. |

Returns: list[dict] in the same format as search().

save / load

```python
save(path: str | Path) -> None
load(path: str | Path) -> None
```

Save the index and metadata to disk, or load them back.

| Parameter | Type | Description |
|---|---|---|
| path | str \| Path | Directory to save to / load from. Contains index.faiss and metadata.json. |
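Because build L2-normalizes the stored vectors, inner-product search is cosine-similarity search. The scoring can be reproduced with plain NumPy (a minimal sketch of the math, not papertrail's FAISS-backed implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8)).astype("float32")  # 4 papers, dim=8
paper_ids = [10, 11, 12, 13]

# build(): L2-normalize so that inner product == cosine similarity
index = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# search(): normalize the query, score by inner product, rank descending
query = embeddings[2] + 0.01 * rng.normal(size=8).astype("float32")
q = query / np.linalg.norm(query)
scores = index @ q
order = np.argsort(-scores)[:2]  # top_k = 2
results = [{"paper_id": paper_ids[i], "score": float(scores[i])} for i in order]
# The best match is paper 12, since the query is a perturbed copy of it
```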

TF-IDF Backend Details

The TF-IDF backend uses scikit-learn's TfidfVectorizer (5000 features, English stop words removed) followed by TruncatedSVD for dimensionality reduction. The model string encodes the target dimension:

  • "tfidf-svd-128" → 128 dimensions (default)
  • "tfidf-svd-256" → 256 dimensions

This backend requires no API keys and works offline, making it suitable for CI/CD pipelines and memory-constrained environments.
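That pipeline can be sketched directly with scikit-learn (toy corpus; reduced to 2 dimensions rather than 128 so the SVD fits such a small vocabulary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [
    "attention is all you need for sequence transduction",
    "deep learning methods for genomics and variant calling",
    "convolutional networks for image classification",
    "transformers applied to protein structure prediction",
]

# "tfidf-svd-128" → TF-IDF (max 5000 features, English stop words removed),
# then SVD down to 128 dimensions. Here we use n_components=2 for the toy data.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
svd = TruncatedSVD(n_components=2, random_state=0)
embeddings = svd.fit_transform(tfidf.fit_transform(texts))
print(embeddings.shape)  # (4, 2)
```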