Projections API

The projections module computes 2D projections from high-dimensional embedding vectors and clusters papers using K-Means with TF-IDF-based label generation.

Quick Example

from papertrail.projections import compute_projections, cluster_papers

# embeddings: np.ndarray of shape (n_papers, dim)
projections = compute_projections(embeddings)
# → {"pca": (n, 2), "tsne": (n, 2), "umap": (n, 2)}

texts = ["paper abstract one...", "paper abstract two..."]  # one text per row of embeddings
cluster_ids, labels = cluster_papers(embeddings, texts, n_clusters=15)
# cluster_ids: np.ndarray of shape (n,) with int cluster assignments
# labels: {0: "Genomics / Regulation / Enhancer", 1: "Protein / Structure / Folding", ...}

Functions

compute_projections

compute_projections(embeddings: ndarray, seed: int = 42) -> dict[str, ndarray]

Compute 2D projections from an embedding matrix.

PARAMETER DESCRIPTION
embeddings

Matrix of shape (n_papers, dim).

TYPE: ndarray

seed

Random seed for reproducibility.

TYPE: int DEFAULT: 42

RETURNS DESCRIPTION
dict[str, ndarray]

Keys "umap", "tsne", "pca", each mapping to (n_papers, 2) arrays.

Parameters:

Name | Type | Default | Description
embeddings | np.ndarray | (required) | Embedding matrix of shape (n_papers, dim). Any dimensionality.
seed | int | 42 | Random seed for reproducibility across PCA, t-SNE, and UMAP.

Returns: dict[str, np.ndarray] with keys:

Key | Shape | Algorithm | Notes
"pca" | (n, 2) | PCA | Fast, linear. Reports explained variance ratio.
"tsne" | (n, 2) | t-SNE | perplexity=min(30, n-1), metric="cosine".
"umap" | (n, 2) | UMAP | n_neighbors=15, min_dist=0.1, metric="cosine". If umap-learn is not installed, this slot holds a second t-SNE run with a different perplexity.

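The table above can be approximated with scikit-learn directly. The following is a minimal sketch under stated assumptions, not the module's actual implementation: sketch_projections is a hypothetical name, init="random" is chosen here only for portability across scikit-learn versions, and the fallback perplexity (half the primary value) is a guess, since the documentation says only that the fallback uses a different perplexity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def sketch_projections(embeddings: np.ndarray, seed: int = 42) -> dict:
    """Illustrative stand-in for compute_projections (hypothetical helper)."""
    n = embeddings.shape[0]
    perplexity = min(30, n - 1)  # rule quoted in the table above

    pca = PCA(n_components=2, random_state=seed)
    out = {"pca": pca.fit_transform(embeddings)}
    # pca.explained_variance_ratio_ now holds the variance share per component.

    def tsne(p: int) -> np.ndarray:
        model = TSNE(n_components=2, perplexity=p, metric="cosine",
                     init="random", random_state=seed)
        return model.fit_transform(embeddings)

    out["tsne"] = tsne(perplexity)

    try:
        import umap  # provided by the optional umap-learn package
        reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                            metric="cosine", random_state=seed)
        out["umap"] = reducer.fit_transform(embeddings)
    except ImportError:
        # Assumption: the fallback halves the perplexity; the real value is undocumented.
        out["umap"] = tsne(max(2, perplexity // 2))
    return out
```

Whichever branch runs, the "umap" key is always present, so downstream code does not need to check for the optional dependency.
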
cluster_papers

cluster_papers(embeddings: ndarray, texts: list[str], n_clusters: int | str = 'auto', seed: int = 42, papers: list[dict] | None = None, projections: dict[str, ndarray] | None = None, cluster_on_projections: bool = True) -> tuple[ndarray, dict[int, str]]

Cluster papers and generate labels from TF-IDF top terms.

PARAMETER DESCRIPTION
embeddings

Embedding matrix.

TYPE: ndarray

texts

Paper texts for label generation.

TYPE: list[str]

n_clusters

Number of clusters. "auto" uses silhouette-based estimation.

TYPE: int or 'auto' DEFAULT: 'auto'

seed

Random seed.

TYPE: int DEFAULT: 42

papers

Paper dicts for LLM label generation.

TYPE: list[dict] DEFAULT: None

projections

2D projection arrays (e.g. {"umap": array, "tsne": array}).

TYPE: dict[str, ndarray] DEFAULT: None

cluster_on_projections

If True and projections are available, cluster on the 2D projection (default: UMAP) instead of high-dimensional embeddings. This makes clusters align visually with the map.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
tuple[ndarray, dict[int, str]]

Cluster IDs array and {cluster_id: label} dict.
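
One plausible reading of n_clusters="auto" and cluster_on_projections is sketched below. This is an illustration only: estimate_k and pick_features are hypothetical helpers, and the candidate range of 2 to 8 clusters is an assumption, since the search range used by the module is not documented here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def estimate_k(features: np.ndarray, k_min: int = 2, k_max: int = 8, seed: int = 42) -> int:
    """Return the k in [k_min, k_max] with the highest mean silhouette score."""
    k_max = min(k_max, features.shape[0] - 1)  # silhouette needs k <= n - 1
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        ids = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
        score = silhouette_score(features, ids)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def pick_features(embeddings, projections=None, cluster_on_projections=True, key="umap"):
    """Use the 2D projection when requested and available, else the embeddings."""
    if cluster_on_projections and projections and key in projections:
        return projections[key]
    return embeddings
```

Clustering on the 2D array trades some fidelity to the original embedding space for clusters that match what a reader sees on the map.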

Parameters:

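Name | Type | Default | Description
embeddings | np.ndarray | (required) | Embedding matrix of shape (n_papers, dim).
texts | list[str] | (required) | Paper texts (title + abstract) for generating cluster labels via TF-IDF.
n_clusters | int or "auto" | "auto" | Number of K-Means clusters. "auto" uses silhouette-based estimation.
seed | int | 42 | Random seed.
papers | list[dict] | None | Paper dicts for LLM label generation.
projections | dict[str, np.ndarray] | None | 2D projection arrays (e.g. {"umap": array, "tsne": array}).
cluster_on_projections | bool | True | If True and projections are available, cluster on the 2D projection (default: UMAP) instead of the high-dimensional embeddings.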

Returns: tuple[np.ndarray, dict[int, str]]

Element | Type | Description
cluster_ids | np.ndarray | Integer array of shape (n_papers,). Each value is a cluster ID from 0 to n_clusters - 1.
labels | dict[int, str] | Mapping of cluster ID → human-readable label. Labels are the top 3 TF-IDF terms for that cluster, title-cased and joined with " / " (e.g. "Genomics / Regulation / Enhancer").

Dependencies

Package | Required | Notes
scikit-learn | Yes | PCA, t-SNE, K-Means, TF-IDF
umap-learn | Optional | If missing, UMAP slot uses a second t-SNE with different perplexity