Projections API

The projections module computes 2D projections from high-dimensional embedding vectors and clusters papers using K-Means with TF-IDF-based label generation.

Quick Example

from papertrail.projections import compute_projections, cluster_papers

# embeddings: np.ndarray of shape (n_papers, dim), one row per paper
projections = compute_projections(embeddings)
# → dict with keys "pca", "tsne", "umap", each an array of shape (n_papers, 2)

# texts: one string per paper, in the same order as the rows of embeddings
texts = ["paper abstract one...", "paper abstract two..."]
cluster_ids, labels = cluster_papers(embeddings, texts, n_clusters=15)
# cluster_ids: np.ndarray of shape (n_papers,) with int cluster assignments
# labels: {0: "Genomics / Regulation / Enhancer", 1: "Protein / Structure / Folding", ...}

Functions

compute_projections

compute_projections(embeddings: ndarray, seed: int = 42) -> dict[str, ndarray]

Compute 2D projections from an embedding matrix.

PARAMETER	DESCRIPTION
`embeddings`	Matrix of shape (n_papers, dim). TYPE: `ndarray`
`seed`	Random seed for reproducibility. TYPE: `int` DEFAULT: `42`

RETURNS	DESCRIPTION
`dict[str, ndarray]`	Keys "umap", "tsne", "pca", each mapping to (n_papers, 2) arrays.

compute_projections

Parameters:

Name	Type	Default	Description
`embeddings`	`np.ndarray`	(required)	Embedding matrix of shape `(n_papers, dim)`. `dim` may be any embedding dimensionality.
`seed`	`int`	`42`	Random seed for reproducibility across PCA, t-SNE, and UMAP.

Returns: dict[str, np.ndarray] with keys:

Key	Shape	Algorithm	Notes
`"pca"`	`(n, 2)`	PCA	Fast, linear. Reports explained variance ratio.
`"tsne"`	`(n, 2)`	t-SNE	`perplexity=min(30, n-1)`, `metric="cosine"`.
`"umap"`	`(n, 2)`	UMAP	`n_neighbors=15`, `min_dist=0.1`, `metric="cosine"`. Falls back to t-SNE if `umap-learn` is not installed.
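The settings in the table can be sketched with scikit-learn and umap-learn as follows. This is an illustration under stated assumptions, not the module's actual code: `tsne_perplexity` and `sketch_projections` are hypothetical names, and the perplexity used for the fallback t-SNE run (half the primary value) is a guess, since the source only says it differs.

```python
def tsne_perplexity(n_papers: int) -> int:
    """Perplexity rule from the table: capped at 30 and below the sample count."""
    return min(30, n_papers - 1)


def sketch_projections(embeddings, seed: int = 42) -> dict:
    """Hypothetical stand-in for compute_projections (not the library's code)."""
    # Imported lazily so tsne_perplexity works without scikit-learn installed.
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    perplexity = tsne_perplexity(len(embeddings))

    def run_tsne(p):
        # Cosine metric, as listed for the "tsne" key.
        return TSNE(n_components=2, perplexity=p, metric="cosine",
                    random_state=seed).fit_transform(embeddings)

    out = {
        "pca": PCA(n_components=2, random_state=seed).fit_transform(embeddings),
        "tsne": run_tsne(perplexity),
    }
    try:
        import umap  # provided by the optional umap-learn package

        out["umap"] = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                                metric="cosine",
                                random_state=seed).fit_transform(embeddings)
    except ImportError:
        # ASSUMPTION: the fallback perplexity is illustrative; the source only
        # states that a second t-SNE with a different perplexity is used.
        out["umap"] = run_tsne(max(1, perplexity // 2))
    return out
```

Because the `"umap"` key is always present, callers do not need to branch on whether umap-learn is installed, but they should not assume the array under that key came from UMAP.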

cluster_papers

cluster_papers(embeddings: ndarray, texts: list[str], n_clusters: int | str = 'auto', seed: int = 42, papers: list[dict] | None = None, projections: dict[str, ndarray] | None = None, cluster_on_projections: bool = True) -> tuple[ndarray, dict[int, str]]

Cluster papers and generate labels from TF-IDF top terms.

PARAMETER	DESCRIPTION
`embeddings`	Embedding matrix. TYPE: `ndarray`
`texts`	Paper texts for label generation. TYPE: `list[str]`
`n_clusters`	Number of clusters. "auto" uses silhouette-based estimation. TYPE: `int or 'auto'` DEFAULT: `'auto'`
`seed`	Random seed. TYPE: `int` DEFAULT: `42`
`papers`	Paper dicts for LLM label generation. TYPE: `list[dict]` DEFAULT: `None`
`projections`	2D projection arrays (e.g. {"umap": array, "tsne": array}). TYPE: `dict[str, ndarray]` DEFAULT: `None`
`cluster_on_projections`	If True and projections are available, cluster on the 2D projection (default: UMAP) instead of high-dimensional embeddings. This makes clusters align visually with the map. TYPE: `bool` DEFAULT: `True`
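The interaction between `projections` and `cluster_on_projections` can be pictured with a small helper. This is a sketch only: `pick_clustering_input` is a hypothetical name, and falling back to whichever projection is available when `"umap"` is absent is an assumption, not documented behavior.

```python
def pick_clustering_input(embeddings, projections=None,
                          cluster_on_projections=True, preferred="umap"):
    """Return the matrix that clustering would run on (illustrative only)."""
    if cluster_on_projections and projections:
        if preferred in projections:
            return projections[preferred]
        # ASSUMPTION: use any available projection if the preferred key is missing.
        return next(iter(projections.values()))
    # No projections supplied, or clustering on projections disabled.
    return embeddings
```

Clustering on the 2D array trades some fidelity to the original embedding space for clusters that look contiguous on the plotted map.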

RETURNS	DESCRIPTION
`tuple[ndarray, dict[int, str]]`	Cluster IDs array and {cluster_id: label} dict.
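To make "silhouette-based estimation" concrete, here is a dependency-free sketch that tries each candidate cluster count and keeps the one with the highest mean silhouette score. It is not the module's implementation: the function names, the candidate range, and the deterministic farthest-point seeding are all assumptions, and a real implementation would more likely use scikit-learn's `KMeans` and `silhouette_score`.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = []
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            # Keep the old center if a cluster ends up empty.
            updated.append(tuple(sum(xs) / len(members) for xs in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return assign


def silhouette(points, assign):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, a in zip(points, assign):
        clusters.setdefault(a, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, a in zip(points, assign):
        own = clusters[a]
        if len(own) == 1:
            continue
        intra = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        nearest = min(sum(math.dist(p, q) for q in other) / len(other)
                      for label, other in clusters.items() if label != a)
        denom = max(intra, nearest)
        total += (nearest - intra) / denom if denom > 0 else 0.0
    return total / len(points)


def estimate_n_clusters(points, k_min=2, k_max=10):
    """Pick the k in [k_min, k_max] with the best mean silhouette score."""
    k_max = min(k_max, len(points) - 1)
    return max(range(k_min, k_max + 1),
               key=lambda k: silhouette(points, kmeans(points, k)))


# Three tight, well-separated groups of four 2D points each.
blobs = [(cx + dx, cy + dy)
         for cx, cy in [(0, 0), (10, 0), (0, 10)]
         for dx in (0.0, 0.2) for dy in (0.0, 0.2)]
best_k = estimate_n_clusters(blobs)
```

The search fits one clustering per candidate count, so its cost grows with the size of the candidate range; passing an explicit integer for `n_clusters` skips the estimation.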

cluster_papers

Parameters:

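Name	Type	Default	Description
`embeddings`	`np.ndarray`	(required)	Embedding matrix of shape `(n_papers, dim)`.
`texts`	`list[str]`	(required)	Paper texts (title + abstract) for generating cluster labels via TF-IDF.
`n_clusters`	`int | str`	`'auto'`	Number of K-Means clusters, or `'auto'` to estimate it with a silhouette-based search.
`seed`	`int`	`42`	Random seed.
`papers`	`list[dict] | None`	`None`	Paper dicts for LLM label generation.
`projections`	`dict[str, np.ndarray] | None`	`None`	2D projection arrays, e.g. `{"umap": array, "tsne": array}`.
`cluster_on_projections`	`bool`	`True`	If `True` and projections are available, cluster on the 2D projection (UMAP by default) instead of the high-dimensional embeddings, so clusters align visually with the map.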

Returns: tuple[np.ndarray, dict[int, str]]

Element	Type	Description
`cluster_ids`	`np.ndarray`	Integer array of shape `(n_papers,)`. Each value is a cluster ID from `0` to `k - 1`, where `k` is `n_clusters` or, with `'auto'`, the estimated number of clusters.
`labels`	`dict[int, str]`	Mapping of cluster ID → human-readable label. Labels are the top 3 TF-IDF terms for that cluster, title-cased and joined with " / " (e.g. `"Genomics / Regulation / Enhancer"`).
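The label format can be reproduced with a short, dependency-free sketch. It is illustrative only: `format_label` and `sketch_cluster_labels` are hypothetical names, and pooling each cluster's texts into one document, tokenizing on whitespace, and using a smoothed IDF are assumptions about how the top terms are ranked, not a description of the module's code.

```python
import math
from collections import Counter, defaultdict


def format_label(terms, top_k=3):
    """Title-case the first top_k terms and join them with ' / '."""
    return " / ".join(term.title() for term in terms[:top_k])


def sketch_cluster_labels(texts, cluster_ids, top_k=3):
    """Rank terms per cluster by TF-IDF and build a label from the top ones."""
    # ASSUMPTION: all texts in a cluster are pooled into a single "document"
    # and tokenized by lowercasing and splitting on whitespace.
    pooled = defaultdict(Counter)
    for text, cid in zip(texts, cluster_ids):
        pooled[cid].update(text.lower().split())

    # Document frequency: in how many clusters each term appears.
    doc_freq = Counter()
    for counts in pooled.values():
        doc_freq.update(counts.keys())

    n_docs = len(pooled)
    labels = {}
    for cid, counts in pooled.items():
        total = sum(counts.values())

        def score(term):
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # smoothed IDF
            return counts[term] / total * idf

        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(counts, key=lambda t: (-score(t), t))
        labels[cid] = format_label(ranked, top_k)
    return labels
```

Without stop-word filtering, frequent function words can dominate the ranking on real abstracts, so a production version would need some form of vocabulary filtering.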

Dependencies

Package	Required	Notes
`scikit-learn`	Yes	PCA, t-SNE, K-Means, TF-IDF
`umap-learn`	Optional	If missing, the `"umap"` key holds a second t-SNE run with a different perplexity.