Projections API
The projections module computes 2D projections from high-dimensional embedding vectors and clusters papers using K-Means with TF-IDF-based label generation.
Quick Example
```python
from papertrail.projections import compute_projections, cluster_papers

# embeddings: np.ndarray of shape (n_papers, dim), assumed to exist already
projections = compute_projections(embeddings)
# → {"pca": (n, 2), "tsne": (n, 2), "umap": (n, 2)}

# texts: one string per paper
texts = ["paper abstract one...", "paper abstract two..."]
cluster_ids, labels = cluster_papers(embeddings, texts, n_clusters=15)
# cluster_ids: np.ndarray of shape (n,) with int cluster assignments
# labels: {0: "Genomics / Regulation / Enhancer", 1: "Protein / Structure / Folding", ...}
```
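The parameter table for cluster_papers describes each entry of texts as title + abstract. A minimal sketch of building that list from paper dicts is shown here. The helper name `build_texts` and the dict keys `"title"` and `"abstract"` are assumptions for illustration, not part of the documented API.

```python
def build_texts(papers):
    """Join title and abstract into one string per paper.

    Assumes hypothetical keys "title" and "abstract"; missing or empty
    fields are skipped rather than raising.
    """
    texts = []
    for paper in papers:
        title = (paper.get("title") or "").strip()
        abstract = (paper.get("abstract") or "").strip()
        # Keep only the parts that are present, separated by a space.
        texts.append(" ".join(part for part in (title, abstract) if part))
    return texts
```

The result keeps the input order, so row i of the embedding matrix and entry i of the list should refer to the same paper.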
Functions
compute_projections
Compute 2D projections from an embedding matrix.

| PARAMETER | DESCRIPTION |
|---|---|
| embeddings | Matrix of shape (n_papers, dim). TYPE: ndarray |
| seed | Random seed for reproducibility. TYPE: int, DEFAULT: 42 |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, ndarray] | Keys "umap", "tsne", "pca", each mapping to an array of shape (n_papers, 2). |
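To make the shape of a single entry concrete, the following sketch computes a PCA-style 2D projection and its explained variance ratio with plain NumPy. It is an illustration only: the Dependencies table lists scikit-learn as the PCA backend, and `pca_2d` is a hypothetical name, not a function of this module.

```python
import numpy as np


def pca_2d(embeddings):
    """Project rows onto the first two principal components.

    Returns (coords, ratio): coords has shape (n_papers, 2) and ratio
    holds the fraction of total variance captured by each component.
    """
    X = np.asarray(embeddings, dtype=float)
    Xc = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T
    variances = S ** 2
    ratio = variances[:2] / variances.sum()
    return coords, ratio
```

A low summed ratio suggests that the linear view discards most of the structure, which is one reason to compare it against the nonlinear projections.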
compute_projections
Parameters:

| Name | Type | Default | Description |
|---|---|---|---|
| embeddings | np.ndarray | (required) | Embedding matrix of shape (n_papers, dim). dim can be any size. |
| seed | int | 42 | Random seed for reproducibility across PCA, t-SNE, and UMAP. |

Returns: dict[str, np.ndarray] with keys:

| Key | Shape | Algorithm | Notes |
|---|---|---|---|
| "pca" | (n, 2) | PCA | Fast, linear. Reports explained variance ratio. |
| "tsne" | (n, 2) | t-SNE | perplexity=min(30, n-1), metric="cosine". |
| "umap" | (n, 2) | UMAP | n_neighbors=15, min_dist=0.1, metric="cosine". If umap-learn is not installed, this slot is filled by a second t-SNE run with a different perplexity. |
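The settings in the table can be written out as keyword arguments. The sketch below builds them as plain dicts. The helper names are hypothetical, and passing seed as `random_state` is an assumption about how the module wires the seed through.

```python
def tsne_params(n_papers, seed=42):
    """Keyword arguments for sklearn.manifold.TSNE, per the table above."""
    return {
        "n_components": 2,
        # t-SNE needs perplexity below the sample count, hence the clamp.
        "perplexity": min(30, n_papers - 1),
        "metric": "cosine",
        "random_state": seed,  # assumption: seed maps to random_state
    }


def umap_params(seed=42):
    """Keyword arguments for umap.UMAP, per the table above."""
    return {
        "n_components": 2,
        "n_neighbors": 15,
        "min_dist": 0.1,
        "metric": "cosine",
        "random_state": seed,  # assumption: seed maps to random_state
    }
```

With scikit-learn installed, `TSNE(**tsne_params(len(embeddings))).fit_transform(embeddings)` would yield an (n, 2) array. The clamp only matters for small collections of 30 papers or fewer.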
cluster_papers
cluster_papers(embeddings: ndarray, texts: list[str], n_clusters: int | str = 'auto', seed: int = 42, papers: list[dict] | None = None, projections: dict[str, ndarray] | None = None, cluster_on_projections: bool = True) -> tuple[ndarray, dict[int, str]]
Cluster papers and generate labels from TF-IDF top terms.

| PARAMETER | DESCRIPTION |
|---|---|
| embeddings | Embedding matrix. TYPE: ndarray |
| texts | Paper texts for label generation. TYPE: list[str] |
| n_clusters | Number of clusters. "auto" uses silhouette-based estimation. TYPE: int \| str, DEFAULT: 'auto' |
| seed | Random seed. TYPE: int, DEFAULT: 42 |
| papers | Paper dicts for LLM label generation. TYPE: list[dict] \| None, DEFAULT: None |
| projections | 2D projection arrays (e.g. {"umap": array, "tsne": array}). TYPE: dict[str, ndarray] \| None, DEFAULT: None |
| cluster_on_projections | If True and projections are available, cluster on the 2D projection (default: UMAP) instead of the high-dimensional embeddings, so that clusters align visually with the map. TYPE: bool, DEFAULT: True |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[ndarray, dict[int, str]] | Cluster IDs array and {cluster_id: label} dict. |
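Two of the parameters above interact: cluster_on_projections decides which matrix is clustered, and n_clusters="auto" decides how many clusters to use. The sketch below shows one way this could work using scikit-learn. Both function names, the k search range, and the fallback to any available projection when "umap" is absent are assumptions, not the module's confirmed behaviour.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_clustering_input(embeddings, projections=None,
                            cluster_on_projections=True, preferred="umap"):
    """Return a 2D projection if allowed and present, else the embeddings."""
    if cluster_on_projections and projections:
        if preferred in projections:
            return projections[preferred]
        # Assumption: use whichever projection is available.
        return next(iter(projections.values()))
    return embeddings


def estimate_n_clusters(X, k_min=2, k_max=10, seed=42):
    """Pick the k with the highest mean silhouette score."""
    # silhouette_score requires 2 <= k <= n_samples - 1
    k_max = min(k_max, len(X) - 1)
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        ids = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, ids)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

Clustering on the 2D projection keeps clusters visually contiguous on the map. The trade-off is that structure lost in the projection is also invisible to K-Means.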
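cluster_papers
Parameters:

| Name | Type | Default | Description |
|---|---|---|---|
| embeddings | np.ndarray | (required) | Embedding matrix of shape (n_papers, dim). |
| texts | list[str] | (required) | Paper texts (title + abstract) for generating cluster labels via TF-IDF. |
| n_clusters | int \| str | "auto" | Number of K-Means clusters, or "auto" for silhouette-based estimation. |
| seed | int | 42 | Random seed. |
| papers | list[dict] \| None | None | Paper dicts for LLM label generation. |
| projections | dict[str, np.ndarray] \| None | None | 2D projection arrays (e.g. {"umap": array, "tsne": array}). |
| cluster_on_projections | bool | True | If True and projections are available, cluster on the 2D projection (default: UMAP) instead of the high-dimensional embeddings. |

Returns: tuple[np.ndarray, dict[int, str]]

| Element | Type | Description |
|---|---|---|
| cluster_ids | np.ndarray | Integer array of shape (n_papers,). Each value is a cluster ID from 0 to k - 1, where k is n_clusters or, with "auto", the estimated number of clusters. |
| labels | dict[int, str] | Mapping of cluster ID → human-readable label. Labels are the top 3 TF-IDF terms for that cluster, title-cased and joined with " / " (e.g. "Genomics / Regulation / Enhancer"). |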
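The label format can be illustrated with a small standard-library sketch. It treats each cluster as one document, scores terms with a smoothed TF-IDF, and joins the top three. The function name, the tokenizer, the per-cluster aggregation, and the alphabetical tie-break are all assumptions. The module itself uses scikit-learn for TF-IDF and may weight or filter terms differently (this sketch does no stop-word filtering).

```python
import math
import re
from collections import Counter, defaultdict


def label_clusters(texts, cluster_ids, top_k=3):
    """Return {cluster_id: "Term / Term / Term"} from per-cluster TF-IDF."""
    # Pool the tokens of all papers that share a cluster.
    cluster_tokens = defaultdict(list)
    for text, cid in zip(texts, cluster_ids):
        cluster_tokens[int(cid)].extend(re.findall(r"[a-z]+", text.lower()))

    # Document frequency: in how many clusters does each term occur?
    n_clusters = len(cluster_tokens)
    df = Counter()
    for tokens in cluster_tokens.values():
        df.update(set(tokens))

    labels = {}
    for cid, tokens in cluster_tokens.items():
        tf = Counter(tokens)
        # Smoothed IDF; terms found in every cluster get the lowest weight.
        scores = {
            term: count * (math.log((1 + n_clusters) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        top = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        labels[cid] = " / ".join(term.title() for term in top)
    return labels
```

Terms that appear in every cluster score lowest, which is what pushes generic vocabulary out of the labels.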
Dependencies

| Package | Required | Notes |
|---|---|---|
| scikit-learn | Yes | PCA, t-SNE, K-Means, TF-IDF |
| umap-learn | Optional | If missing, the "umap" slot is filled by a second t-SNE run with a different perplexity. |
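Callers who want to know in advance which algorithm will fill the "umap" slot can check for the optional package themselves. The sketch below uses only the standard library. The helper names are hypothetical, and it relies on umap-learn being importable as `umap`.

```python
import importlib.util


def umap_available() -> bool:
    """True if the umap-learn package (import name "umap") can be found."""
    return importlib.util.find_spec("umap") is not None


def umap_slot_algorithm() -> str:
    """Name the algorithm expected behind the "umap" key."""
    return "umap" if umap_available() else "tsne"
```

This helps when comparing maps across machines, because a t-SNE fallback under the "umap" key will not look like a true UMAP layout.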