Data API

The data module downloads and manages pre-scraped paper datasets from PaperTrail GitHub Releases. This lets you skip the Slack scraping step and jump straight to enrichment, embedding, or analysis.

Quick Example

from papertrail.data import download_release, load_papers, list_releases

# See what's available
for r in list_releases():
    print(r["tag"], r["date"], len(r["assets"]), "assets")

# Download the latest release (auto-detected)
data_dir = download_release()
# → ~/.papertrail/data/v0.1.0-data-2026-04-04/

# Load papers
papers = load_papers()
print(f"{len(papers)} papers loaded")

# Load a specific dataset
enriched = load_papers(which="enriched")
scrapes = load_papers(which="scrapes")  # dict of {filename: data}

Functions

list_releases

list_releases(repo: str = GITHUB_REPO, data_only: bool = True) -> list[dict[str, Any]]

List available data releases from the GitHub repository.

Parameters:
  repo (str, default GITHUB_REPO): GitHub repository in owner/name format.
  data_only (bool, default True): If True, only return releases whose tag contains "data" (filters out code-only releases).

Returns:
  list[dict]: Each dict has keys tag, date, description, assets (list of filenames), and url (HTML URL).
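
The data_only filter can be pictured as a simple tag check. This is a hypothetical sketch of that behaviour, not the library's actual source:

```python
# Hypothetical sketch of the data_only filter: keep only releases whose
# tag contains "data", dropping code-only releases.
def filter_data_releases(releases: list[dict]) -> list[dict]:
    return [r for r in releases if "data" in r["tag"]]

releases = [
    {"tag": "v0.1.0", "assets": []},  # code-only release
    {"tag": "v0.1.0-data-2026-04-04", "assets": ["all_papers_merged.json.gz"]},
]
print([r["tag"] for r in filter_data_releases(releases)])
# → ['v0.1.0-data-2026-04-04']
```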

get_latest_release

get_latest_release(repo: str = GITHUB_REPO) -> Optional[dict[str, Any]]

Get the most recent data release.

Parameters:
  repo (str, default GITHUB_REPO): GitHub repository in owner/name format.

Returns:
  dict or None: Release info dict (same format as list_releases entries), or None if no data releases exist.

download_release

download_release(tag: Optional[str] = None, data_dir: Optional[str | Path] = None, repo: str = GITHUB_REPO, assets: Optional[list[str]] = None, force: bool = False) -> Path

Download release assets from GitHub and decompress them.

Parameters:
  tag (str, default None): Release tag (e.g. "v0.1.0-data-2026-04-04"). If None, downloads the latest data release.
  data_dir (str or Path, default None): Directory to store downloaded files. Defaults to ~/.papertrail/data/{tag}/.
  repo (str, default GITHUB_REPO): GitHub repository in owner/name format.
  assets (list[str], default None): Specific asset filenames to download. If None, downloads all assets in the release.
  force (bool, default False): If True, re-download even if files already exist locally.

Returns:
  Path: Directory containing the downloaded (and decompressed) files.

Raises:
  ValueError: If no matching release is found.
  HTTPError: If a download fails.
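
The default directory layout and the force flag's caching behaviour can be sketched as follows. This is an assumed illustration of the documented defaults (target_dir and needs_download are hypothetical helpers, not part of the API):

```python
import tempfile
from pathlib import Path

# Hypothetical helpers illustrating the documented defaults:
# files land in ~/.papertrail/data/{tag}/ and are reused unless force=True.
def target_dir(tag: str, data_dir=None) -> Path:
    return Path(data_dir) if data_dir else Path.home() / ".papertrail" / "data" / tag

def needs_download(dest: Path, force: bool = False) -> bool:
    return force or not dest.exists()

tag = "v0.1.0-data-2026-04-04"
print(target_dir(tag))  # e.g. /home/you/.papertrail/data/v0.1.0-data-2026-04-04

with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "all_papers_merged.json.gz"
    print(needs_download(dest))              # → True (nothing cached yet)
    dest.write_bytes(b"")                    # simulate a completed download
    print(needs_download(dest))              # → False (cached copy reused)
    print(needs_download(dest, force=True))  # → True (force re-download)
```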

load_papers

load_papers(data_dir: Optional[str | Path] = None, tag: Optional[str] = None, which: str = 'merged') -> list[dict[str, Any]]

Load papers from a downloaded release.

Parameters:
  data_dir (str or Path, default None): Directory containing downloaded data files. If None, uses ~/.papertrail/data/{tag}/.
  tag (str, default None): Release tag, used to locate the data directory if data_dir is not given. If both are None, uses the latest release.
  which (str, default 'merged'): Which dataset to load:
    • "merged": raw merged papers (all_papers_merged.json)
    • "enriched": papers with metadata (enrich_checkpoint.json or papers_enriched.json)
    • "final": papers with embeddings (papers_final.json)
    • "scrapes": per-channel scrapes (returns a dict of {filename: data})

Returns:
  list[dict] or dict: Paper records. For which="scrapes", returns a dict mapping filenames to their contents.

Raises:
  FileNotFoundError: If the data directory or expected files don't exist; the error message suggests running download_release() first.
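
The which argument amounts to a dispatch from dataset name to filename(s). A speculative sketch of that mapping, inferred from the dataset descriptions in this page (the mapping table and resolve_files helper are illustrative, not the library's actual internals):

```python
# Hypothetical which → filename dispatch, inferred from the dataset list
# above. For "enriched", papers_enriched.json is preferred and the
# checkpoint is the fallback. "scrapes" is omitted here because it loads
# a whole directory of per-channel files rather than a single file.
WHICH_TO_FILES = {
    "merged": ["all_papers_merged.json"],
    "enriched": ["papers_enriched.json", "enrich_checkpoint.json"],
    "final": ["papers_final.json"],
}

def resolve_files(which: str) -> list[str]:
    try:
        return WHICH_TO_FILES[which]
    except KeyError:
        raise ValueError(f"unknown dataset: {which!r}") from None

print(resolve_files("enriched"))
# → ['papers_enriched.json', 'enrich_checkpoint.json']
```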

data_summary

data_summary(data_dir: Optional[str | Path] = None, tag: Optional[str] = None) -> dict[str, Any]

Summarize available data in a release directory.

Parameters:
  data_dir (str or Path, default None): Directory to inspect.
  tag (str, default None): Release tag, used to locate the directory if data_dir is None.

Returns:
  dict: Summary with keys tag, files (list of dicts with name, size_kb, records), total_papers, and enriched_count.
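
A summary of that shape can be computed by walking the directory and counting records per JSON file. This sketch shows one plausible way to produce the files and total_papers keys (assumed logic, not the library's implementation; the tag and enriched_count keys are omitted for brevity):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical summariser: one entry per JSON file, with size in KB and a
# record count when the file holds a list of papers.
def summarize(data_dir: Path) -> dict:
    files, total = [], 0
    for p in sorted(data_dir.glob("*.json")):
        records = json.loads(p.read_text())
        n = len(records) if isinstance(records, list) else None
        files.append({"name": p.name,
                      "size_kb": round(p.stat().st_size / 1024, 1),
                      "records": n})
        if p.name == "all_papers_merged.json" and n:
            total = n
    return {"files": files, "total_papers": total}

with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "all_papers_merged.json").write_text(json.dumps([{"id": 1}, {"id": 2}]))
    print(summarize(d)["total_papers"])  # → 2
```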

Datasets

Each release may contain these assets:

Asset                       which=      Description
all_papers_merged.json.gz   "merged"    Deduplicated papers from all channels, with extracted IDs
enrich_checkpoint.json.gz   "enriched"  Papers with metadata from OpenAlex + Semantic Scholar
papers_enriched.json.gz     "enriched"  Fully enriched papers (preferred over the checkpoint)
papers_final.json.gz        "final"     Papers with embeddings, projections, and clusters
channel_scrapes.tar.gz      "scrapes"   Per-channel raw Slack scrapes (extracted to individual JSON files)

Storage

Downloaded files are stored at ~/.papertrail/data/{tag}/. Each release gets its own subdirectory, so multiple snapshots can coexist.

Compressed files (.json.gz, .tar.gz) are automatically decompressed after download. The originals are kept alongside the decompressed versions.
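
The decompress-and-keep-original step for a .json.gz asset can be sketched with the standard library (an illustration of the described behaviour, not the library's actual code):

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Sketch of the documented decompression step: foo.json.gz is expanded to
# foo.json, and the .gz original is left in place alongside it.
def decompress_keep_original(gz_path: Path) -> Path:
    out_path = gz_path.with_suffix("")  # strip the trailing .gz
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

with tempfile.TemporaryDirectory() as tmp:
    gz = Path(tmp) / "papers.json.gz"
    with gzip.open(gz, "wt") as f:
        f.write('[{"title": "example"}]')
    out = decompress_keep_original(gz)
    print(out.name, gz.exists())  # → papers.json True
```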

Authentication

Public repositories don't require authentication. For private repos or to avoid rate limits, set GITHUB_TOKEN or GH_TOKEN:

export GITHUB_TOKEN="ghp_..."
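
On the Python side, a token from either variable would typically be attached as a Bearer header on GitHub API requests. A sketch under that assumption (auth_headers is a hypothetical helper, not part of the API):

```python
import os

# Hypothetical helper: build GitHub API headers, preferring GITHUB_TOKEN
# over GH_TOKEN, and omitting Authorization when neither is set.
def auth_headers(env=os.environ) -> dict:
    token = env.get("GITHUB_TOKEN") or env.get("GH_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers

print(auth_headers({"GITHUB_TOKEN": "ghp_example"}))
# → {'Accept': 'application/vnd.github+json', 'Authorization': 'Bearer ghp_example'}
```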

CLI Usage

# Download latest data
papertrail download

# Download specific release
papertrail download --tag v0.1.0-data-2026-04-04

# Download to custom directory
papertrail download --data-dir ./my_data

# List available releases
papertrail releases