Data API

The data module downloads and manages pre-scraped paper datasets from PaperTrail GitHub Releases. This lets you skip the Slack scraping step and jump straight to enrichment, embedding, or analysis.

Quick Example

from papertrail.data import download_release, load_papers, list_releases

# See what's available
for r in list_releases():
    print(r["tag"], r["date"], len(r["assets"]), "assets")

# Download the latest release (auto-detected)
data_dir = download_release()
# → ~/.papertrail/data/v0.1.0-data-2026-04-04/

# Load papers
papers = load_papers()
print(f"{len(papers)} papers loaded")

# Load a specific dataset
enriched = load_papers(which="enriched")
scrapes = load_papers(which="scrapes")  # dict of {filename: data}

Functions

list_releases

list_releases(repo: str = GITHUB_REPO, data_only: bool = True) -> list[dict[str, Any]]

List available data releases from the GitHub repository.

Parameters:
  repo (str, default GITHUB_REPO): GitHub repository in owner/name format.
  data_only (bool, default True): If True, only return releases whose tag contains "data" (filters out code-only releases).

Returns:
  list[dict]: Each dict has keys tag, date, description, assets (list of filenames), and url (HTML URL).
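
The data_only filter can be pictured as a simple tag check. This is a hypothetical sketch of that behaviour, not the library's actual source:

```python
# Hypothetical sketch of the data_only filter: keep only releases whose
# tag contains "data", dropping code-only releases.
def filter_data_releases(releases: list[dict]) -> list[dict]:
    return [r for r in releases if "data" in r["tag"]]

releases = [
    {"tag": "v0.1.0", "assets": []},  # code-only release
    {"tag": "v0.1.0-data-2026-04-04", "assets": ["all_papers_merged.json.gz"]},
]
print([r["tag"] for r in filter_data_releases(releases)])
# → ['v0.1.0-data-2026-04-04']
```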

get_latest_release

get_latest_release(repo: str = GITHUB_REPO) -> Optional[dict[str, Any]]

Get the most recent data release.

Parameters:
  repo (str, default GITHUB_REPO): GitHub repository in owner/name format.

Returns:
  dict or None: Release info dict (same format as list_releases entries), or None if no data releases exist.

download_release

download_release(tag: Optional[str] = None, data_dir: Optional[str | Path] = None, repo: str = GITHUB_REPO, assets: Optional[list[str]] = None, force: bool = False) -> Path

Download release assets from GitHub and decompress them.

Parameters:
  tag (str, default None): Release tag (e.g. "v0.1.0-data-2026-04-04"). If None, downloads the latest data release.
  data_dir (str or Path, default None): Directory to store downloaded files. Defaults to ~/.papertrail/data/{tag}/.
  repo (str, default GITHUB_REPO): GitHub repository in owner/name format.
  assets (list[str], default None): Specific asset filenames to download. If None, downloads all assets in the release.
  force (bool, default False): If True, re-download even if files already exist locally.

Returns:
  Path: Directory containing the downloaded (and decompressed) files.

Raises:
  ValueError: If no matching release is found.
  HTTPError: If a download fails.
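
The default directory layout and the force flag's caching behaviour can be sketched as follows. This is an assumed illustration of the documented defaults (target_dir and needs_download are hypothetical helpers, not part of the API):

```python
import tempfile
from pathlib import Path

# Hypothetical helpers illustrating the documented defaults:
# files land in ~/.papertrail/data/{tag}/ and are reused unless force=True.
def target_dir(tag: str, data_dir=None) -> Path:
    return Path(data_dir) if data_dir else Path.home() / ".papertrail" / "data" / tag

def needs_download(dest: Path, force: bool = False) -> bool:
    return force or not dest.exists()

tag = "v0.1.0-data-2026-04-04"
print(target_dir(tag))  # e.g. /home/you/.papertrail/data/v0.1.0-data-2026-04-04

with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "all_papers_merged.json.gz"
    print(needs_download(dest))              # → True (nothing cached yet)
    dest.write_bytes(b"")                    # simulate a completed download
    print(needs_download(dest))              # → False (cached copy reused)
    print(needs_download(dest, force=True))  # → True (force re-download)
```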

load_papers

load_papers(data_dir: Optional[str | Path] = None, tag: Optional[str] = None, which: str = 'merged') -> list[dict[str, Any]]

Load papers from a downloaded release.

Parameters:
  data_dir (str or Path, default None): Directory containing downloaded data files. If None, uses ~/.papertrail/data/{tag}/.
  tag (str, default None): Release tag, used to locate the data directory if data_dir is not given. If both are None, uses the latest release.
  which (str, default 'merged'): Which dataset to load:
    • "merged": raw merged papers (all_papers_merged.json)
    • "enriched": papers with metadata (enrich_checkpoint.json or papers_enriched.json)
    • "final": papers with embeddings (papers_final.json)
    • "scrapes": per-channel scrapes (returns a dict of {filename: data})

Returns:
  list[dict] or dict: Paper records. For which="scrapes", returns a dict mapping filenames to their contents.

Raises:
  FileNotFoundError: If the data directory or expected files don't exist; the error message suggests running download_release() first.
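
The which argument amounts to a dispatch from dataset name to filename(s). A speculative sketch of that mapping, inferred from the dataset descriptions in this page (the mapping table and resolve_files helper are illustrative, not the library's actual internals):

```python
# Hypothetical which → filename dispatch, inferred from the dataset list
# above. For "enriched", papers_enriched.json is preferred and the
# checkpoint is the fallback. "scrapes" is omitted here because it loads
# a whole directory of per-channel files rather than a single file.
WHICH_TO_FILES = {
    "merged": ["all_papers_merged.json"],
    "enriched": ["papers_enriched.json", "enrich_checkpoint.json"],
    "final": ["papers_final.json"],
}

def resolve_files(which: str) -> list[str]:
    try:
        return WHICH_TO_FILES[which]
    except KeyError:
        raise ValueError(f"unknown dataset: {which!r}") from None

print(resolve_files("enriched"))
# → ['papers_enriched.json', 'enrich_checkpoint.json']
```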

data_summary

data_summary(data_dir: Optional[str | Path] = None, tag: Optional[str] = None) -> dict[str, Any]

Summarize available data in a release directory.

Parameters:
  data_dir (str or Path, default None): Directory to inspect.
  tag (str, default None): Release tag, used to locate the directory if data_dir is None.

Returns:
  dict: Summary with keys tag, files (list of dicts with name, size_kb, records), total_papers, and enriched_count.
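
A summary of that shape can be computed by walking the directory and counting records per JSON file. This sketch shows one plausible way to produce the files and total_papers keys (assumed logic, not the library's implementation; the tag and enriched_count keys are omitted for brevity):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical summariser: one entry per JSON file, with size in KB and a
# record count when the file holds a list of papers.
def summarize(data_dir: Path) -> dict:
    files, total = [], 0
    for p in sorted(data_dir.glob("*.json")):
        records = json.loads(p.read_text())
        n = len(records) if isinstance(records, list) else None
        files.append({"name": p.name,
                      "size_kb": round(p.stat().st_size / 1024, 1),
                      "records": n})
        if p.name == "all_papers_merged.json" and n:
            total = n
    return {"files": files, "total_papers": total}

with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "all_papers_merged.json").write_text(json.dumps([{"id": 1}, {"id": 2}]))
    print(summarize(d)["total_papers"])  # → 2
```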

Datasets

Each release may contain these assets:

Asset                       which=      Description
all_papers_merged.json.gz   "merged"    Deduplicated papers from all channels, with extracted IDs
enrich_checkpoint.json.gz   "enriched"  Papers with metadata from OpenAlex + Semantic Scholar
papers_enriched.json.gz     "enriched"  Fully enriched papers (preferred over the checkpoint)
papers_final.json.gz        "final"     Papers with embeddings, projections, and clusters
channel_scrapes.tar.gz      "scrapes"   Per-channel raw Slack scrapes (extracted to individual JSON files)

Storage

Downloaded files are stored at ~/.papertrail/data/{tag}/. Each release gets its own subdirectory, so multiple snapshots can coexist.

Compressed files (.json.gz, .tar.gz) are automatically decompressed after download. The originals are kept alongside the decompressed versions.
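
The decompress-and-keep-original step for a .json.gz asset can be sketched with the standard library (an illustration of the described behaviour, not the library's actual code):

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Sketch of the documented decompression step: foo.json.gz is expanded to
# foo.json, and the .gz original is left in place alongside it.
def decompress_keep_original(gz_path: Path) -> Path:
    out_path = gz_path.with_suffix("")  # strip the trailing .gz
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

with tempfile.TemporaryDirectory() as tmp:
    gz = Path(tmp) / "papers.json.gz"
    with gzip.open(gz, "wt") as f:
        f.write('[{"title": "example"}]')
    out = decompress_keep_original(gz)
    print(out.name, gz.exists())  # → papers.json True
```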

Authentication

Public repositories don't require authentication. For private repos or to avoid rate limits, set GITHUB_TOKEN or GH_TOKEN:

export GITHUB_TOKEN="ghp_..."
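
On the Python side, a token from either variable would typically be attached as a Bearer header on GitHub API requests. A sketch under that assumption (auth_headers is a hypothetical helper, not part of the API):

```python
import os

# Hypothetical helper: build GitHub API headers, preferring GITHUB_TOKEN
# over GH_TOKEN, and omitting Authorization when neither is set.
def auth_headers(env=os.environ) -> dict:
    token = env.get("GITHUB_TOKEN") or env.get("GH_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers

print(auth_headers({"GITHUB_TOKEN": "ghp_example"}))
# → {'Accept': 'application/vnd.github+json', 'Authorization': 'Bearer ghp_example'}
```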

CLI Usage

# Download latest data
papertrail download

# Download specific release
papertrail download --tag v0.1.0-data-2026-04-04

# Download to custom directory
papertrail download --data-dir ./my_data

# List available releases
papertrail releases