# Data API
The data module downloads and manages pre-scraped paper datasets from PaperTrail GitHub Releases. This lets you skip the Slack scraping step and jump straight to enrichment, embedding, or analysis.
## Quick Example

```python
from papertrail.data import download_release, load_papers, list_releases

# See what's available
for r in list_releases():
    print(r["tag"], r["date"], len(r["assets"]), "assets")

# Download the latest release (auto-detected)
data_dir = download_release()
# → ~/.papertrail/data/v0.1.0-data-2026-04-04/

# Load papers
papers = load_papers()
print(f"{len(papers)} papers loaded")

# Load a specific dataset
enriched = load_papers(which="enriched")
scrapes = load_papers(which="scrapes")  # dict of {filename: data}
```
## Functions
### list_releases

List available data releases from the GitHub repository.

| PARAMETER | DESCRIPTION |
|---|---|
| `repo` | GitHub repository to query. <br>**TYPE:** `str` |
| `data_only` | If True, only return releases whose tag marks a data release (tags like `v0.1.0-data-2026-04-04`). <br>**TYPE:** `bool` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[dict]` | One dict per release, with keys including `tag`, `date`, and `assets`. |
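The `data_only` filter can be pictured as a simple check on the tag string. A minimal, self-contained sketch using made-up release dicts shaped like the Quick Example output (`filter_data_releases` is a hypothetical helper, not part of the API):

```python
# Hypothetical release dicts, shaped like list_releases() output
# (keys seen in the Quick Example: "tag", "date", "assets").
releases = [
    {"tag": "v0.2.0", "date": "2026-03-01", "assets": ["source.zip"]},
    {"tag": "v0.1.0-data-2026-04-04", "date": "2026-04-04",
     "assets": ["all_papers_merged.json.gz", "papers_final.json.gz"]},
]

def filter_data_releases(releases):
    # Keep only releases whose tag marks a data snapshot.
    return [r for r in releases if "data" in r["tag"]]

for r in filter_data_releases(releases):
    print(r["tag"], r["date"], len(r["assets"]), "assets")
```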
### get_latest_release

Get the most recent data release.

| PARAMETER | DESCRIPTION |
|---|---|
| `repo` | GitHub repository to query. <br>**TYPE:** `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict or None` | Release info dict (same format as the entries returned by `list_releases`), or None if no release is found. |
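Picking the latest release reduces to a `max` over dates when the dates are ISO-formatted strings, which sort lexicographically in chronological order. A sketch with sample data; `latest_release` is a hypothetical stand-in, not the library function:

```python
releases = [
    {"tag": "v0.1.0-data-2026-04-04", "date": "2026-04-04"},
    {"tag": "v0.2.0-data-2026-05-10", "date": "2026-05-10"},
]

def latest_release(releases):
    # Return the release with the most recent ISO date, or None.
    # ISO dates compare correctly as plain strings.
    if not releases:
        return None
    return max(releases, key=lambda r: r["date"])

print(latest_release(releases)["tag"])  # → v0.2.0-data-2026-05-10
```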
### download_release

```python
download_release(
    tag: Optional[str] = None,
    data_dir: Optional[str | Path] = None,
    repo: str = GITHUB_REPO,
    assets: Optional[list[str]] = None,
    force: bool = False,
) -> Path
```

Download release assets from GitHub and decompress them.

| PARAMETER | DESCRIPTION |
|---|---|
| `tag` | Release tag (e.g. `v0.1.0-data-2026-04-04`). If None, the latest release is auto-detected. <br>**TYPE:** `Optional[str]` **DEFAULT:** `None` |
| `data_dir` | Directory to store downloaded files. Defaults to `~/.papertrail/data/{tag}/`. <br>**TYPE:** `Optional[str \| Path]` **DEFAULT:** `None` |
| `repo` | GitHub repository to download from. <br>**TYPE:** `str` **DEFAULT:** `GITHUB_REPO` |
| `assets` | Specific asset filenames to download. If None, downloads all assets in the release. <br>**TYPE:** `Optional[list[str]]` **DEFAULT:** `None` |
| `force` | If True, re-download even if files already exist locally. <br>**TYPE:** `bool` **DEFAULT:** `False` |

| RETURNS | DESCRIPTION |
|---|---|
| `Path` | Directory containing the downloaded (and decompressed) files. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If no matching release is found. |
| `HTTPError` | If a download fails. |
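The post-download decompression step can be sketched with the standard library alone. This is illustrative, not the library's actual code; per the Storage section, the decompressed file is written next to the original, which is kept:

```python
import gzip
import json
import tempfile
from pathlib import Path

def decompress_json_gz(path: Path) -> Path:
    # Write the decompressed JSON next to the original .json.gz file.
    out = path.with_suffix("")  # strips the trailing .gz
    out.write_bytes(gzip.decompress(path.read_bytes()))
    return out

# Demo on a throwaway file
tmp = Path(tempfile.mkdtemp())
src = tmp / "all_papers_merged.json.gz"
src.write_bytes(gzip.compress(json.dumps([{"title": "A paper"}]).encode()))
out = decompress_json_gz(src)
print(out.name)  # → all_papers_merged.json
```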
### load_papers

```python
load_papers(
    data_dir: Optional[str | Path] = None,
    tag: Optional[str] = None,
    which: str = 'merged',
) -> list[dict[str, Any]]
```

Load papers from a downloaded release.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dir` | Directory containing downloaded data files. If None, uses the default location under `~/.papertrail/data/`. <br>**TYPE:** `Optional[str \| Path]` **DEFAULT:** `None` |
| `tag` | Release tag. Used to locate the data directory if `data_dir` is None. <br>**TYPE:** `Optional[str]` **DEFAULT:** `None` |
| `which` | Which dataset to load: `"merged"`, `"enriched"`, `"final"`, or `"scrapes"`. <br>**TYPE:** `str` **DEFAULT:** `'merged'` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[dict] or dict` | Paper records. For `which="scrapes"`, a dict of `{filename: data}` instead of a list. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If the data directory or expected files don't exist. Suggests running `download_release()` first. |
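The fallback between the enriched assets can be sketched as a lookup table, following the Datasets section (the fully enriched file is preferred over the checkpoint). The filenames mirror the documented assets, but `resolve_dataset` itself is a hypothetical helper:

```python
import tempfile
from pathlib import Path

def resolve_dataset(data_dir: Path, which: str) -> Path:
    # Hypothetical mapping of `which` values to decompressed filenames,
    # following the Datasets table ("scrapes" is handled separately).
    candidates = {
        "merged": ["all_papers_merged.json"],
        # The fully enriched file is preferred over the checkpoint.
        "enriched": ["papers_enriched.json", "enrich_checkpoint.json"],
        "final": ["papers_final.json"],
    }
    for name in candidates.get(which, []):
        path = data_dir / name
        if path.exists():
            return path
    raise FileNotFoundError(
        f"No {which!r} dataset in {data_dir}. Try download_release() first."
    )

# Demo: only the checkpoint exists, so it is used as a fallback.
tmp = Path(tempfile.mkdtemp())
(tmp / "enrich_checkpoint.json").write_text("[]")
fallback = resolve_dataset(tmp, "enriched")
```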
### data_summary

Summarize available data in a release directory.

| PARAMETER | DESCRIPTION |
|---|---|
| `data_dir` | Directory to inspect. <br>**TYPE:** `Optional[str \| Path]` |
| `tag` | Release tag (used to locate the directory if `data_dir` is None). <br>**TYPE:** `Optional[str]` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict` | Summary of the release contents. |
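One plausible shape for such a summary, sketched as a plain directory scan; `summarize_dir` and its keys (`files`, `total_bytes`) are assumptions, not the actual `data_summary` contract:

```python
import tempfile
from pathlib import Path

def summarize_dir(data_dir: Path) -> dict:
    # Hypothetical summary of a release directory: file names and
    # total size. The real data_summary() keys may differ.
    files = sorted(p for p in data_dir.iterdir() if p.is_file())
    return {
        "files": [p.name for p in files],
        "total_bytes": sum(p.stat().st_size for p in files),
    }

tmp = Path(tempfile.mkdtemp())
(tmp / "papers_final.json").write_text("[]")
summary = summarize_dir(tmp)
```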
## Datasets

Each release may contain these assets:

| Asset | `which=` | Description |
|---|---|---|
| `all_papers_merged.json.gz` | `"merged"` | Deduplicated papers from all channels with extracted IDs |
| `enrich_checkpoint.json.gz` | `"enriched"` | Papers with metadata from OpenAlex + Semantic Scholar |
| `papers_enriched.json.gz` | `"enriched"` | Fully enriched papers (preferred over the checkpoint) |
| `papers_final.json.gz` | `"final"` | Papers with embeddings, projections, and clusters |
| `channel_scrapes.tar.gz` | `"scrapes"` | Per-channel raw Slack scrapes (extracted to individual JSON files) |
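The `channel_scrapes.tar.gz` extraction step can be illustrated with `tarfile` from the standard library. The archive layout below is an assumption for demonstration:

```python
import json
import tarfile
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Build a stand-in archive containing one per-channel JSON file.
archive = tmp / "channel_scrapes.tar.gz"
member = tmp / "papers-channel.json"
member.write_text(json.dumps([{"text": "great paper!"}]))
with tarfile.open(archive, "w:gz") as tar:
    tar.add(member, arcname="papers-channel.json")

# Extract the per-channel files next to the archive.
# Only do this with archives from a trusted source.
out_dir = tmp / "scrapes"
out_dir.mkdir()
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(out_dir)

print(sorted(p.name for p in out_dir.iterdir()))  # → ['papers-channel.json']
```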
## Storage

Downloaded files are stored at `~/.papertrail/data/{tag}/`. Each release gets its own subdirectory, so multiple snapshots can coexist.

Compressed files (`.json.gz`, `.tar.gz`) are automatically decompressed after download. The originals are kept alongside the decompressed versions.
## Authentication

Public repositories don't require authentication. For private repos, or to avoid rate limits, set `GITHUB_TOKEN` or `GH_TOKEN`:
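A sketch of how such a token might be picked up and attached to requests; the precedence (`GITHUB_TOKEN` first) and the header format are assumptions, not documented behavior:

```python
import os

def github_auth_headers() -> dict[str, str]:
    # Check both conventional env vars; GITHUB_TOKEN first (an assumption).
    token = os.environ.get("GITHUB_TOKEN") or os.environ.get("GH_TOKEN")
    if not token:
        return {}  # unauthenticated: fine for public repos
    return {"Authorization": f"Bearer {token}"}

# Demo with a dummy token (clear any real one first).
os.environ.pop("GITHUB_TOKEN", None)
os.environ["GH_TOKEN"] = "example-token"
print(github_auth_headers())  # → {'Authorization': 'Bearer example-token'}
```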