Enricher API
The enricher module fetches paper metadata from OpenAlex and Semantic
Scholar. It tries multiple strategies (DOI, arXiv ID, PubMed ID, URL
analysis, title search) and merges results into a standardized
PaperMetadata dataclass.
Quick Example
from papertrail.enricher import PaperEnricher

enricher = PaperEnricher(
    email="you@example.com",  # enables OpenAlex polite pool (~10 req/s)
    openalex_first=True,      # try OpenAlex before Semantic Scholar
)

# Enrich a single paper by DOI
meta = enricher.enrich_by_doi("10.1038/nature12373")
if meta is not None:  # enrich_by_doi returns None when nothing is found
    print(meta.title, meta.authors, meta.year)

# Batch enrich from scraped data
papers = [
    {"url": "https://arxiv.org/abs/2301.04821", "ids": {"arxiv_id": "2301.04821"}},
    {"url": "https://doi.org/10.1103/PhysRevLett.123.021102", "ids": {"doi": "10.1103/PhysRevLett.123.021102"}},
]
enriched = enricher.enrich_papers(papers)
for p in enriched:
    # abstract may be missing or None, so fall back to an empty string
    print(p["title"], (p.get("abstract") or "")[:80])
Data Classes
PaperMetadata (dataclass)
PaperMetadata(title: str, authors: list[str] = None, year: Optional[int] = None, journal: Optional[str] = None, abstract: Optional[str] = None, doi: Optional[str] = None, arxiv_id: Optional[str] = None, pubmed_id: Optional[str] = None, pmc_id: Optional[str] = None, openalex_id: Optional[str] = None, semantic_scholar_id: Optional[str] = None, cited_by_count: Optional[int] = None, institutions: list[str] = None, keywords: list[str] = None, url: Optional[str] = None)
Standardized paper metadata extracted from enrichment APIs.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| title | str | Paper title. |
| authors | list[str] | List of author names. |
| year | Optional[int] | Publication year. |
| journal | Optional[str] | Journal or venue name. |
| abstract | Optional[str] | Paper abstract. |
| doi | Optional[str] | Digital Object Identifier. |
| arxiv_id | Optional[str] | arXiv identifier (e.g., "2301.04821"). |
| pubmed_id | Optional[str] | PubMed ID. |
| pmc_id | Optional[str] | PubMed Central ID. |
| openalex_id | Optional[str] | OpenAlex work ID. |
| semantic_scholar_id | Optional[str] | Semantic Scholar paper ID. |
| institutions | list[str] | List of author institutions. |
| keywords | list[str] | Paper keywords or subjects. |
| url | Optional[str] | Primary paper URL. |
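To illustrate how a result object of this shape can be consumed, here is a minimal sketch using a trimmed stand-in dataclass. `PaperMetadataSketch` and `to_record` are hypothetical names introduced for this example only; the stand-in uses `default_factory` for its list fields, which is an assumption and may differ from how the real `PaperMetadata` handles defaults.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PaperMetadataSketch:
    """Hypothetical stand-in mirroring a few of the documented fields."""

    title: str
    authors: list[str] = field(default_factory=list)  # assumption: empty list default
    year: Optional[int] = None
    doi: Optional[str] = None
    keywords: list[str] = field(default_factory=list)


def to_record(meta: PaperMetadataSketch) -> dict:
    """Convert to a plain dict, dropping fields that were never filled in."""
    return {k: v for k, v in asdict(meta).items() if v not in (None, [], "")}


meta = PaperMetadataSketch(title="An Example Paper", authors=["A. Author"], year=2023)
record = to_record(meta)
```

Dropping empty fields before merging keeps values already present in a scraped record from being overwritten by blanks.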
PaperEnricher
PaperEnricher(cache_size: int = 1000, timeout: int = DEFAULT_TIMEOUT, max_retries: int = MAX_RETRIES, user_agent: str | None = None, email: str | None = None, openalex_first: bool = True)
Comprehensive paper metadata enrichment using multiple APIs and strategies.
Provides methods to extract identifiers from various paper sources and enrich metadata using Semantic Scholar and OpenAlex APIs.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| cache_size | int | Maximum number of API responses to cache (default: 1000). |
| timeout | int | Request timeout in seconds (default: 15). |
| max_retries | int | Maximum number of retries for failed requests (default: 3). |
| user_agent | str \| None | User-Agent header for requests. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| _cache | Simple LRU cache for API responses. |
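The idea behind a bounded LRU response cache can be sketched as follows. This is not the library's implementation: `SimpleLRUCache` and its `get`/`put` methods are hypothetical names, and the only detail taken from the documentation is the size bound corresponding to `cache_size`.

```python
from collections import OrderedDict
from typing import Any, Optional


class SimpleLRUCache:
    """Hypothetical least-recently-used cache keyed by request string."""

    def __init__(self, max_size: int = 1000) -> None:
        self.max_size = max_size
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = SimpleLRUCache(max_size=2)
cache.put("doi:10.1038/nature12373", {"title": "cached response"})
```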
Examples:
Enrich by different identifiers:
>>> enricher = PaperEnricher()
>>> m1 = enricher.enrich_by_doi("10.1038/nature12373")
>>> m2 = enricher.enrich_by_arxiv_id("2301.04821")
>>> m3 = enricher.enrich_by_pmid("22460902")
>>> m4 = enricher.enrich_by_url("https://biorxiv.org/content/...")
Batch enrichment:
>>> papers = [
...     {"url": "https://arxiv.org/abs/2301.04821"},
...     {"doi": "10.1103/PhysRevLett.123.021102"},
... ]
>>> results = enricher.enrich_papers(papers)
Initialize the paper enricher.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| email | str \| None | Contact email for polite pool access on OpenAlex (~10x faster). |
| openalex_first | bool | If True (default), try OpenAlex before Semantic Scholar. OpenAlex has much more generous rate limits (~10 req/s with email). |
enrich_papers
Enrich a batch of papers using multiple strategies.
For each paper, attempts enrichment in order:
1. By DOI if provided
2. By arXiv ID if provided
3. By PubMed ID if provided
4. By URL analysis
5. By title search (if no structured ID found)

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| papers | list[dict] | List of paper dictionaries with keys like: url, doi, arxiv_id, pubmed_id, title, etc. |
| require_title | bool | If True, only include papers with non-empty titles in results. |

| RETURNS | DESCRIPTION |
|---|---|
| list[dict] | List of enriched paper dictionaries. Each includes all original fields plus enriched metadata (title, authors, year, etc.). |
Notes
This function applies exponential backoff between API calls to respect rate limits.
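Exponential backoff in general can be sketched as below. The helper `call_with_backoff`, its parameters, and the small random jitter are assumptions made for illustration; the library's actual delays and retry conditions are not documented here.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, doubling the wait after each failure (hypothetical helper)."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:  # a real client would catch only retryable errors
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # add up to 10% jitter so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.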
enrich_by_doi
enrich_by_doi(doi: str) -> Optional[PaperMetadata]
Enrich a paper by DOI.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| doi | str | Digital Object Identifier (e.g., "10.1038/nature12373"). |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[PaperMetadata] | Enriched metadata, or None if not found. |
enrich_by_arxiv_id
enrich_by_arxiv_id(arxiv_id: str) -> Optional[PaperMetadata]
Enrich a paper by arXiv identifier.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| arxiv_id | str | arXiv ID (e.g., "2301.04821" or "2301.04821v1"). |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[PaperMetadata] | Enriched metadata, or None if not found. |
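Since both versioned and unversioned arXiv IDs are accepted, a caller may want to reduce them to one canonical form before using them as cache or deduplication keys. The sketch below is an assumption about how that could be done; `normalize_arxiv_id` is a hypothetical helper and handles only new-style IDs (`YYMM.NNNNN`), not old-style ones such as `hep-th/9901001`.

```python
import re
from typing import Optional

# New-style arXiv IDs: four digits, a dot, four or five digits, optional version.
_ARXIV_ID = re.compile(r"^(?:arxiv:)?(\d{4}\.\d{4,5})(?:v\d+)?$", re.IGNORECASE)


def normalize_arxiv_id(raw: str) -> Optional[str]:
    """Strip an optional 'arXiv:' prefix and version suffix (hypothetical helper)."""
    match = _ARXIV_ID.match(raw.strip())
    return match.group(1) if match else None
```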
enrich_by_pmid
enrich_by_pmid(pmid: str) -> Optional[PaperMetadata]
Enrich a paper by PubMed ID.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pmid | str | PubMed ID (e.g., "22460902"). |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[PaperMetadata] | Enriched metadata, or None if not found. |
enrich_by_title
enrich_by_title(title: str) -> Optional[PaperMetadata]
Enrich a paper by title search.
Uses title-based search via the Semantic Scholar API. This is less reliable than identifier-based search.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| title | str | Paper title to search for. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[PaperMetadata] | Enriched metadata, or None if not found. |
enrich_by_title_openalex
enrich_by_title_openalex(title: str) -> Optional[PaperMetadata]
Search OpenAlex by title (more reliable than S2 title search).
Uses fuzzy matching to verify that the result matches the query.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| title | str | Paper title to search for. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[PaperMetadata] | Enriched metadata, or None if not found. |
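A fuzzy title check of this kind can be sketched with the standard library's `difflib`. The function names and the 0.9 threshold are assumptions chosen for illustration; the matcher and cutoff the library actually uses are not stated here.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so formatting differences vanish."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(query: str, candidate: str, threshold: float = 0.9) -> bool:
    """Accept a search hit only if its title is close to the query (hypothetical helper)."""
    ratio = SequenceMatcher(None, normalize_title(query), normalize_title(candidate)).ratio()
    return ratio >= threshold
```

Verifying the top hit this way guards against a search API returning a loosely related paper for a title it does not index.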
enrich_by_url
enrich_by_url(url: str) -> Optional[PaperMetadata]
Enrich a paper by analyzing its URL and extracting identifiers.
Attempts to extract a DOI, arXiv ID, or PubMed ID from the URL, then uses it for enrichment.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| url | str | Paper URL. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[PaperMetadata] | Enriched metadata if an identifier is found and enrichment succeeds; otherwise None. |
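Identifier extraction from a URL can be approximated with a few regular expressions. The patterns and the helper name `extract_ids_from_url` below are assumptions for illustration and cover only common arXiv, PubMed, and DOI URL shapes; the library may recognize more sources.

```python
import re
from urllib.parse import unquote

_ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)
_PMID_RE = re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)", re.IGNORECASE)
_DOI_RE = re.compile(r"(10\.\d{4,9}/[^\s?#]+)")


def extract_ids_from_url(url: str) -> dict[str, str]:
    """Return whichever of arxiv_id, pmid, and doi can be read off the URL."""
    url = unquote(url)  # DOIs are sometimes percent-encoded inside URLs
    ids: dict[str, str] = {}
    match = _ARXIV_RE.search(url)
    if match:
        ids["arxiv_id"] = match.group(1)
    match = _PMID_RE.search(url)
    if match:
        ids["pmid"] = match.group(1)
    match = _DOI_RE.search(url)
    if match:
        ids["doi"] = match.group(1).rstrip("/.")
    return ids
```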
Enrichment Strategy Order
When calling enrich_papers(), each paper is tried against these
strategies in sequence. The first one that returns a result wins:
1. DOI → enrich_by_doi() (OpenAlex first if openalex_first=True)
2. arXiv ID → enrich_by_arxiv_id() (Semantic Scholar)
3. PubMed ID → enrich_by_pmid() (PubMed E-utilities)
4. URL analysis → enrich_by_url() (extracts IDs from the URL, then retries the strategies above)
5. Title search (OpenAlex) → enrich_by_title_openalex() (fuzzy match)
6. Title search (S2) → enrich_by_title() (Semantic Scholar search)
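The "first result wins" pattern described above can be sketched as a loop over an ordered list of strategies. Everything in this block is illustrative: `first_successful` and the two stub strategies are hypothetical and stand in for the real enrichment methods, which make network calls.

```python
from typing import Callable, Optional

Strategy = Callable[[dict], Optional[dict]]


def first_successful(
    paper: dict, strategies: list[tuple[str, Strategy]]
) -> tuple[Optional[str], Optional[dict]]:
    """Try each strategy in order and stop at the first non-None result."""
    for name, strategy in strategies:
        result = strategy(paper)
        if result is not None:
            return name, result
    return None, None


# Stub strategies standing in for real lookups (assumptions for the demo).
def by_doi(paper: dict) -> Optional[dict]:
    return {"doi": paper["doi"]} if paper.get("doi") else None


def by_title(paper: dict) -> Optional[dict]:
    return {"title": paper["title"]} if paper.get("title") else None


used, result = first_successful(
    {"title": "An Example Paper"}, [("doi", by_doi), ("title", by_title)]
)
```

Ordering the list from most to least reliable identifier means the weaker title search only runs when every structured lookup has failed.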
Input Format
enrich_papers() accepts a list of dicts. Each dict can have any of:

| Key | Type | Description |
|---|---|---|
| url | str | Paper URL (used for ID extraction and S2 URL lookup) |
| doi | str | Digital Object Identifier |
| arxiv_id | str | arXiv paper ID (e.g. "2301.04821") |
| pubmed_id | str | PubMed ID |
| title | str | Paper title (for title-based search fallback) |
| ids | dict | Nested dict with doi, arxiv_id, pmid keys (from merge step) |
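Because identifiers may appear either at the top level or inside the nested `ids` dict, a pre-processing step can gather them into one place. The sketch below is an assumption about how that might look: `collect_ids` is a hypothetical helper, and giving top-level keys precedence over nested ones is a design choice made here, not documented behavior.

```python
from typing import Optional


def collect_ids(paper: dict) -> dict:
    """Gather doi, arxiv_id, and pubmed_id from top-level keys and the nested ids dict."""
    nested = paper.get("ids") or {}

    def pick(*keys: str) -> Optional[str]:
        # Assumption: top-level values win over nested ones.
        for source in (paper, nested):
            for key in keys:
                value = source.get(key)
                if value:
                    return str(value).strip()
        return None

    found = {
        "doi": pick("doi"),
        "arxiv_id": pick("arxiv_id"),
        "pubmed_id": pick("pubmed_id", "pmid"),  # the nested dict uses "pmid"
    }
    return {key: value for key, value in found.items() if value}
```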
Output: PaperMetadata Fields

| Field | Type | Source |
|---|---|---|
| title | str | All APIs |
| authors | list[str] | All APIs (never truncated) |
| year | int \| None | All APIs |
| journal | str \| None | Primary location / venue |
| abstract | str \| None | S2, OpenAlex (reconstructed from inverted index) |
| doi | str \| None | Normalized (no URL prefix) |
| arxiv_id | str \| None | e.g. "2301.04821" |
| pubmed_id | str \| None | Numeric string |
| pmc_id | str \| None | e.g. "PMC12345" |
| openalex_id | str \| None | e.g. "https://openalex.org/W1234567890" |
| semantic_scholar_id | str \| None | S2 paper ID |
| institutions | list[str] | OpenAlex authorships |
| keywords | list[str] | Subject tags |
| url | str \| None | Primary paper URL |
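Two of the normalizations mentioned in the table can be sketched as follows. OpenAlex returns abstracts as an `abstract_inverted_index` mapping each word to the positions where it occurs; rebuilding the text means sorting words by position. The helper names and the list of DOI prefixes handled are assumptions for illustration, not the library's code.

```python
from typing import Optional


def abstract_from_inverted_index(index: Optional[dict]) -> Optional[str]:
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not index:
        return None
    positioned = [(pos, word) for word, positions in index.items() for pos in positions]
    return " ".join(word for _, word in sorted(positioned))


def normalize_doi(raw: str) -> str:
    """Strip common URL or 'doi:' prefixes, leaving the bare DOI (hypothetical helper)."""
    doi = raw.strip()
    lowered = doi.lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            return doi[len(prefix):]
    return doi
```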
Rate Limiting
| API | Rate | Notes |
|---|---|---|
| OpenAlex (with email) | ~10 req/s | Set email param for polite pool |
| OpenAlex (without email) | ~1 req/s | Much slower |
| Semantic Scholar | ~3 req/s | Aggressive 429s, use 1.5s+ delay |
| PubMed E-utilities | ~3 req/s | NCBI standard limit |
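One simple way to stay under limits like these on the client side is to enforce a minimum interval between calls to each API. The class below is a sketch under that assumption: `MinIntervalThrottle` is a hypothetical name, and the intervals in the usage lines are derived from the table (0.1 s for roughly 10 req/s, 1.5 s for Semantic Scholar) rather than taken from the library.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# One throttle per API; call throttle.wait() right before each request.
throttles = {
    "openalex": MinIntervalThrottle(0.1),
    "semantic_scholar": MinIntervalThrottle(1.5),
}
```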
Backward Compatibility
enrich_paper
Enrich a paper URL with metadata (backward compatibility wrapper).
This is the old functional interface. For new code, use the PaperEnricher class.