Enricher API

The enricher module fetches paper metadata from OpenAlex and Semantic Scholar, with PubMed E-utilities used for PubMed ID lookups. It tries multiple strategies in order (DOI, arXiv ID, PubMed ID, URL analysis, title search) and merges the results into a standardized PaperMetadata dataclass.

Quick Example

from papertrail.enricher import PaperEnricher

enricher = PaperEnricher(
    email="you@example.com",  # enables OpenAlex polite pool (~10 req/s)
    openalex_first=True,       # try OpenAlex before Semantic Scholar
)

# Enrich a single paper by DOI
meta = enricher.enrich_by_doi("10.1038/nature12373")
if meta is not None:  # enrich_by_doi returns None when nothing is found
    print(meta.title, meta.authors, meta.year)

# Batch enrich from scraped data
papers = [
    {"url": "https://arxiv.org/abs/2301.04821", "ids": {"arxiv_id": "2301.04821"}},
    {"url": "https://doi.org/10.1103/PhysRevLett.123.021102", "ids": {"doi": "10.1103/PhysRevLett.123.021102"}},
]
enriched = enricher.enrich_papers(papers)
for p in enriched:
    # Enrichment can fail, so neither key is guaranteed to be present
    print(p.get("title"), (p.get("abstract") or "")[:80])

Data Classes

PaperMetadata `dataclass`

PaperMetadata(title: str, authors: list[str] = None, year: Optional[int] = None, journal: Optional[str] = None, abstract: Optional[str] = None, doi: Optional[str] = None, arxiv_id: Optional[str] = None, pubmed_id: Optional[str] = None, pmc_id: Optional[str] = None, openalex_id: Optional[str] = None, semantic_scholar_id: Optional[str] = None, cited_by_count: Optional[int] = None, institutions: list[str] = None, keywords: list[str] = None, url: Optional[str] = None)

Standardized paper metadata extracted from enrichment APIs.

ATTRIBUTE	DESCRIPTION
`title`	Paper title. TYPE: `str`
`authors`	List of author names. TYPE: `list[str]`
`year`	Publication year. TYPE: `(int, optional)`
`journal`	Journal or venue name. TYPE: `(str, optional)`
`abstract`	Paper abstract. TYPE: `(str, optional)`
`doi`	Digital Object Identifier. TYPE: `(str, optional)`
`arxiv_id`	arXiv identifier (e.g., "2301.04821"). TYPE: `(str, optional)`
`pubmed_id`	PubMed ID. TYPE: `(str, optional)`
`pmc_id`	PubMed Central ID. TYPE: `(str, optional)`
`openalex_id`	OpenAlex work ID. TYPE: `(str, optional)`
`semantic_scholar_id`	Semantic Scholar paper ID. TYPE: `(str, optional)`
`cited_by_count`	Number of citations. TYPE: `(int, optional)`
`institutions`	List of author institutions. TYPE: `list[str]`
`keywords`	Paper keywords or subjects. TYPE: `list[str]`
`url`	Primary paper URL. TYPE: `(str, optional)`

__post_init__

__post_init__()

Initialize list fields.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary representation.

RETURNS	DESCRIPTION
`dict`	Dictionary with all fields, excluding None values.
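
The two methods above can be illustrated with a minimal sketch. `MiniMetadata` is a hypothetical three-field stand-in, not the real PaperMetadata, and keeping empty lists in the `to_dict()` output is an assumption; the source only states that None values are excluded.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class MiniMetadata:
    """Hypothetical stand-in for PaperMetadata with only three fields."""

    title: str
    authors: Optional[list[str]] = None
    year: Optional[int] = None

    def __post_init__(self) -> None:
        # A mutable default cannot be shared safely between instances,
        # so None is replaced with a fresh list here.
        if self.authors is None:
            self.authors = []

    def to_dict(self) -> dict[str, Any]:
        # Keep every field whose value is not None (empty lists are kept;
        # that choice is an assumption of this sketch).
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


meta = MiniMetadata(title="Example paper")
print(meta.to_dict())
```

This is why the signature can declare `authors: list[str] = None` while callers still receive a list.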

PaperEnricher

PaperEnricher

PaperEnricher(cache_size: int = 1000, timeout: int = DEFAULT_TIMEOUT, max_retries: int = MAX_RETRIES, user_agent: str | None = None, email: str | None = None, openalex_first: bool = True)

Comprehensive paper metadata enrichment using multiple APIs and strategies.

Provides methods to extract identifiers from various paper sources and enrich metadata using Semantic Scholar and OpenAlex APIs.

PARAMETER	DESCRIPTION
`cache_size`	Maximum number of API responses to cache (default: 1000). TYPE: `int` DEFAULT: `1000`
`timeout`	Request timeout in seconds (default: 15). TYPE: `int` DEFAULT: `DEFAULT_TIMEOUT`
`max_retries`	Maximum number of retries for failed requests (default: 3). TYPE: `int` DEFAULT: `MAX_RETRIES`
`user_agent`	User-Agent header for requests. TYPE: `str` DEFAULT: `None`

ATTRIBUTE	DESCRIPTION
`_cache`	Simple LRU cache for API responses. TYPE: `dict`
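
For readers unfamiliar with the idea, a size-bounded LRU cache like the one `cache_size` controls could look roughly like this. `SimpleLRUCache` and its `get`/`put` methods are hypothetical names for illustration; the actual `_cache` is documented only as a `dict` and may work differently.

```python
from collections import OrderedDict
from typing import Any, Optional


class SimpleLRUCache:
    """Hypothetical bounded cache; not the library's actual _cache."""

    def __init__(self, max_size: int = 1000) -> None:
        self.max_size = max_size
        self._data = OrderedDict()  # oldest entry first, newest last

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used


cache = SimpleLRUCache(max_size=2)
cache.put("doi:10.1038/nature12373", {"title": "cached response"})
print(cache.get("doi:10.1038/nature12373"))
```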

Examples:

Enrich by different identifiers:

>>> enricher = PaperEnricher()
>>> m1 = enricher.enrich_by_doi("10.1038/nature12373")
>>> m2 = enricher.enrich_by_arxiv_id("2301.04821")
>>> m3 = enricher.enrich_by_pmid("22460902")
>>> m4 = enricher.enrich_by_url("https://biorxiv.org/content/...")

Batch enrichment:

>>> papers = [
...     {"url": "https://arxiv.org/abs/2301.04821"},
...     {"doi": "10.1103/PhysRevLett.123.021102"},
... ]
>>> results = enricher.enrich_papers(papers)

Initialize the paper enricher.

PARAMETER	DESCRIPTION
`email`	Contact email that enables OpenAlex polite pool access (~10 req/s instead of ~1 req/s). TYPE: `str` DEFAULT: `None`
`openalex_first`	If True (default), try OpenAlex before Semantic Scholar. OpenAlex has much more generous rate limits (~10 req/s with email). TYPE: `bool` DEFAULT: `True`

enrich_papers

enrich_papers(papers: list[dict[str, Any]], require_title: bool = False) -> list[dict[str, Any]]

Enrich a batch of papers using multiple strategies.

For each paper, attempts enrichment in order:

1. By DOI, if provided
2. By arXiv ID, if provided
3. By PubMed ID, if provided
4. By URL analysis
5. By title search (if no structured ID is found)

PARAMETER	DESCRIPTION
`papers`	List of paper dictionaries with keys like: url, doi, arxiv_id, pubmed_id, title, etc. TYPE: `list[dict]`
`require_title`	If True, only include papers with non-empty titles in results. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`list[dict]`	List of enriched paper dictionaries. Each includes all original fields plus enriched metadata (title, authors, year, etc.).
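
A rough sketch of how "original fields plus enriched metadata" and `require_title` might combine. `merge_enrichment` is a hypothetical helper, and the precedence rule (non-empty enriched values override originals) is an assumption, not documented behavior.

```python
from typing import Any, Optional


def merge_enrichment(
    original: dict[str, Any],
    enriched: Optional[dict[str, Any]],
    require_title: bool = False,
) -> Optional[dict[str, Any]]:
    """Combine an input paper dict with enrichment output (hypothetical helper)."""
    merged = dict(original)  # never mutate the caller's dict
    if enriched:
        # Assumption: only non-empty enriched values override the originals.
        merged.update({k: v for k, v in enriched.items() if v not in (None, "", [])})
    if require_title and not merged.get("title"):
        return None  # the caller would drop this paper from the results
    return merged


print(merge_enrichment({"url": "https://arxiv.org/abs/2301.04821"}, {"year": 2023}))
```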

Notes

This method applies exponential backoff between API calls to respect rate limits.
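
As an illustration of exponential backoff in general, the sketch below computes the wait before each retry. The `base`, `factor`, `cap`, and `jitter` parameters are assumptions for this example; the source does not state the values or formula the library uses.

```python
import random


def backoff_delays(
    retries: int = 3,
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    jitter: float = 0.0,
) -> list[float]:
    """Return the wait before each retry: base, base*factor, ... capped at cap."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        # Optional random jitter spreads out retries from concurrent clients.
        delay += random.uniform(0.0, jitter)
        delays.append(delay)
    return delays


print(backoff_delays())  # [1.0, 2.0, 4.0]
```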

enrich_by_doi

enrich_by_doi(doi: str) -> Optional[PaperMetadata]

Enrich a paper by DOI.

PARAMETER	DESCRIPTION
`doi`	Digital Object Identifier (e.g., "10.1038/nature12373"). TYPE: `str`

RETURNS	DESCRIPTION
`(PaperMetadata, optional)`	Enriched metadata, or None if not found.
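
DOIs often arrive as full URLs or with a `doi:` prefix. The sketch below shows one way to reduce them to the bare form. `normalize_doi` is a hypothetical helper and the prefix list is illustrative; the library's own normalization may differ.

```python
import re

# Prefixes commonly seen in front of a bare DOI (illustrative, not exhaustive).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip surrounding whitespace and a leading URL or `doi:` prefix."""
    return _DOI_PREFIX.sub("", raw.strip())


print(normalize_doi("https://doi.org/10.1038/nature12373"))  # 10.1038/nature12373
```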

enrich_by_arxiv_id

enrich_by_arxiv_id(arxiv_id: str) -> Optional[PaperMetadata]

Enrich a paper by arXiv identifier.

PARAMETER	DESCRIPTION
`arxiv_id`	arXiv ID (e.g., "2301.04821" or "2301.04821v1"). TYPE: `str`

RETURNS	DESCRIPTION
`(PaperMetadata, optional)`	Enriched metadata, or None if not found.
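
Both the versioned and unversioned forms are accepted. As a sketch of how the two relate, the hypothetical helper below splits a new-style arXiv ID into its base ID and optional version suffix. Whether the library strips the version before lookup is not stated in the source.

```python
import re
from typing import Optional

# New-style arXiv IDs: YYMM.NNNNN with an optional vN suffix.
_ARXIV_NEW_STYLE = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


def split_arxiv_id(arxiv_id: str) -> tuple[str, Optional[str]]:
    """Return (base_id, version), where version is None if absent."""
    match = _ARXIV_NEW_STYLE.match(arxiv_id.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    return match.group(1), match.group(2)


print(split_arxiv_id("2301.04821v1"))  # ('2301.04821', 'v1')
```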

enrich_by_pmid

enrich_by_pmid(pmid: str) -> Optional[PaperMetadata]

Enrich a paper by PubMed ID.

PARAMETER	DESCRIPTION
`pmid`	PubMed ID (e.g., "22460902"). TYPE: `str`

RETURNS	DESCRIPTION
`(PaperMetadata, optional)`	Enriched metadata, or None if not found.

enrich_by_title

enrich_by_title(title: str) -> Optional[PaperMetadata]

Enrich a paper by title search.

Uses title-based search via the Semantic Scholar API, which is less reliable than identifier-based lookup.

PARAMETER	DESCRIPTION
`title`	Paper title to search for. TYPE: `str`

RETURNS	DESCRIPTION
`(PaperMetadata, optional)`	Enriched metadata, or None if not found.

enrich_by_title_openalex

enrich_by_title_openalex(title: str) -> Optional[PaperMetadata]

Search OpenAlex by title (more reliable than S2 title search).

Uses fuzzy matching to verify the result matches the query.

PARAMETER	DESCRIPTION
`title`	Paper title to search for. TYPE: `str`

RETURNS	DESCRIPTION
`(PaperMetadata, optional)`	Enriched metadata, or None if not found.
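
One plausible way to implement the fuzzy verification step uses the standard library's `difflib`. The `titles_match` helper, its normalization, and the 0.9 threshold are assumptions for illustration; the source does not say which algorithm or cutoff the library uses.

```python
from difflib import SequenceMatcher


def _normalize(title: str) -> str:
    # Lowercase and collapse whitespace so formatting differences do not count.
    return " ".join(title.lower().split())


def titles_match(query: str, candidate: str, threshold: float = 0.9) -> bool:
    """Return True when the two titles are similar enough to accept the hit."""
    ratio = SequenceMatcher(None, _normalize(query), _normalize(candidate)).ratio()
    return ratio >= threshold


print(titles_match("Attention Is All You Need", "attention is  all you need"))
```

A check like this guards against a search API returning its top hit for a query that has no true match.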

enrich_by_url

enrich_by_url(url: str) -> Optional[PaperMetadata]

Enrich a paper by analyzing its URL and extracting identifiers.

Attempts to extract DOI, arXiv ID, or PubMed ID from the URL, then uses those for enrichment.

PARAMETER	DESCRIPTION
`url`	Paper URL. TYPE: `str`

RETURNS	DESCRIPTION
`(PaperMetadata, optional)`	Enriched metadata if an identifier is found and enrichment succeeds; otherwise None.
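
The identifier-extraction step can be sketched with a few regular expressions. The patterns and the `extract_ids` helper are illustrative assumptions covering only doi.org, arxiv.org, and pubmed.ncbi.nlm.nih.gov URLs; the library's actual URL handling is likely broader.

```python
import re

# Illustrative patterns only; real-world URLs come in many more shapes.
_PATTERNS = {
    "doi": re.compile(r"doi\.org/(10\.\d{4,9}/\S+)", re.IGNORECASE),
    "arxiv_id": re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE),
    "pubmed_id": re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)", re.IGNORECASE),
}


def extract_ids(url: str) -> dict[str, str]:
    """Return whichever identifiers can be read directly from the URL."""
    found = {}
    for name, pattern in _PATTERNS.items():
        match = pattern.search(url)
        if match:
            found[name] = match.group(1)
    return found


print(extract_ids("https://arxiv.org/abs/2301.04821"))  # {'arxiv_id': '2301.04821'}
```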

Enrichment Strategy Order

When calling enrich_papers(), each paper is tried against these strategies in sequence. The first one that returns a result wins:

1. DOI → enrich_by_doi() (OpenAlex first if openalex_first=True)
2. arXiv ID → enrich_by_arxiv_id() (Semantic Scholar)
3. PubMed ID → enrich_by_pmid() (PubMed E-utilities)
4. URL analysis → enrich_by_url() (extracts IDs from URL, then retries above)
5. Title search (OpenAlex) → enrich_by_title_openalex() (fuzzy match)
6. Title search (S2) → enrich_by_title() (Semantic Scholar search)
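
The "first result wins" rule above can be sketched as a simple chain. The strategy functions here are stubs that make no network calls, and `first_success` is a hypothetical name; the sketch shows only the control flow, not the library's implementation.

```python
from typing import Any, Callable, Optional

Paper = dict[str, Any]
Strategy = Callable[[Paper], Optional[Paper]]


def first_success(
    paper: Paper, strategies: list[tuple[str, Strategy]]
) -> Optional[tuple[str, Paper]]:
    """Run strategies in order; return (name, result) for the first non-None result."""
    for name, strategy in strategies:
        result = strategy(paper)
        if result is not None:
            return name, result
    return None


# Stub strategies standing in for real API lookups.
def by_doi(paper: Paper) -> Optional[Paper]:
    return {"title": "Found via DOI"} if paper.get("doi") else None


def by_title(paper: Paper) -> Optional[Paper]:
    return {"title": paper["title"]} if paper.get("title") else None


STRATEGIES = [("doi", by_doi), ("title", by_title)]
print(first_success({"title": "Some paper"}, STRATEGIES))
```

Ordering the list from most to least reliable means the less precise title searches run only when every identifier-based lookup has failed.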

Input Format

enrich_papers() accepts a list of dicts. Each dict can have any of:

Key	Type	Description
`url`	`str`	Paper URL (used for ID extraction and S2 URL lookup)
`doi`	`str`	Digital Object Identifier
`arxiv_id`	`str`	arXiv paper ID (e.g. `"2301.04821"`)
`pubmed_id`	`str`	PubMed ID
`title`	`str`	Paper title (for title-based search fallback)
`ids`	`dict`	Nested dict with `doi`, `arxiv_id`, `pmid` keys (from merge step)
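
Because identifiers may appear either at the top level or under `ids`, a consumer typically flattens them first. The sketch below is a hypothetical helper; the key mapping (`pmid` to `pubmed_id`) and the rule that top-level values take precedence are assumptions.

```python
from typing import Any

# Assumed mapping from nested `ids` keys to the top-level key names.
_NESTED_TO_TOP = {"doi": "doi", "arxiv_id": "arxiv_id", "pmid": "pubmed_id"}


def collect_identifiers(paper: dict[str, Any]) -> dict[str, str]:
    """Gather identifiers from top-level keys and the nested `ids` dict."""
    ids = {}
    nested = paper.get("ids") or {}
    for nested_key, top_key in _NESTED_TO_TOP.items():
        # Assumption: a top-level value takes precedence over a nested one.
        value = paper.get(top_key) or nested.get(nested_key)
        if value:
            ids[top_key] = str(value)
    return ids


paper = {"url": "https://arxiv.org/abs/2301.04821", "ids": {"arxiv_id": "2301.04821"}}
print(collect_identifiers(paper))  # {'arxiv_id': '2301.04821'}
```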

Output: PaperMetadata Fields

Field	Type	Source
`title`	`str`	All APIs
`authors`	`list[str]`	All APIs (never truncated)
`year`	`int | None`	All APIs
`journal`	`str | None`	Primary location / venue
`abstract`	`str | None`	S2, OpenAlex (reconstructed from inverted index)
`doi`	`str | None`	Normalized (no URL prefix)
`arxiv_id`	`str | None`	e.g. `"2301.04821"`
`pubmed_id`	`str | None`	Numeric string
`pmc_id`	`str | None`	e.g. `"PMC12345"`
`openalex_id`	`str | None`	e.g. `"https://openalex.org/W1234567890"`
`semantic_scholar_id`	`str | None`	S2 paper ID
`institutions`	`list[str]`	OpenAlex authorships
`keywords`	`list[str]`	Subject tags
`url`	`str | None`	Primary paper URL
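
OpenAlex does not return abstracts as plain text; it provides an `abstract_inverted_index` that maps each word to the positions where it occurs. A minimal sketch of the reconstruction follows. The helper name is hypothetical, and the library's own version may handle edge cases differently.

```python
def abstract_from_inverted_index(index: dict[str, list[int]]) -> str:
    """Rebuild plain text from a {word: [positions]} mapping."""
    # Flatten to (position, word) pairs, then sort by position.
    positioned = [(pos, word) for word, positions in index.items() for pos in positions]
    return " ".join(word for _, word in sorted(positioned))


index = {"Deep": [0], "learning": [1, 4], "for": [2], "deep": [3]}
print(abstract_from_inverted_index(index))  # Deep learning for deep learning
```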

Rate Limiting

API	Rate	Notes
OpenAlex (with email)	~10 req/s	Set `email` param for polite pool
OpenAlex (without email)	~1 req/s	Much slower
Semantic Scholar	~3 req/s	Returns HTTP 429 aggressively; use a delay of at least 1.5 s between requests
PubMed E-utilities	~3 req/s	NCBI standard limit
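
For code that calls these APIs directly, a minimum-interval limiter is one simple way to stay under the limits in the table. `MinIntervalLimiter` is a hypothetical class, not part of the library; the clock and sleep functions are injectable so the behavior can be tested without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive calls (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until min_interval has passed since the last call; return the wait."""
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last = now
        return waited


# Intervals derived from the table: 1.5 s for Semantic Scholar,
# 0.1 s for OpenAlex with an email (~10 req/s).
s2_limiter = MinIntervalLimiter(1.5)
openalex_limiter = MinIntervalLimiter(0.1)
```

Calling `s2_limiter.wait()` before each Semantic Scholar request would space requests at least 1.5 s apart.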

Backward Compatibility

enrich_paper

enrich_paper(url: str, timeout: int = DEFAULT_TIMEOUT) -> dict[str, Any]

Enrich a paper URL with metadata (backward compatibility wrapper).

This is the old functional interface. For new code, use the PaperEnricher class.