Enricher API

The enricher module fetches paper metadata from OpenAlex and Semantic Scholar. It tries multiple strategies (DOI, arXiv ID, PubMed ID, URL analysis, title search) and merges results into a standardized PaperMetadata dataclass.

Quick Example

from papertrail.enricher import PaperEnricher

enricher = PaperEnricher(
    email="you@example.com",  # enables OpenAlex polite pool (~10 req/s)
    openalex_first=True,       # try OpenAlex before Semantic Scholar
)

# Enrich a single paper by DOI
meta = enricher.enrich_by_doi("10.1038/nature12373")
print(meta.title, meta.authors, meta.year)

# Batch enrich from scraped data
papers = [
    {"url": "https://arxiv.org/abs/2301.04821", "ids": {"arxiv_id": "2301.04821"}},
    {"url": "https://doi.org/10.1103/PhysRevLett.123.021102", "ids": {"doi": "10.1103/PhysRevLett.123.021102"}},
]
enriched = enricher.enrich_papers(papers)
for p in enriched:
    # Papers that could not be enriched may lack "title" or "abstract"
    print(p.get("title", "<no title>"), (p.get("abstract") or "")[:80])

Data Classes

PaperMetadata dataclass

PaperMetadata(title: str, authors: list[str] = None, year: Optional[int] = None, journal: Optional[str] = None, abstract: Optional[str] = None, doi: Optional[str] = None, arxiv_id: Optional[str] = None, pubmed_id: Optional[str] = None, pmc_id: Optional[str] = None, openalex_id: Optional[str] = None, semantic_scholar_id: Optional[str] = None, cited_by_count: Optional[int] = None, institutions: list[str] = None, keywords: list[str] = None, url: Optional[str] = None)

Standardized paper metadata extracted from enrichment APIs.

ATTRIBUTE DESCRIPTION
title

Paper title.

TYPE: str

authors

List of author names.

TYPE: list[str]

year

Publication year.

TYPE: (int, optional)

journal

Journal or venue name.

TYPE: (str, optional)

abstract

Paper abstract.

TYPE: (str, optional)

doi

Digital Object Identifier.

TYPE: (str, optional)

arxiv_id

arXiv identifier (e.g., "2301.04821").

TYPE: (str, optional)

pubmed_id

PubMed ID.

TYPE: (str, optional)

pmc_id

PubMed Central ID.

TYPE: (str, optional)

openalex_id

OpenAlex work ID.

TYPE: (str, optional)

semantic_scholar_id

Semantic Scholar paper ID.

TYPE: (str, optional)

cited_by_count

Number of works citing this paper.

TYPE: (int, optional)

institutions

List of author institutions.

TYPE: list[str]

keywords

Paper keywords or subjects.

TYPE: list[str]

url

Primary paper URL.

TYPE: (str, optional)

__post_init__

__post_init__()

Initialize list fields.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary representation.

RETURNS DESCRIPTION
dict

Dictionary with all fields, excluding None values.
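
The behavior described for __post_init__ and to_dict can be sketched with a reduced dataclass. MiniMetadata is a hypothetical stand-in with only four fields, not the library's class; it only illustrates how None list defaults might be replaced with fresh lists and how None values might be dropped on export.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class MiniMetadata:
    """Hypothetical, reduced stand-in for PaperMetadata."""

    title: str
    authors: Optional[list[str]] = None
    year: Optional[int] = None
    doi: Optional[str] = None

    def __post_init__(self) -> None:
        # A mutable default such as [] cannot be declared directly on a
        # dataclass field, so None is swapped for a fresh list here.
        if self.authors is None:
            self.authors = []

    def to_dict(self) -> dict[str, Any]:
        # Keep every field whose value is not None; empty lists survive.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


meta = MiniMetadata(title="Example paper", year=2023)
print(meta.to_dict())
```

Because doi is None in this example, it does not appear in the resulting dictionary, while the empty authors list does.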

PaperEnricher

PaperEnricher(cache_size: int = 1000, timeout: int = DEFAULT_TIMEOUT, max_retries: int = MAX_RETRIES, user_agent: str | None = None, email: str | None = None, openalex_first: bool = True)

Comprehensive paper metadata enrichment using multiple APIs and strategies.

Provides methods to extract identifiers from various paper sources and enrich metadata using Semantic Scholar and OpenAlex APIs.

PARAMETER DESCRIPTION
cache_size

Maximum number of API responses to cache (default: 1000).

TYPE: int DEFAULT: 1000

timeout

Request timeout in seconds (default: 15).

TYPE: int DEFAULT: DEFAULT_TIMEOUT

max_retries

Maximum number of retries for failed requests (default: 3).

TYPE: int DEFAULT: MAX_RETRIES

user_agent

User-Agent header for requests.

TYPE: str DEFAULT: None

ATTRIBUTE DESCRIPTION
_cache

Simple LRU cache for API responses.

TYPE: dict
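
As an illustration of what a size-bounded LRU cache for API responses might look like, here is a minimal sketch built on collections.OrderedDict. SimpleLRUCache and its get/put methods are hypothetical names; the actual _cache attribute is documented only as a dict, so its internals may differ.

```python
from collections import OrderedDict
from typing import Any, Optional


class SimpleLRUCache:
    """Hypothetical bounded cache that evicts the least recently used key."""

    def __init__(self, max_size: int = 1000) -> None:
        self.max_size = max_size
        self._data: OrderedDict[str, Any] = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        # Mark the key as most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        # Drop the oldest entry once the size limit is exceeded.
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)


cache = SimpleLRUCache(max_size=2)
cache.put("doi:10.1038/nature12373", {"title": "cached response"})
print(cache.get("doi:10.1038/nature12373"))
```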

Examples:

Enrich by different identifiers:

>>> enricher = PaperEnricher()
>>> m1 = enricher.enrich_by_doi("10.1038/nature12373")
>>> m2 = enricher.enrich_by_arxiv_id("2301.04821")
>>> m3 = enricher.enrich_by_pmid("22460902")
>>> m4 = enricher.enrich_by_url("https://biorxiv.org/content/...")

Batch enrichment:

>>> papers = [
...     {"url": "https://arxiv.org/abs/2301.04821"},
...     {"doi": "10.1103/PhysRevLett.123.021102"},
... ]
>>> results = enricher.enrich_papers(papers)

Initialize the paper enricher.

PARAMETER DESCRIPTION
email

Contact email for polite pool access on OpenAlex (~10x faster).

TYPE: str DEFAULT: None

openalex_first

If True (default), try OpenAlex before Semantic Scholar. OpenAlex has much more generous rate limits (~10 req/s with email).

TYPE: bool DEFAULT: True
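
For context, OpenAlex identifies polite-pool traffic by a mailto query parameter (or an email in the User-Agent header). The sketch below only builds a lookup URL and performs no network request; openalex_doi_url is a hypothetical helper, and how PaperEnricher actually passes the email is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def openalex_doi_url(doi: str, email: Optional[str] = None) -> str:
    """Build a single-work lookup URL (hypothetical helper)."""
    url = f"{OPENALEX_WORKS}/doi:{doi}"
    if email:
        # The mailto parameter opts the request into the polite pool.
        url += "?" + urlencode({"mailto": email})
    return url


print(openalex_doi_url("10.1038/nature12373", "you@example.com"))
```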

enrich_papers

enrich_papers(papers: list[dict[str, Any]], require_title: bool = False) -> list[dict[str, Any]]

Enrich a batch of papers using multiple strategies.

For each paper, attempts enrichment in order:

  1. By DOI if provided
  2. By arXiv ID if provided
  3. By PubMed ID if provided
  4. By URL analysis
  5. By title search (if no structured ID found)

PARAMETER DESCRIPTION
papers

List of paper dictionaries with keys like: url, doi, arxiv_id, pubmed_id, title, etc.

TYPE: list[dict]

require_title

If True, only include papers with non-empty titles in results.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
list[dict]

List of enriched paper dictionaries. Each includes all original fields plus enriched metadata (title, authors, year, etc.).

Notes

This method applies exponential backoff between API calls to respect rate limits.
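
A minimal sketch of exponential backoff, assuming delays that double after each failed attempt. The helper names, the base delay of 1 second, the 30 second cap, and the choice of ConnectionError as the retryable error are all assumptions for illustration, not the library's actual values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 3, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2**attempt) for attempt in range(max_retries)]


def call_with_backoff(
    fetch: Callable[[], T],
    max_retries: int = 3,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, retrying on ConnectionError with growing delays."""
    delays = backoff_delays(max_retries, base)
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")


print(backoff_delays(3))
```

Passing sleep as a parameter keeps the retry logic testable without real waiting.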

enrich_by_doi

enrich_by_doi(doi: str) -> Optional[PaperMetadata]

Enrich a paper by DOI.

PARAMETER DESCRIPTION
doi

Digital Object Identifier (e.g., "10.1038/nature12373").

TYPE: str

RETURNS DESCRIPTION
(PaperMetadata, optional)

Enriched metadata, or None if not found.

enrich_by_arxiv_id

enrich_by_arxiv_id(arxiv_id: str) -> Optional[PaperMetadata]

Enrich a paper by arXiv identifier.

PARAMETER DESCRIPTION
arxiv_id

arXiv ID (e.g., "2301.04821" or "2301.04821v1").

TYPE: str

RETURNS DESCRIPTION
(PaperMetadata, optional)

Enriched metadata, or None if not found.

enrich_by_pmid

enrich_by_pmid(pmid: str) -> Optional[PaperMetadata]

Enrich a paper by PubMed ID.

PARAMETER DESCRIPTION
pmid

PubMed ID (e.g., "22460902").

TYPE: str

RETURNS DESCRIPTION
(PaperMetadata, optional)

Enriched metadata, or None if not found.

enrich_by_title

enrich_by_title(title: str) -> Optional[PaperMetadata]

Enrich a paper by title search.

Uses title-based search via Semantic Scholar API. This is less reliable than identifier-based search.

PARAMETER DESCRIPTION
title

Paper title to search for.

TYPE: str

RETURNS DESCRIPTION
(PaperMetadata, optional)

Enriched metadata, or None if not found.

enrich_by_title_openalex

enrich_by_title_openalex(title: str) -> Optional[PaperMetadata]

Search OpenAlex by title (more reliable than S2 title search).

Uses fuzzy matching to verify the result matches the query.

PARAMETER DESCRIPTION
title

Paper title to search for.

TYPE: str

RETURNS DESCRIPTION
(PaperMetadata, optional)

Enriched metadata, or None if not found.
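
One way to verify that a search hit matches the queried title is a similarity ratio over normalized strings. This sketch uses difflib from the standard library; the helper names and the 0.9 threshold are assumptions, and the library's matcher may work differently.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def titles_match(query: str, candidate: str, threshold: float = 0.9) -> bool:
    """Return True when the two titles are near-identical after normalization."""
    ratio = SequenceMatcher(
        None, normalize_title(query), normalize_title(candidate)
    ).ratio()
    return ratio >= threshold


print(titles_match("Attention Is All You Need", "Attention is all you need."))
```

A strict threshold trades recall for precision: a wrong paper attached to a record is usually worse than no enrichment at all.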

enrich_by_url

enrich_by_url(url: str) -> Optional[PaperMetadata]

Enrich a paper by analyzing its URL and extracting identifiers.

Attempts to extract DOI, arXiv ID, or PubMed ID from the URL, then uses those for enrichment.

PARAMETER DESCRIPTION
url

Paper URL.

TYPE: str

RETURNS DESCRIPTION
(PaperMetadata, optional)

Enriched metadata, or None if no identifier could be extracted or enrichment failed.
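
The identifier extraction step can be sketched with regular expressions. The patterns below are simplified assumptions (new-style arXiv IDs only, pubmed.ncbi.nlm.nih.gov URLs only) and extract_ids is a hypothetical helper, not the library's implementation.

```python
import re

# Simplified patterns; real-world URLs have more variants than these cover.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+")
ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)
PMID_RE = re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)", re.IGNORECASE)


def extract_ids(url: str) -> dict[str, str]:
    """Return whichever of doi / arxiv_id / pubmed_id can be read from the URL."""
    ids: dict[str, str] = {}
    match = ARXIV_RE.search(url)
    if match:
        ids["arxiv_id"] = match.group(1)  # version suffix is dropped
    match = PMID_RE.search(url)
    if match:
        ids["pubmed_id"] = match.group(1)
    match = DOI_RE.search(url)
    if match:
        ids["doi"] = match.group(0).rstrip(".,;)")
    return ids


print(extract_ids("https://doi.org/10.1103/PhysRevLett.123.021102"))
```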

Enrichment Strategy Order

When calling enrich_papers(), each paper is tried against these strategies in sequence. The first one that returns a result wins:

  1. DOI: enrich_by_doi() (OpenAlex first if openalex_first=True)
  2. arXiv ID: enrich_by_arxiv_id() (Semantic Scholar)
  3. PubMed ID: enrich_by_pmid() (PubMed E-utilities)
  4. URL analysis: enrich_by_url() (extracts IDs from URL, then retries above)
  5. Title search (OpenAlex): enrich_by_title_openalex() (fuzzy match)
  6. Title search (S2): enrich_by_title() (Semantic Scholar search)
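
The "first result wins" ordering above can be sketched as a chain of callables tried in sequence. The stub strategies here return canned dictionaries and stand in for the real enrich_by_* methods; first_result is a hypothetical name.

```python
from typing import Any, Callable, Optional

Paper = dict[str, Any]
Strategy = Callable[[Paper], Optional[Paper]]


def first_result(paper: Paper, strategies: list[Strategy]) -> Optional[Paper]:
    """Try each strategy in order and return the first non-None result."""
    for strategy in strategies:
        result = strategy(paper)
        if result is not None:
            return result
    return None


# Stub strategies (hypothetical): each succeeds only if its key is present.
def by_doi(paper: Paper) -> Optional[Paper]:
    return {"title": "Found via DOI", "source": "doi"} if paper.get("doi") else None


def by_title(paper: Paper) -> Optional[Paper]:
    return {"title": paper["title"], "source": "title"} if paper.get("title") else None


print(first_result({"title": "Some paper"}, [by_doi, by_title]))
```

Ordering the list from most to least reliable identifier means the title search only runs when nothing structured is available.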

Input Format

enrich_papers() accepts a list of dicts. Each dict can have any of:

| Key | Type | Description |
| --- | --- | --- |
| url | str | Paper URL (used for ID extraction and S2 URL lookup) |
| doi | str | Digital Object Identifier |
| arxiv_id | str | arXiv paper ID (e.g. "2301.04821") |
| pubmed_id | str | PubMed ID |
| title | str | Paper title (for title-based search fallback) |
| ids | dict | Nested dict with doi, arxiv_id, pmid keys (from merge step) |
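
Since identifiers may appear either at the top level or inside the nested ids dict, a normalization step along these lines is one plausible approach. collect_ids is a hypothetical helper, and giving top-level keys precedence over nested ones is an assumption.

```python
from typing import Any

# Canonical key -> accepted aliases (pmid is the nested spelling listed above).
ID_ALIASES = {
    "doi": ("doi",),
    "arxiv_id": ("arxiv_id",),
    "pubmed_id": ("pubmed_id", "pmid"),
}


def collect_ids(paper: dict[str, Any]) -> dict[str, str]:
    """Merge flat and nested identifiers into one dict with canonical keys."""
    nested = paper.get("ids") or {}
    ids: dict[str, str] = {}
    for canonical, aliases in ID_ALIASES.items():
        for alias in aliases:
            # Assumption: a top-level value wins over a nested one.
            value = paper.get(alias) or nested.get(alias)
            if value:
                ids[canonical] = str(value)
                break
    return ids


print(collect_ids({"url": "https://arxiv.org/abs/2301.04821", "ids": {"arxiv_id": "2301.04821"}}))
```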

Output: PaperMetadata Fields

| Field | Type | Source |
| --- | --- | --- |
| title | str | All APIs |
| authors | list[str] | All APIs (never truncated) |
| year | int \| None | All APIs |
| journal | str \| None | Primary location / venue |
| abstract | str \| None | S2, OpenAlex (reconstructed from inverted index) |
| doi | str \| None | Normalized (no URL prefix) |
| arxiv_id | str \| None | e.g. "2301.04821" |
| pubmed_id | str \| None | Numeric string |
| pmc_id | str \| None | e.g. "PMC12345" |
| openalex_id | str \| None | e.g. "https://openalex.org/W1234567890" |
| semantic_scholar_id | str \| None | S2 paper ID |
| institutions | list[str] | OpenAlex authorships |
| keywords | list[str] | Subject tags |
| url | str \| None | Primary paper URL |
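
Two of the rows above involve a transformation worth illustrating. OpenAlex returns abstracts as an inverted index (each word mapped to its positions), and DOIs often arrive with a resolver prefix. The helpers below are hypothetical sketches of both steps, not the library's code, and the prefix list is an assumption.

```python
def abstract_from_inverted_index(index: dict[str, list[int]]) -> str:
    """Rebuild plain text from a word -> positions mapping."""
    positions = [(pos, word) for word, places in index.items() for pos in places]
    return " ".join(word for _, word in sorted(positions))


DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Strip a resolver or scheme prefix, leaving the bare DOI."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            return doi[len(prefix):]
    return doi


index = {"Deep": [0], "learning": [1, 4], "beats": [2], "shallow": [3]}
print(abstract_from_inverted_index(index))
print(normalize_doi("https://doi.org/10.1038/nature12373"))
```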

Rate Limiting

| API | Rate | Notes |
| --- | --- | --- |
| OpenAlex (with email) | ~10 req/s | Set email param for polite pool |
| OpenAlex (without email) | ~1 req/s | Much slower |
| Semantic Scholar | ~3 req/s | Aggressive 429s, use 1.5s+ delay |
| PubMed E-utilities | ~3 req/s | NCBI standard limit |
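
A client-side limiter that enforces a minimum interval between requests is one simple way to stay under limits like these. MinIntervalLimiter is a hypothetical sketch; the injectable clock and sleep functions exist only to make it testable, and the library's own throttling may differ.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Hypothetical limiter: at most `requests_per_second` calls per second."""

    def __init__(
        self,
        requests_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is considered to have started.
        self._last = now + delay
        return delay


limiter = MinIntervalLimiter(requests_per_second=10)  # e.g. OpenAlex with email
limiter.wait()  # the first call never sleeps
```

For Semantic Scholar, the 1.5 second delay suggested in the table would correspond to requests_per_second=1 / 1.5.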

Backward Compatibility

enrich_paper

enrich_paper(url: str, timeout: int = DEFAULT_TIMEOUT) -> dict[str, Any]

Enrich a paper URL with metadata (backward compatibility wrapper).

This is the old functional interface. For new code, use the PaperEnricher class.