Enriching Metadata¶
The enricher fetches rich metadata for each paper using Semantic Scholar and OpenAlex APIs, populating title, authors, abstract, journal, and more.
How It Works¶
The enricher:
- Identifies papers using DOI, arXiv ID, or PubMed ID
- Queries Semantic Scholar API first (comprehensive academic coverage)
- Falls back to OpenAlex API if Semantic Scholar doesn't have it
- Extracts metadata: title, authors, abstract, journal, year, citation count, institutions
- Handles failures gracefully (missing papers get null fields)
- Exports enriched JSON with every metadata field present (null where nothing was found)
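The steps above can be sketched as a small fallback chain. This is an illustration only, not PaperTrail's actual implementation: the key names `arxiv_id` and `pubmed_id`, the identifier priority order, and the stub sources are assumptions made for the example.

```python
from typing import Callable, Optional

# Metadata fields every enriched record should carry (from the list above).
METADATA_FIELDS = ("title", "authors", "abstract", "journal", "year", "citation_count")

# A source takes (id_type, id_value) and returns a metadata dict, or None if not found.
Source = Callable[[str, str], Optional[dict]]


def pick_identifier(paper: dict) -> Optional[tuple]:
    """Return the first identifier present (priority order is an assumption)."""
    for key in ("doi", "arxiv_id", "pubmed_id"):  # hypothetical key names
        if paper.get(key):
            return key, paper[key]
    return None


def enrich_one(paper: dict, sources: list) -> dict:
    """Try each source in order; fields stay None if every source fails."""
    enriched = {**paper, **{f: None for f in METADATA_FIELDS if f not in paper}}
    ident = pick_identifier(paper)
    if ident is None:
        return enriched  # nothing to look up, keep the paper with null fields
    for source in sources:
        try:
            found = source(*ident)
        except Exception:
            found = None  # a failing source must not abort the whole run
        if found:
            enriched.update({k: v for k, v in found.items() if v is not None})
            break  # first source with a record wins
    return enriched


# Usage with stub sources standing in for the two APIs.
def semantic_scholar_stub(id_type: str, value: str) -> Optional[dict]:
    return None  # pretend Semantic Scholar has no record


def openalex_stub(id_type: str, value: str) -> Optional[dict]:
    return {"title": "Example title", "year": 2020}


result = enrich_one({"doi": "10.1000/example"}, [semantic_scholar_stub, openalex_stub])
print(result["title"], result["abstract"])
```

Passing the sources as plain callables keeps the ordering explicit and makes the fallback easy to exercise without network access.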
Basic Usage¶
Enrich Papers¶
Takes the raw JSON from the scraper and populates the metadata fields:
papertrail enrich papers_raw.json -o papers_enriched.json
Dry Run¶
Preview enrichment without saving:
Verbose Output¶
See which papers are being enriched:
Output Format¶
Enriched papers include these additional fields:
{
  "papers": [
    {
      "doi": "10.1038/nature12373",
      "title": "Deep residual learning for image recognition",
      "authors": [
        {
          "name": "Kaiming He",
          "affiliation": "Microsoft Research"
        },
        {
          "name": "Xiangyu Zhang",
          "affiliation": "Microsoft Research"
        }
      ],
      "abstract": "Deeper neural networks are more difficult to train...",
      "journal": "Nature",
      "year": 2015,
      "citation_count": 85432,
      "is_open_access": true,
      "source_type": "doi",
      "url": "https://doi.org/10.1038/nature12373",
      "channel": "general",
      "user": "U123456",
      "timestamp": 1234567890,
      "reactions": {"thumbsup": 5},
      "engagement_score": 0.90
    }
  ]
}
New Fields¶
| Field | Type | Description |
|---|---|---|
| title | string | Paper title |
| authors | array | Author objects with name and affiliation |
| abstract | string | Full abstract text |
| journal | string | Journal name |
| year | integer | Publication year |
| citation_count | integer | Number of citations |
| is_open_access | boolean | Whether paper is open access |
| semantic_scholar_id | string | Semantic Scholar paper ID |
| openalex_id | string | OpenAlex paper ID |
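To see which of these fields were actually filled for a given paper, a small checker along the following lines may help. It is a sketch built only from the table above; the function name `check_paper` and the report labels are illustrative and not part of PaperTrail.

```python
# Expected Python types for the enrichment fields listed in the table above.
EXPECTED_TYPES = {
    "title": str,
    "authors": list,
    "abstract": str,
    "journal": str,
    "year": int,
    "citation_count": int,
    "is_open_access": bool,
    "semantic_scholar_id": str,
    "openalex_id": str,
}


def check_paper(paper: dict) -> dict:
    """Label each field 'ok', 'missing' (absent or null), or 'wrong_type'."""
    report = {}
    for field, expected in EXPECTED_TYPES.items():
        value = paper.get(field)
        if value is None:
            report[field] = "missing"
        elif expected is int and isinstance(value, bool):
            # bool is a subclass of int in Python, so reject it explicitly
            report[field] = "wrong_type"
        elif isinstance(value, expected):
            report[field] = "ok"
        else:
            report[field] = "wrong_type"
    return report


sample = {"title": "A paper", "authors": [], "year": "2015", "is_open_access": True}
report = check_paper(sample)
for field, status in report.items():
    print(f"{field:20s} {status}")
```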
Advanced Options¶
Batch Size¶
Control how many papers are enriched in parallel:
# Enrich 50 papers at a time (default: 10)
papertrail enrich papers_raw.json -o papers_enriched.json --batch-size 50
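Conceptually, a batch size splits the paper list into chunks and processes each chunk concurrently. The sketch below shows one way to do that with a thread pool; it is an assumption about the general technique, not a description of how PaperTrail implements --batch-size, and `enrich_in_batches` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_in_batches(papers: list, enrich_one: Callable[[dict], dict],
                      batch_size: int = 10) -> list:
    """Enrich one batch at a time, running the papers in a batch concurrently."""
    results = []
    for batch in batched(papers, batch_size):
        # One worker per paper in the batch; pool.map preserves input order.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(enrich_one, batch))
    return results


# Usage with a stand-in enrichment function.
papers = [{"id": i} for i in range(25)]
enriched = enrich_in_batches(papers, lambda p: {**p, "done": True}, batch_size=10)
print(len(enriched), enriched[0], enriched[-1])
```

Larger batches finish sooner but send more requests at once, which makes rate limiting more likely.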
Delay Between Requests¶
Add delays to avoid rate limiting:
# 0.5 second delay between API calls
papertrail enrich papers_raw.json -o papers_enriched.json --delay 0.5
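The effect of a fixed delay can be pictured as a generator that pauses between consecutive items. This is a minimal sketch of the idea, not PaperTrail's code; `throttled` is a hypothetical helper, and the `sleep` parameter exists only so the pause can be swapped out in tests.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(items: Iterable, delay: float,
              sleep: Callable[[float], None] = time.sleep) -> Iterator:
    """Yield items one at a time, sleeping `delay` seconds between them."""
    first = True
    for item in items:
        if not first and delay > 0:
            sleep(delay)  # pause before every item except the first
        first = False
        yield item


# Usage: record the pauses instead of actually sleeping.
pauses = []
seen = list(throttled(["a", "b", "c"], delay=0.5, sleep=pauses.append))
print(seen, pauses)
```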
Skip Missing Papers¶
By default, papers without a DOI or arXiv ID are kept in the output but are not enriched. To skip them:
Cache Results¶
Enrichment is automatically cached. To clear cache and re-enrich:
Only Enrich Specific Fields¶
Focus on a subset of metadata:
Handling Issues¶
Papers Without Metadata¶
If a paper isn't found, the enricher:
- Logs a warning
- Leaves fields as null or empty arrays
- Continues with other papers
You can filter these out later:
import json

with open("papers_enriched.json") as f:
    data = json.load(f)

# Keep only papers with abstracts
with_metadata = [p for p in data["papers"] if p.get("abstract")]
print(f"Papers with metadata: {len(with_metadata)}/{len(data['papers'])}")

data["papers"] = with_metadata
with open("papers_enriched_filtered.json", "w") as f:
    json.dump(data, f, indent=2)
Rate Limiting¶
Semantic Scholar and OpenAlex have rate limits:
- Semantic Scholar: ~100 requests/second
- OpenAlex: ~10 requests/second
If you hit limits, PaperTrail will:
- Automatically wait and retry
- Show a warning
- Continue with other papers
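The wait-and-retry behaviour described above is commonly implemented as exponential backoff. The following sketch assumes that pattern; PaperTrail's actual retry policy is not documented here, and `RateLimited`, `fetch_with_retry`, and the default values are hypothetical.

```python
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Raised by a fetch function when the API signals a rate limit (HTTP 429)."""


def fetch_with_retry(fetch: Callable[[], dict], max_attempts: int = 4,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call `fetch`, doubling the wait after each rate-limit error.

    Returns the fetched value, or None once max_attempts is exhausted so the
    caller can continue with the next paper.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                break  # out of attempts, give up on this paper
            wait = base_delay * 2 ** attempt
            print(f"Rate limited; retrying in {wait:.1f}s")
            sleep(wait)
    return None


# Usage: a stub that is rate limited twice before succeeding.
calls = {"n": 0}


def flaky_fetch() -> dict:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return {"title": "Example title"}


waits = []  # record the waits instead of actually sleeping
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```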
To be conservative, combine a smaller --batch-size with a larger --delay:
Missing Authors or Institutions¶
Some papers may have partial author information. This is normal for:
- Very old papers
- Non-traditional publications
- Papers with privacy restrictions
Open Access Detection¶
Open access status comes from the external APIs and is not always accurate. Treat it as a rough indicator only.
Python API¶
Enrich papers programmatically:
from papertrail.enricher import Enricher

# Create enricher
enricher = Enricher(verbose=True)

# Enrich papers; `papers` is the list of paper dicts produced by the scraper
papers = enricher.enrich(
    papers,
    batch_size=20,
    delay=0.5
)

# Access enriched data (papers that were not found have no abstract)
for paper in papers:
    if paper.get("abstract"):
        print(f"{paper['title']}")
        print(f" Authors: {', '.join(a['name'] for a in paper['authors'][:3])}")
        print(f" Citations: {paper['citation_count']}")
Tips & Tricks¶
Find Most Cited Papers¶
import json

with open("papers_enriched.json") as f:
    papers = json.load(f)["papers"]

# Sort by citation count; unenriched papers have a null count, so treat it as 0
sorted_papers = sorted(
    papers,
    key=lambda p: p.get("citation_count") or 0,
    reverse=True
)

print("Most cited papers:")
for paper in sorted_papers[:10]:
    print(f" {paper.get('citation_count') or 0:5d} {paper.get('title')}")
Find Papers by Author¶
import json

author_query = "Marie Curie"

with open("papers_enriched.json") as f:
    papers = json.load(f)["papers"]

for paper in papers:
    # `authors` and `name` may be null for unenriched papers
    for author in paper.get("authors") or []:
        if author_query.lower() in (author.get("name") or "").lower():
            print(f"{paper['title']}")
            break
Extract Institutions¶
import json
from collections import Counter

with open("papers_enriched.json") as f:
    papers = json.load(f)["papers"]

# Find most common institutions
institutions = []
for paper in papers:
    # `authors` may be null for unenriched papers
    for author in paper.get("authors") or []:
        if author.get("affiliation"):
            institutions.append(author["affiliation"])

counter = Counter(institutions)
print("Most common institutions:")
for inst, count in counter.most_common(10):
    print(f" {count:3d} {inst}")
Find Open Access Papers¶
import json

with open("papers_enriched.json") as f:
    papers = json.load(f)["papers"]

open_access = [p for p in papers if p.get("is_open_access")]
# Guard against an empty file to avoid dividing by zero
share = 100 * len(open_access) / len(papers) if papers else 0.0
print(f"Open access: {len(open_access)}/{len(papers)} ({share:.1f}%)")
Enrichment Sources¶
Semantic Scholar¶
- API: api.semanticscholar.org
- Coverage: Very broad (60M+ papers)
- Fields: Title, authors, abstract, journal, year, citations, open access
- Rate limit: ~100 req/sec (free tier)
OpenAlex¶
- API: api.openalex.org
- Coverage: Broad (~200M works)
- Fields: Title, authors, abstract, journal, year, citations, concepts
- Rate limit: ~10 req/sec (no auth needed)
Both APIs are free and don't require API keys. Authentication can increase rate limits.
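For orientation, lookups by identifier against the two services can be expressed as plain URLs. The sketch below only builds the URLs and makes no network calls. The endpoint paths and identifier prefixes follow the public Semantic Scholar Graph API and OpenAlex conventions as best understood here, so verify them against the current API documentation. The helper names and the field selection are assumptions, and the requests PaperTrail actually sends may differ.

```python
from urllib.parse import quote

# Fields requested from Semantic Scholar (an illustrative selection).
S2_FIELDS = "title,authors,abstract,venue,year,citationCount,isOpenAccess"

# Identifier prefixes used by the Semantic Scholar paper endpoint.
S2_PREFIX = {"doi": "DOI", "arxiv": "ARXIV", "pmid": "PMID"}


def semantic_scholar_url(id_type: str, value: str) -> str:
    """Build a Graph API paper lookup URL for a DOI, arXiv ID, or PubMed ID."""
    paper_id = quote(f"{S2_PREFIX[id_type]}:{value}", safe=":/")
    return f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}?fields={S2_FIELDS}"


def openalex_url(id_type: str, value: str) -> str:
    """Build an OpenAlex works lookup URL for a DOI or PubMed ID."""
    if id_type not in ("doi", "pmid"):
        # This sketch handles only the two identifier types it is sure about.
        raise ValueError(f"unsupported identifier type for OpenAlex: {id_type}")
    return f"https://api.openalex.org/works/{id_type}:{quote(value, safe='/')}"


print(semantic_scholar_url("doi", "10.1038/nature12373"))
print(openalex_url("doi", "10.1038/nature12373"))
```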
Next Steps¶
- Computing Embeddings — Create semantic representations
- Building the Dashboard — Visualize papers
- API Reference: Enricher — Detailed Python API