Scraper API¶
The scraper module discovers academic papers shared in Slack channels. It handles full pagination, URL extraction from Slack message formatting, domain-based filtering across 30+ academic publishers, and URL normalization for deduplication.
Quick Example¶
```python
from papertrail.scraper import SlackPaperScraper

scraper = SlackPaperScraper(token="xoxb-...")
papers = scraper.scrape_channel("C0123Q7PGGP")
print(f"Found {len(papers)} papers")

# Extract URLs from raw text
urls = scraper.extract_paper_urls(["Check out https://arxiv.org/abs/2301.04821"])
# → ['https://arxiv.org/abs/2301.04821']
```
Constants¶
PAPER_DOMAINS
module-attribute
¶
PAPER_DOMAINS = {'arxiv.org', 'biorxiv.org', 'medrxiv.org', 'psyarxiv.org', 'eartharxiv.org', 'ecoevo.org', 'doi.org', 'nature.com', 'science.org', 'cell.com', 'pnas.org', 'springer.com', 'wiley.com', 'elsevier.com', 'academic.oup.com', 'tandfonline.com', 'jstor.org', 'plos.org', 'elifesciences.org', 'pubmed.ncbi.nlm.nih.gov', 'pmc.ncbi.nlm.nih.gov', 'ncbi.nlm.nih.gov', 'researchgate.net', 'academia.edu', 'github.com', 'ssrn.com', 'paperswithcode.com', 'openreview.net', 'arxiv-vanity.com'}
EXCLUDE_PATTERNS
module-attribute
¶
EXCLUDE_PATTERNS = ['slack\\.com', 'youtu(?:\\.be|be\\.com)', 'twitter\\.com|x\\.com', 'github\\.com/?$', 'imgur\\.com', 'imgur\\.com/\\w{5,7}$']
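To make the filtering concrete, here is a rough, self-contained sketch of how these two constants might be applied together, using small illustrative subsets; the library's actual `is_paper_url` logic may differ in details:

```python
import re

# Illustrative subsets of the module constants above; the matching
# logic here is a sketch, not the library's implementation.
PAPER_DOMAINS = {"arxiv.org", "doi.org", "github.com"}
EXCLUDE_PATTERNS = [r"slack\.com", r"youtu(?:\.be|be\.com)", r"github\.com/?$"]

def looks_like_paper(url: str) -> bool:
    # Exclusion patterns win over domain matches
    if any(re.search(p, url) for p in EXCLUDE_PATTERNS):
        return False
    return any(domain in url for domain in PAPER_DOMAINS)

print(looks_like_paper("https://arxiv.org/abs/2301.04821"))   # True
print(looks_like_paper("https://github.com/"))                # False: bare GitHub root
print(looks_like_paper("https://github.com/org/paper-code"))  # True: repo link kept
```

Note how `github\.com/?$` excludes only a bare GitHub root while still allowing links to specific repositories, which often host paper code.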
Data Classes¶
SlackPaper
dataclass
¶
SlackPaper(channel_id: str, channel_name: str, shared_by: str, user_id: str, timestamp: str, message_ts: str, permalink: str, message_text: str, paper_url: str, reactions_count: int = 0, reply_count: int = 0, reaction_details: dict[str, int] = dict())
Represents a paper shared in a Slack message.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `channel_id` | `str` | Slack channel ID where the paper was shared. |
| `channel_name` | `str` | Human-readable channel name. |
| `shared_by` | `str` | Display name of the user who shared the paper. |
| `user_id` | `str` | Slack user ID of the person who shared. |
| `timestamp` | `str` | ISO-format timestamp of when the paper was shared. |
| `message_ts` | `str` | Slack message timestamp (for accessing the thread). |
| `permalink` | `str` | Direct link to the Slack message. |
| `message_text` | `str` | Full message text where the paper was mentioned. |
| `paper_url` | `str` | Extracted and normalized paper URL. |
| `reactions_count` | `int` | Total number of emoji reactions on the message. |
| `reply_count` | `int` | Number of replies in the thread. |
| `reaction_details` | `dict[str, int]` | Mapping of emoji names to reaction counts. |
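For illustration, here is a runnable sketch using a local stand-in that mirrors the fields above. In real code, instances come from `SlackPaperScraper.scrape_channel()`; the engagement-ranking step at the end is an example pattern, not part of the library:

```python
from dataclasses import dataclass, field

# Local stand-in mirroring the SlackPaper fields documented above,
# so the example runs without Slack access.
@dataclass
class SlackPaper:
    channel_id: str
    channel_name: str
    shared_by: str
    user_id: str
    timestamp: str
    message_ts: str
    permalink: str
    message_text: str
    paper_url: str
    reactions_count: int = 0
    reply_count: int = 0
    reaction_details: dict[str, int] = field(default_factory=dict)

papers = [
    SlackPaper("C01", "papers", "Ada", "U1", "2024-01-02T10:00:00",
               "1704189600.000100", "https://example.slack.com/p1",
               "great read", "https://arxiv.org/abs/2301.04821",
               reactions_count=5, reaction_details={"fire": 3, "eyes": 2}),
    SlackPaper("C01", "papers", "Bob", "U2", "2024-01-03T09:00:00",
               "1704272400.000200", "https://example.slack.com/p2",
               "see this", "https://doi.org/10.1234/x",
               reactions_count=1),
]

# Rank shared papers by engagement (reactions first, then replies)
top = max(papers, key=lambda p: (p.reactions_count, p.reply_count))
print(top.paper_url)  # https://arxiv.org/abs/2301.04821
```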
SlackPaperScraper¶
SlackPaperScraper
¶
SlackPaperScraper(token: str | None = None, channels: list[str] | None = None, search_queries: list[str] | None = None, rate_limit_delay: float = 0.3, use_mcp: bool = False, custom_domains: set[str] | None = None)
Comprehensive Slack paper scraper with pagination, filtering, and enrichment.
This is the main scraper class. For backward compatibility with older code, the original SlackScraper class is still available as an alias.
Supports two modes of operation:

1. Direct Slack API access via slack_sdk.WebClient (recommended)
2. MCP tool integration for remote/constrained environments
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `token` | `str \| None` | Slack Bot Token (`xoxb-...`). Required for direct API mode. |
| `channels` | `list[str] \| None` | Channel IDs or names to scrape. If None, scrapes all channels accessible to the bot token. |
| `search_queries` | `list[str] \| None` | Custom Slack search queries to use. Defaults to searching common paper domains. |
| `rate_limit_delay` | `float` | Delay in seconds between API calls (default: 0.3). |
| `use_mcp` | `bool` | If True, attempt to use MCP tools instead of the direct Slack API. |
| `custom_domains` | `set[str] \| None` | Additional domains to treat as paper sources, merged with PAPER_DOMAINS. |
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `BASE_URL` | `str` | Slack API base URL. |
Examples:
Scrape specific channel:
>>> scraper = SlackPaperScraper(token="xoxb-...")
>>> papers = scraper.scrape_channel("C123456789")
>>> print(f"Found {len(papers)} papers")
Extract URLs from message text:
>>> messages = scraper.scrape_channel("C123456789")
>>> texts = [m.message_text for m in messages]
>>> urls = scraper.extract_paper_urls(texts)
>>> print(f"Paper URLs: {urls}")
Normalize URLs for deduplication:
>>> url1 = "https://doi.org/10.1234/example"
>>> url2 = "https://example.com?utm_source=slack"
>>> norm1 = scraper.normalize_url(url1)
>>> norm2 = scraper.normalize_url(url2)
scrape_channel
¶
scrape_channel(channel_id: str, oldest: str | float | None = None, latest: str | float | None = None, include_replies: bool = False) -> list[SlackPaper]
Scrape all messages from a channel with full pagination.
Fetches complete message history with automatic pagination and optional engagement metrics (reactions, thread replies).
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `channel_id` | `str` | The channel ID to scrape (must start with 'C'). |
| `oldest` | `str \| float \| None` | Oldest message timestamp to include (Unix timestamp or ISO-format string). If None, starts from channel creation. |
| `latest` | `str \| float \| None` | Newest message timestamp to include (Unix timestamp or ISO-format string). If None, includes messages through the current time. |
| `include_replies` | `bool` | If True, also fetch thread reply counts and reaction details (slower; more API calls). |

| RETURNS | DESCRIPTION |
|---|---|
| `list[SlackPaper]` | Papers found in the channel, deduplicated by URL and sorted most recent first. |
Notes
This method automatically handles pagination. The Slack API returns messages in batches, and this method continues fetching until all messages are retrieved.
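Since the `oldest`/`latest` bounds accept Unix timestamps, they can be built from ISO datetimes with the standard library. The `to_slack_ts` helper below is an illustrative convenience (it treats naive input as UTC), not part of the library:

```python
from datetime import datetime, timezone

def to_slack_ts(iso: str) -> float:
    """Convert a naive ISO datetime string to Unix epoch seconds (UTC)."""
    return datetime.fromisoformat(iso).replace(tzinfo=timezone.utc).timestamp()

oldest = to_slack_ts("2024-01-01T00:00:00")
latest = to_slack_ts("2024-02-01T00:00:00")
print(oldest)  # 1704067200.0

# Then, with a configured scraper:
# papers = scraper.scrape_channel("C0123Q7PGGP", oldest=oldest, latest=latest)
```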
extract_paper_urls
¶
extract_paper_urls(texts: list[str]) -> list[str]
Extract paper URLs from a list of text strings.
Searches for URLs containing known paper domains and filters out common false positives (images, social media, etc.).

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `texts` | `list[str]` | Text strings to search for paper URLs. |

| RETURNS | DESCRIPTION |
|---|---|
| `list[str]` | Extracted URLs, deduplicated and normalized. |
Notes
This method:
- Extracts URLs from Slack message formatting, where links appear as `<url>` or `<url|display text>`
- Keeps only URLs matching PAPER_DOMAINS and drops any matching EXCLUDE_PATTERNS
- Normalizes each URL and deduplicates the results
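Slack's link formatting can be unwrapped with a small regex. This is an illustrative sketch, not the scraper's actual parser:

```python
import re

def urls_from_slack_text(text: str) -> list[str]:
    """Pull raw URLs out of Slack's <url> / <url|label> link syntax."""
    return re.findall(r"<(https?://[^|>]+)(?:\|[^>]*)?>", text)

msg = "Check out <https://arxiv.org/abs/2301.04821|this paper> and <https://doi.org/10.1/x>"
print(urls_from_slack_text(msg))
# ['https://arxiv.org/abs/2301.04821', 'https://doi.org/10.1/x']
```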
is_paper_url
staticmethod
¶
is_paper_url(url: str) -> bool
Check if a URL points to a paper or academic resource.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | URL to check. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the URL matches a known paper domain and no exclusion pattern. |
normalize_url
staticmethod
¶
normalize_url(url: str) -> str
Normalize a URL for consistent comparison and deduplication.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | URL to normalize. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Normalized URL: lowercase scheme and domain, tracking parameters removed, fragments removed, trailing slashes standardized. |
Notes
This removes common tracking parameters like utm_* and fbclid to ensure papers shared from different sources are deduplicated.
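A minimal sketch of the normalization described above, using only the standard library; the library's `normalize_url` may differ in details (e.g. which tracking keys it strips):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative tracking keys; utm_* parameters are handled by prefix.
TRACKING_KEYS = {"fbclid", "gclid"}

def normalize(url: str) -> str:
    parts = urlsplit(url)
    # Drop tracking parameters, keep everything else in order
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_KEYS and not k.startswith("utm_")]
    # Lowercase scheme/host, strip trailing slash, discard fragment
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), urlencode(query), ""))

print(normalize("HTTPS://DOI.org/10.1234/example/?utm_source=slack#abstract"))
# https://doi.org/10.1234/example
```

Normalizing before comparison is what lets the same paper shared via different links (with and without tracking parameters) collapse to one entry.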
Type Reference¶
| Type | Description |
|---|---|
| `SlackPaper` | Dataclass representing a paper found in Slack. Fields: channel_id, channel_name, shared_by, user_id, timestamp, message_ts, permalink, message_text, paper_url, reactions_count, reply_count, reaction_details. |
| `list[SlackPaper]` | Returned by scrape_channel(). Each entry is a unique paper URL found in the channel. |
| `list[str]` | Returned by extract_paper_urls(). Deduplicated, normalized paper URLs. |