Scraper API¶
The scraper module discovers academic papers shared in Slack channels. It handles full pagination, URL extraction from Slack message formatting, domain-based filtering across 30+ academic publishers, and URL normalization for deduplication.
Quick Example¶
```python
from papertrail.scraper import SlackPaperScraper

scraper = SlackPaperScraper(token="xoxb-...")
papers = scraper.scrape_channel("C0123Q7PGGP")
print(f"Found {len(papers)} papers")

# Extract URLs from raw text
urls = scraper.extract_paper_urls(["Check out https://arxiv.org/abs/2301.04821"])
# → ['https://arxiv.org/abs/2301.04821']
```
Constants¶
PAPER_DOMAINS
module-attribute
¶
PAPER_DOMAINS = {'arxiv.org', 'biorxiv.org', 'medrxiv.org', 'psyarxiv.org', 'eartharxiv.org', 'ecoevo.org', 'doi.org', 'nature.com', 'science.org', 'cell.com', 'pnas.org', 'springer.com', 'wiley.com', 'elsevier.com', 'academic.oup.com', 'tandfonline.com', 'jstor.org', 'plos.org', 'elifesciences.org', 'pubmed.ncbi.nlm.nih.gov', 'pmc.ncbi.nlm.nih.gov', 'ncbi.nlm.nih.gov', 'researchgate.net', 'academia.edu', 'github.com', 'ssrn.com', 'paperswithcode.com', 'openreview.net', 'arxiv-vanity.com'}
EXCLUDE_PATTERNS
module-attribute
¶
EXCLUDE_PATTERNS = ['slack\\.com', 'youtu(?:\\.be|be\\.com)', 'twitter\\.com|x\\.com', 'github\\.com/?$', 'imgur\\.com', 'imgur\\.com/\\w{5,7}$']
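To make the filtering concrete, here is a rough, self-contained sketch of how these two constants might be applied together, using small illustrative subsets; the library's actual `is_paper_url` logic may differ in details:

```python
import re

# Illustrative subsets of the module constants above; the matching
# logic here is a sketch, not the library's implementation.
PAPER_DOMAINS = {"arxiv.org", "doi.org", "github.com"}
EXCLUDE_PATTERNS = [r"slack\.com", r"youtu(?:\.be|be\.com)", r"github\.com/?$"]

def looks_like_paper(url: str) -> bool:
    # Exclusion patterns win over domain matches
    if any(re.search(p, url) for p in EXCLUDE_PATTERNS):
        return False
    return any(domain in url for domain in PAPER_DOMAINS)

print(looks_like_paper("https://arxiv.org/abs/2301.04821"))   # True
print(looks_like_paper("https://github.com/"))                # False: bare GitHub root
print(looks_like_paper("https://github.com/org/paper-code"))  # True: repo link kept
```

Note how `github\.com/?$` excludes only a bare GitHub root while still allowing links to specific repositories, which often host paper code.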
Data Classes¶
SlackPaper
dataclass
¶
SlackPaper(channel_id: str, channel_name: str, shared_by: str, user_id: str, timestamp: str, message_ts: str, permalink: str, message_text: str, paper_url: str, reactions_count: int = 0, reply_count: int = 0, reaction_details: dict[str, int] = dict())
Represents a paper shared in a Slack message.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `channel_id` | `str` | Slack channel ID where the paper was shared. |
| `channel_name` | `str` | Human-readable channel name. |
| `shared_by` | `str` | Display name of the user who shared the paper. |
| `user_id` | `str` | Slack user ID of the person who shared. |
| `timestamp` | `str` | ISO-format timestamp of when the paper was shared. |
| `message_ts` | `str` | Slack message timestamp (for accessing the thread). |
| `permalink` | `str` | Direct link to the Slack message. |
| `message_text` | `str` | Full message text where the paper was mentioned. |
| `paper_url` | `str` | Extracted and normalized paper URL. |
| `reactions_count` | `int` | Total number of emoji reactions on the message. |
| `reply_count` | `int` | Number of replies in the thread. |
| `reaction_details` | `dict[str, int]` | Mapping of emoji names to reaction counts. |
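For illustration, here is a runnable sketch using a local stand-in that mirrors the fields above. In real code, instances come from `SlackPaperScraper.scrape_channel()`; the engagement-ranking step at the end is an example pattern, not part of the library:

```python
from dataclasses import dataclass, field

# Local stand-in mirroring the SlackPaper fields documented above,
# so the example runs without Slack access.
@dataclass
class SlackPaper:
    channel_id: str
    channel_name: str
    shared_by: str
    user_id: str
    timestamp: str
    message_ts: str
    permalink: str
    message_text: str
    paper_url: str
    reactions_count: int = 0
    reply_count: int = 0
    reaction_details: dict[str, int] = field(default_factory=dict)

papers = [
    SlackPaper("C01", "papers", "Ada", "U1", "2024-01-02T10:00:00",
               "1704189600.000100", "https://example.slack.com/p1",
               "great read", "https://arxiv.org/abs/2301.04821",
               reactions_count=5, reaction_details={"fire": 3, "eyes": 2}),
    SlackPaper("C01", "papers", "Bob", "U2", "2024-01-03T09:00:00",
               "1704272400.000200", "https://example.slack.com/p2",
               "see this", "https://doi.org/10.1234/x",
               reactions_count=1),
]

# Rank shared papers by engagement (reactions first, then replies)
top = max(papers, key=lambda p: (p.reactions_count, p.reply_count))
print(top.paper_url)  # https://arxiv.org/abs/2301.04821
```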
SlackPaperScraper¶
SlackPaperScraper
¶
SlackPaperScraper(token: str | None = None, channels: list[str] | None = None, search_queries: list[str] | None = None, rate_limit_delay: float = 0.3, use_mcp: bool = False, custom_domains: set[str] | None = None)
Comprehensive Slack paper scraper with pagination, filtering, and enrichment.
This is the main scraper class. For backward compatibility with older code, the original SlackScraper class is still available as an alias.
Supports two modes of operation:

1. Direct Slack API access via slack_sdk.WebClient (recommended)
2. MCP tool integration for remote/constrained environments
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `token` | `str \| None` | Slack Bot Token (`xoxb-...`). Required for direct API mode. |
| `channels` | `list[str] \| None` | Channel IDs or names to scrape. If None, scrapes all channels accessible to the bot token. |
| `search_queries` | `list[str] \| None` | Custom Slack search queries to use. Defaults to searching common paper domains. |
| `rate_limit_delay` | `float` | Delay in seconds between API calls (default: 0.3). |
| `use_mcp` | `bool` | If True, attempt to use MCP tools instead of the direct Slack API. |
| `custom_domains` | `set[str] \| None` | Additional domains to treat as paper sources, merged with PAPER_DOMAINS. |
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `BASE_URL` | `str` | Slack API base URL. |
Examples:
Scrape specific channel:
>>> scraper = SlackPaperScraper(token="xoxb-...")
>>> papers = scraper.scrape_channel("C123456789")
>>> print(f"Found {len(papers)} papers")
Extract URLs from message text:
>>> messages = scraper.scrape_channel("C123456789")
>>> texts = [m.message_text for m in messages]
>>> urls = scraper.extract_paper_urls(texts)
>>> print(f"Paper URLs: {urls}")
Normalize URLs for deduplication:
>>> url1 = "https://doi.org/10.1234/example"
>>> url2 = "https://example.com?utm_source=slack"
>>> norm1 = scraper.normalize_url(url1)
>>> norm2 = scraper.normalize_url(url2)
scrape_channel
¶
scrape_channel(channel_id: str, oldest: str | float | None = None, latest: str | float | None = None, include_replies: bool = False) -> list[SlackPaper]
Scrape all messages from a channel with full pagination.
Fetches complete message history with automatic pagination and optional engagement metrics (reactions, thread replies).
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `channel_id` | `str` | The channel ID to scrape (must start with 'C'). |
| `oldest` | `str \| float \| None` | Oldest message timestamp to include (Unix timestamp or ISO-format string). If None, starts from channel creation. |
| `latest` | `str \| float \| None` | Newest message timestamp to include (Unix timestamp or ISO-format string). If None, includes messages through the current time. |
| `include_replies` | `bool` | If True, also fetch thread reply counts and reaction details (slower; more API calls). |

| RETURNS | DESCRIPTION |
|---|---|
| `list[SlackPaper]` | Papers found in the channel, deduplicated by URL and sorted most recent first. |
Notes
This method automatically handles pagination. The Slack API returns messages in batches, and this method continues fetching until all messages are retrieved.
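Since the `oldest`/`latest` bounds accept Unix timestamps, they can be built from ISO datetimes with the standard library. The `to_slack_ts` helper below is an illustrative convenience (it treats naive input as UTC), not part of the library:

```python
from datetime import datetime, timezone

def to_slack_ts(iso: str) -> float:
    """Convert a naive ISO datetime string to Unix epoch seconds (UTC)."""
    return datetime.fromisoformat(iso).replace(tzinfo=timezone.utc).timestamp()

oldest = to_slack_ts("2024-01-01T00:00:00")
latest = to_slack_ts("2024-02-01T00:00:00")
print(oldest)  # 1704067200.0

# Then, with a configured scraper:
# papers = scraper.scrape_channel("C0123Q7PGGP", oldest=oldest, latest=latest)
```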
extract_paper_urls
¶
extract_paper_urls(texts: list[str]) -> list[str]
Extract paper URLs from a list of text strings.
Searches for URLs containing known paper domains and filters out common false positives (images, social media, etc.).

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `texts` | `list[str]` | Text strings to search for paper URLs. |

| RETURNS | DESCRIPTION |
|---|---|
| `list[str]` | Extracted URLs, deduplicated and normalized. |
Notes
This method:
- Extracts URLs from Slack message formatting, where links appear as `<url>` or `<url|display text>`
- Keeps only URLs matching PAPER_DOMAINS and drops any matching EXCLUDE_PATTERNS
- Normalizes each URL and deduplicates the results
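Slack's link formatting can be unwrapped with a small regex. This is an illustrative sketch, not the scraper's actual parser:

```python
import re

def urls_from_slack_text(text: str) -> list[str]:
    """Pull raw URLs out of Slack's <url> / <url|label> link syntax."""
    return re.findall(r"<(https?://[^|>]+)(?:\|[^>]*)?>", text)

msg = "Check out <https://arxiv.org/abs/2301.04821|this paper> and <https://doi.org/10.1/x>"
print(urls_from_slack_text(msg))
# ['https://arxiv.org/abs/2301.04821', 'https://doi.org/10.1/x']
```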
is_paper_url
staticmethod
¶
is_paper_url(url: str) -> bool
Check if a URL points to a paper or academic resource.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | URL to check. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the URL matches a known paper domain and no exclusion pattern. |
normalize_url
staticmethod
¶
normalize_url(url: str) -> str
Normalize a URL for consistent comparison and deduplication.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | URL to normalize. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Normalized URL: lowercase scheme and domain, tracking parameters removed, fragments removed, trailing slashes standardized. |
Notes
This removes common tracking parameters like utm_* and fbclid to ensure papers shared from different sources are deduplicated.
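A minimal sketch of the normalization described above, using only the standard library; the library's `normalize_url` may differ in details (e.g. which tracking keys it strips):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative tracking keys; utm_* parameters are handled by prefix.
TRACKING_KEYS = {"fbclid", "gclid"}

def normalize(url: str) -> str:
    parts = urlsplit(url)
    # Drop tracking parameters, keep everything else in order
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_KEYS and not k.startswith("utm_")]
    # Lowercase scheme/host, strip trailing slash, discard fragment
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), urlencode(query), ""))

print(normalize("HTTPS://DOI.org/10.1234/example/?utm_source=slack#abstract"))
# https://doi.org/10.1234/example
```

Normalizing before comparison is what lets the same paper shared via different links (with and without tracking parameters) collapse to one entry.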
Type Reference¶
| Type | Description |
|---|---|
| `SlackPaper` | Dataclass representing a paper found in Slack. Fields: channel_id, channel_name, shared_by, user_id, timestamp, message_ts, permalink, message_text, paper_url, reactions_count, reply_count, reaction_details. |
| `list[SlackPaper]` | Returned by scrape_channel(). Each entry is a unique paper URL found in the channel. |
| `list[str]` | Returned by extract_paper_urls(). Deduplicated, normalized paper URLs. |