
Scraper API

The scraper module discovers academic papers shared in Slack channels. It handles full pagination of channel history, URL extraction from Slack's message formatting, domain-based filtering against nearly 30 academic publisher and preprint domains, and URL normalization for deduplication.

Quick Example

from papertrail.scraper import SlackPaperScraper

scraper = SlackPaperScraper(token="xoxb-...")
papers = scraper.scrape_channel("C0123Q7PGGP")
print(f"Found {len(papers)} papers")

# Extract URLs from raw text
urls = scraper.extract_paper_urls(["Check out https://arxiv.org/abs/2301.04821"])
# → ['https://arxiv.org/abs/2301.04821']

Constants

PAPER_DOMAINS module-attribute

PAPER_DOMAINS = {'arxiv.org', 'biorxiv.org', 'medrxiv.org', 'psyarxiv.org', 'eartharxiv.org', 'ecoevo.org', 'doi.org', 'nature.com', 'science.org', 'cell.com', 'pnas.org', 'springer.com', 'wiley.com', 'elsevier.com', 'academic.oup.com', 'tandfonline.com', 'jstor.org', 'plos.org', 'elifesciences.org', 'pubmed.ncbi.nlm.nih.gov', 'pmc.ncbi.nlm.nih.gov', 'ncbi.nlm.nih.gov', 'researchgate.net', 'academia.edu', 'github.com', 'ssrn.com', 'paperswithcode.com', 'openreview.net', 'arxiv-vanity.com'}

EXCLUDE_PATTERNS module-attribute

EXCLUDE_PATTERNS = ['slack\\.com', 'youtu(?:\\.be|be\\.com)', 'twitter\\.com|x\\.com', 'github\\.com/?$', 'imgur\\.com', 'imgur\\.com/\\w{5,7}$']
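The two constants work together: a URL must match a paper domain and must not match an exclusion pattern. A minimal sketch of that check, using small local subsets of the sets above (the helper name `matches_paper_domain` is illustrative, not the module's API):

```python
import re
from urllib.parse import urlparse

# Small subsets of the module constants, copied here so the sketch is
# self-contained; the real module defines the full sets shown above.
PAPER_DOMAINS = {"arxiv.org", "doi.org", "nature.com", "github.com"}
EXCLUDE_PATTERNS = [r"slack\.com", r"twitter\.com|x\.com", r"github\.com/?$", r"imgur\.com"]

def matches_paper_domain(url: str) -> bool:
    """Accept a URL whose host is (or is a subdomain of) a known paper
    domain, unless any exclusion pattern matches the full URL."""
    host = urlparse(url).netloc.lower()
    if not any(host == d or host.endswith("." + d) for d in PAPER_DOMAINS):
        return False
    return not any(re.search(p, url) for p in EXCLUDE_PATTERNS)

print(matches_paper_domain("https://arxiv.org/abs/2301.04821"))  # True
print(matches_paper_domain("https://github.com/"))               # False: bare GitHub root is excluded
```

Note how the `github\.com/?$` pattern rejects only a bare GitHub root URL, while repository links (which often host paper code) still pass.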

Data Classes

SlackPaper dataclass

SlackPaper(channel_id: str, channel_name: str, shared_by: str, user_id: str, timestamp: str, message_ts: str, permalink: str, message_text: str, paper_url: str, reactions_count: int = 0, reply_count: int = 0, reaction_details: dict[str, int] = dict())

Represents a paper shared in a Slack message.

ATTRIBUTE DESCRIPTION
channel_id

Slack channel ID where paper was shared.

TYPE: str

channel_name

Human-readable channel name.

TYPE: str

shared_by

Display name of user who shared the paper.

TYPE: str

user_id

Slack user ID of the person who shared.

TYPE: str

timestamp

ISO format timestamp when paper was shared.

TYPE: str

message_ts

Slack message timestamp (for accessing thread).

TYPE: str

permalink

Direct link to the Slack message.

TYPE: str

message_text

Full message text where paper was mentioned.

TYPE: str

paper_url

Extracted and normalized paper URL.

TYPE: str

reactions_count

Total number of emoji reactions on this message.

TYPE: int

reply_count

Number of replies in the thread.

TYPE: int

reaction_details

Mapping of emoji names to reaction counts.

TYPE: dict[str, int]
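The fields above can be exercised with an illustrative stand-in that mirrors the documented dataclass (the real class lives in papertrail.scraper; note that a mutable default like the reaction_details dict needs default_factory in an actual dataclass definition):

```python
from dataclasses import dataclass, field

# Illustrative stand-in mirroring the SlackPaper fields documented above.
@dataclass
class SlackPaper:
    channel_id: str
    channel_name: str
    shared_by: str
    user_id: str
    timestamp: str
    message_ts: str
    permalink: str
    message_text: str
    paper_url: str
    reactions_count: int = 0
    reply_count: int = 0
    reaction_details: dict[str, int] = field(default_factory=dict)

paper = SlackPaper(
    channel_id="C0123Q7PGGP",
    channel_name="papers",
    shared_by="Ada",
    user_id="U024BE7LH",
    timestamp="2024-05-01T12:00:00+00:00",
    message_ts="1714564800.000100",
    permalink="https://example.slack.com/archives/C0123Q7PGGP/p1714564800000100",
    message_text="Check out https://arxiv.org/abs/2301.04821",
    paper_url="https://arxiv.org/abs/2301.04821",
)
print(paper.reactions_count, paper.reaction_details)  # 0 {}
```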

SlackPaperScraper

SlackPaperScraper

SlackPaperScraper(token: str | None = None, channels: list[str] | None = None, search_queries: list[str] | None = None, rate_limit_delay: float = 0.3, use_mcp: bool = False, custom_domains: set[str] | None = None)

Comprehensive Slack paper scraper with pagination, filtering, and enrichment.

This is the main scraper class. For backward compatibility with older code, the original SlackScraper class is still available as an alias.

Supports two modes of operation:

1. Direct Slack API access via slack_sdk.WebClient (recommended)
2. MCP tool integration for remote/constrained environments

PARAMETER DESCRIPTION
token

Slack Bot Token (xoxb-...). Required for direct API mode.

TYPE: str DEFAULT: None

channels

List of channel IDs or names to scrape. If None, will scrape all channels accessible to the bot token.

TYPE: list[str] DEFAULT: None

search_queries

Custom Slack search queries to use. Defaults to searching common paper domains.

TYPE: list[str] DEFAULT: None

rate_limit_delay

Delay in seconds between consecutive API calls.

TYPE: float DEFAULT: 0.3

use_mcp

If True, attempt to use MCP tools instead of direct Slack API.

TYPE: bool DEFAULT: False

custom_domains

Additional domains to treat as paper sources, merged with PAPER_DOMAINS.

TYPE: set[str] DEFAULT: None

ATTRIBUTE DESCRIPTION
BASE_URL

Slack API base URL.

TYPE: str

Examples:

Scrape specific channel:

>>> scraper = SlackPaperScraper(token="xoxb-...")
>>> papers = scraper.scrape_channel("C123456789")
>>> print(f"Found {len(papers)} papers")

Extract URLs from message text:

>>> messages = scraper.scrape_channel("C123456789")
>>> texts = [m.message_text for m in messages]
>>> urls = scraper.extract_paper_urls(texts)
>>> print(f"Paper URLs: {urls}")

Normalize URLs for deduplication:

>>> url1 = "https://doi.org/10.1234/example"
>>> url2 = "https://example.com?utm_source=slack"
>>> norm1 = scraper.normalize_url(url1)
>>> norm2 = scraper.normalize_url(url2)


scrape_channel

scrape_channel(channel_id: str, oldest: str | float | None = None, latest: str | float | None = None, include_replies: bool = False) -> list[SlackPaper]

Scrape all messages from a channel with full pagination.

Fetches complete message history with automatic pagination and optional engagement metrics (reactions, thread replies).

PARAMETER DESCRIPTION
channel_id

The channel ID to scrape (must start with 'C').

TYPE: str

oldest

Oldest message timestamp to include (Unix timestamp or ISO format string). If None, starts from channel creation.

TYPE: str | float DEFAULT: None

latest

Newest message timestamp to include (Unix timestamp or ISO format string). If None, includes through current time.

TYPE: str | float DEFAULT: None

include_replies

If True, fetch and include thread reply counts and reaction details (slower, more API calls).

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
list[SlackPaper]

List of papers found in the channel, deduplicated by URL. Sorted by most recent first.

Notes

This method automatically handles pagination. The Slack API returns messages in batches, and this method continues fetching until all messages are retrieved.

extract_paper_urls

extract_paper_urls(texts: list[str]) -> list[str]

Extract paper URLs from a list of text strings.

Searches for URLs containing known paper domains and filters out common false positives (images, social media, etc.).

PARAMETER DESCRIPTION
texts

List of text strings to search for paper URLs.

TYPE: list[str]

RETURNS DESCRIPTION
list[str]

List of extracted URLs, deduplicated and normalized.

Notes

This method:

- Extracts URLs from Slack message formatting
- Filters for known paper domains
- Removes common false positives
- Normalizes URLs for comparison
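The extraction step can be sketched with a regex over Slack's link markup, where links appear as &lt;https://url&gt; or &lt;https://url|display text&gt;. This is a minimal sketch; the patterns are assumptions, not the module's actual regexes:

```python
import re

def extract_urls_from_slack_text(text: str) -> list[str]:
    """Pull URLs out of Slack link markup and bare text."""
    # Angle-bracket links: <https://url> or <https://url|label>.
    urls = re.findall(r"<(https?://[^>|]+)(?:\|[^>]*)?>", text)
    # Bare URLs outside angle brackets, as seen in plain-text exports.
    urls += re.findall(r"(?<![<\w])https?://[^\s<>|]+", text)
    return list(dict.fromkeys(urls))  # dedupe, preserving order

print(extract_urls_from_slack_text(
    "New preprint: <https://arxiv.org/abs/2301.04821|great paper>"
))  # ['https://arxiv.org/abs/2301.04821']
```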

is_paper_url staticmethod

is_paper_url(url: str) -> bool

Check if a URL points to a paper or academic resource.

PARAMETER DESCRIPTION
url

URL to check.

TYPE: str

RETURNS DESCRIPTION
bool

True if URL matches known paper domains and doesn't match exclusion patterns.

normalize_url staticmethod

normalize_url(url: str) -> str

Normalize a URL for consistent comparison and deduplication.

PARAMETER DESCRIPTION
url

URL to normalize.

TYPE: str

RETURNS DESCRIPTION
str

Normalized URL with:

- Lowercase scheme and domain
- Tracking parameters removed
- Fragments removed
- Trailing slashes standardized

Notes

This removes common tracking parameters like utm_* and fbclid to ensure papers shared from different sources are deduplicated.
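The normalization described above can be sketched with the standard library's urllib.parse. The exact tracker list is an assumption (utm_* plus a couple of common click IDs), as is the helper name `normalize`:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}  # assumed typical trackers

def normalize(url: str) -> str:
    """Lowercase scheme/host, drop tracking params and fragments,
    and trim a trailing slash from the path."""
    parts = urlsplit(url)
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        urlencode(query),
        "",  # drop the fragment
    ))

print(normalize("HTTPS://DOI.org/10.1234/example/?utm_source=slack#abstract"))
# → https://doi.org/10.1234/example
```

With this in place, the same paper shared via a tracking link and via a clean link collapses to one canonical string, which is what makes URL-based deduplication work.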

Type Reference

Type Description
SlackPaper Dataclass representing a paper found in Slack. Fields: channel_id, channel_name, shared_by, user_id, timestamp, message_ts, permalink, message_text, paper_url, reactions_count, reply_count, reaction_details.
list[SlackPaper] Returned by scrape_channel(). Each entry is a unique paper URL found in the channel.
list[str] Returned by extract_paper_urls(). Deduplicated, normalized paper URLs.