RePaper: automated, citation-style renaming of research-paper PDFs

Abstract

Research papers, journal articles, reviews and similar documents downloaded from publication sites usually arrive with opaque, publisher-assigned filenames such as bioinformatics_35_3_421.pdf, which typically have to be renamed by hand before they are useful for search and retrieval. RePaper is a local Python command-line tool that automates this step: it extracts bibliographic information from research-paper PDF files and generates consistent, configurable, citation-style filenames. For the example above, RePaper produces Langmead, B. et al. (2019) Scaling read aligners to hundreds of threads on general-purpose processors.pdf, a name that reflects the paper's content in a configurable format readable by both people and software.

RePaper is built around a DOI-only resolution workflow. Each PDF is read to gather possible DOI evidence from three sources: embedded publisher links, PDF metadata and visible text from the opening pages. These sources are not treated equally. Link annotations and embedded metadata are considered stronger evidence than DOI strings extracted from plain text, which may contain layout errors or reference-list DOIs belonging to other papers. Each DOI candidate is structurally checked, cleaned and retained with its source information. RePaper then resolves the candidates against Crossref in order of evidence strength. A DOI is accepted only when it resolves to a complete bibliographic record containing the fields needed to construct a filename: title, publication year and author information. If a stronger evidence tier yields no usable record, the tool falls back to a weaker tier; if more than one distinct usable record is found at the same tier, the file is marked for review rather than renamed by guesswork. This tiered decision process allows the tool to cope with common PDF problems, such as malformed DOI text, supplementary-data links, missing metadata or multiple DOI-bearing links in the same document. When a record is accepted, RePaper builds a configurable citation-style filename from the Crossref metadata, and the generated name is normalised and sanitised for use across filesystems.

Scope of the initial release

The 0.1.0 release (available on GitHub) identifies a paper through a Digital Object Identifier (DOI) found inside the PDF file and resolved against a Crossref record. It does not infer bibliographic data from the PDF title block, author list, page layout, or publication header, and it never reads the existing filename as bibliographic input; the existing name is only the label to be replaced. A file that does not yield a DOI resolvable to a complete Crossref record is left unchanged and reported for review.

Processing pipeline

Each PDF passes through five stages. The first four leave the PDF files untouched; only the fifth renames anything, and only when explicitly requested.

read PDF → collect DOI candidates → resolve via Crossref → decide status and build name → preview / apply / review

A run targets either a single PDF or a folder. For a folder, RePaper processes the PDF files directly inside it, sorted by name; it does not recurse into subfolders.
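The target rule above can be sketched as follows. This is a minimal illustration, not RePaper's own code: the function name collect_targets is hypothetical, and matching the .pdf extension case-insensitively is an assumption the text does not state.

```python
from pathlib import Path


def collect_targets(target: str) -> list[Path]:
    """Return the PDF files a run would process for a file or folder target."""
    path = Path(target)
    if path.is_file():
        return [path]
    # Direct children only: iterdir() does not descend into subfolders.
    # Assumption: the extension is compared case-insensitively.
    pdfs = [p for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"]
    return sorted(pdfs, key=lambda p: p.name)
```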

Stage 1 — Reading the PDF

For each file RePaper opens the document with pypdf in non-strict mode and extracts two things: the document-information dictionary (entries such as /Title, /Author, and an embedded /doi field), and the text of up to the first three pages, concatenated and capped at 20 000 characters. Encrypted files are tested with an empty password, which unlocks files that are encrypted but not password-locked; a file that does not unlock is reported as an error and processed no further. Per-page extraction failures are skipped individually so that one malformed page does not discard the rest.
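A rough sketch of this reading step, assuming the third-party pypdf package is installed. The helper names join_page_texts and read_pdf are hypothetical, and joining pages with a newline is an assumption; only the three-page limit, the 20 000-character cap, the empty-password test and the per-page skipping come from the description above.

```python
from itertools import islice

MAX_PAGES = 3        # pages read, as described above
MAX_CHARS = 20_000   # cap on the concatenated text, as described above


def join_page_texts(extractors, max_pages=MAX_PAGES, max_chars=MAX_CHARS):
    """Concatenate text from the first pages, skipping pages that fail.

    `extractors` is an iterable of zero-argument callables, one per page.
    """
    parts = []
    for extract in islice(extractors, max_pages):
        try:
            parts.append(extract() or "")
        except Exception:
            continue  # one malformed page must not discard the rest
    return "\n".join(parts)[:max_chars]  # newline joiner is an assumption


def read_pdf(path):
    """Return (document-information dict, opening text) for one PDF."""
    from pypdf import PdfReader  # imported lazily; third-party dependency

    reader = PdfReader(path, strict=False)
    # decrypt("") returns a falsy value when the empty password does not unlock.
    if reader.is_encrypted and not reader.decrypt(""):
        raise ValueError(f"{path}: encrypted and not unlocked by an empty password")
    info = dict(reader.metadata or {})  # keys such as "/Title" and "/Author"
    text = join_page_texts(page.extract_text for page in reader.pages)
    return info, text
```

Passing bound extract_text methods rather than already-extracted strings keeps the failure handling per page, so an exception on page two still leaves pages one and three available.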

The first three pages are used because the paper's own DOI, when present, is almost always on the opening page or in the document metadata, whereas later pages are dominated by reference-list DOIs that belong to other works.

Stage 2 — Collecting DOI candidates

RePaper gathers DOI candidates from three sources and records the provenance of each, because the source is an indicator of reliability. Candidates are ordered by source priority:

Priority	Source	Characteristics
0	PDF URI link annotations (first three pages)	A clickable link embedded by the publisher: structured, deliberate, resistant to text-extraction corruption.
1	Embedded `/doi` document metadata	A DOI field set in the file metadata: structured and specific to the paper.
2	Visible page text (first 15 000 characters)	Available in ordinary PDFs, but vulnerable to layout flattening and to adjacent or cited-reference DOIs.

Each candidate is stored with its DOI string, its source, its numeric priority, and the raw text or URI it was extracted from. Candidates are retained even when they later fail to resolve, so that any decision can be reconstructed from verbose output.
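One way to picture a candidate record is a small immutable dataclass. The field names, source labels and helper functions below are illustrative assumptions; only the four stored items and the 0/1/2 priorities come from the description above.

```python
from dataclasses import dataclass

# Priorities from the table above; the string labels are hypothetical.
SOURCE_PRIORITY = {"link_annotation": 0, "metadata": 1, "visible_text": 2}


@dataclass(frozen=True)
class DoiCandidate:
    doi: str        # cleaned DOI string
    source: str     # where it was found
    priority: int   # 0 is the strongest evidence
    raw: str        # raw text or URI it was extracted from


def make_candidate(doi: str, source: str, raw: str) -> DoiCandidate:
    return DoiCandidate(doi, source, SOURCE_PRIORITY[source], raw)


def order_candidates(candidates):
    """Strongest source first; sorted() is stable, so ties keep discovery order."""
    return sorted(candidates, key=lambda c: c.priority)
```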

Extraction and validation rules

Before a string becomes a candidate it must pass both a structural match and a plausibility check. URI values are URL-decoded first, then scanned with the pattern \b10\.\d{4,9}/[^\s"<>[\]{}]+, and the match has trailing display punctuation (.,;:)]'") stripped. The result is then accepted only if it:

begins with 10. followed by a 4–9 digit registrant code and a slash;
contains at least one alphanumeric character in the suffix;
is no longer than 100 characters;
is not an ISSN string masquerading as a DOI; and
does not end in /full, a common false positive from publisher landing-page URLs.

Matching the DOI form is a precondition for consideration, not evidence that the candidate belongs to this paper. A syntactically valid candidate may be a cited reference's DOI, a corrupted extraction, or an unrelated DOI carried inside a publisher URL.
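The extraction and plausibility rules above might look roughly like this. The function names are hypothetical, and the ISSN check is deliberately left out because the text does not say how such strings are recognised; everything else follows the listed rules.

```python
import re
from urllib.parse import unquote

# The pattern quoted above, with the square brackets escaped inside the class.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>\[\]{}]+')
TRAILING_PUNCTUATION = ".,;:)]'\""
MAX_DOI_LENGTH = 100


def is_plausible(doi: str) -> bool:
    """Apply the plausibility rules (the ISSN rule is omitted in this sketch)."""
    prefix, _, suffix = doi.partition("/")
    if not re.fullmatch(r"10\.\d{4,9}", prefix):
        return False
    if not any(ch.isalnum() for ch in suffix):
        return False
    if len(doi) > MAX_DOI_LENGTH:
        return False
    if doi.endswith("/full"):
        return False
    return True


def extract_dois(raw: str, is_uri: bool = False) -> list[str]:
    """Return distinct plausible DOI strings found in a URI or a block of text."""
    text = unquote(raw) if is_uri else raw
    found = []
    for match in DOI_PATTERN.finditer(text):
        doi = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if is_plausible(doi) and doi not in found:
            found.append(doi)
    return found
```

Note that an over-captured string such as 10.1101/gr.8.6.621Access still passes these checks, which is exactly why a structural match is only a precondition and resolution has the final say.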

A representative case from the test corpus shows why provenance is kept. One paper exposes the same DOI through two sources, one clean and one corrupted by layout flattening:

link annotation: 10.1101/gr.8.6.621
visible text:    10.1101/gr.8.6.621Access

The link-annotation candidate resolves and is selected; the malformed visible-text candidate is retained for diagnostics rather than discarded.

Stage 3 — Resolution and the confidence decision

Candidates are resolved against the Crossref REST API one source-priority tier at a time, strongest tier first. Identical DOI strings are resolved only once per run. A resolved Crossref record is treated as usable only when it contains all three of a title, a publication year, and at least one author — the fields required to build a filename.

Within a single tier, the number of distinct DOIs that resolve to a usable record determines the outcome:

Distinct usable records in the tier	Outcome
Exactly one	Accept that record; lower-priority tiers are not consulted.
Two or more	Conflicting evidence; the file is sent to `REVIEW`.
None	Fall through to the next tier.

Tiers are processed strongest-first, and the search stops at the first tier that yields exactly one usable record. If every tier is exhausted without one, the file is marked REVIEW, unless resolution failed through a genuine network or API fault, in which case it is marked ERROR once the retry policy is exhausted.

The consequence is that source priority is an evidence-ordering policy, not a rule that trusts the first DOI-shaped string. A stronger source is examined first, but it is accepted only if exactly one distinct DOI in its tier resolves to a usable record; a stronger source that resolves to nothing usable defers to a weaker one.
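The tier rule can be condensed into a short function. This is a sketch under simplifying assumptions: candidates are plain (doi, priority) pairs, resolve is any callable returning a record dict or None, records use the hypothetical keys title, year and authors, and the ERROR path for network faults is not modelled.

```python
def is_usable(record) -> bool:
    """A record is usable only with a title, a year and at least one author."""
    return bool(record.get("title")) and bool(record.get("year")) and bool(record.get("authors"))


def decide(candidates, resolve):
    """Return (outcome, doi, record) for a list of (doi, priority) pairs."""
    resolved = {}  # each distinct DOI string is resolved at most once
    for priority in sorted({p for _, p in candidates}):
        usable = {}
        for doi, candidate_priority in candidates:
            if candidate_priority != priority:
                continue
            if doi not in resolved:
                resolved[doi] = resolve(doi)
            record = resolved[doi]
            if record is not None and is_usable(record):
                usable[doi] = record
        if len(usable) == 1:
            doi, record = next(iter(usable.items()))
            return ("ACCEPT", doi, record)  # lower tiers are not consulted
        if len(usable) >= 2:
            return ("REVIEW", None, None)   # conflicting evidence in one tier
        # no usable record in this tier: fall through to the next one
    return ("REVIEW", None, None)
```

Because the loop returns as soon as a tier holds exactly one usable record, a weaker tier can only win when every stronger tier resolved to nothing usable.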

Worked examples

Three cases from the test corpus show the policy operating on real files:

1. McQueen et al. (1998), Genome Research:

DOI candidates collected

10.1101/gr.8.6.621 — link annotation
10.1101/gr.8.6.621Access — visible text

Resolution

The tier-0 link DOI resolves to a complete record and is accepted. The malformed visible-text variant is never needed.

Outcome: rename

2. Langmead et al. (2019), Bioinformatics:

DOI candidates collected

10.1093/bioinformatics/bty648#supplementary-data — link annotation
10.1093/bioinformatics/bty648 — visible text

Resolution

The tier-0 link DOI points at a supplementary-data anchor and does not resolve in Crossref. Resolution therefore falls through to the visible-text DOI, which resolves successfully.

Outcome: rename

3. Review article linking a related paper:

DOI candidates collected

10.1016/j.sbi.2018.11.003 — link annotation
10.1016/j.sbi.2019.06.006 — link annotation

Resolution

Two distinct tier-0 DOIs each resolve to complete records, so the evidence is genuinely ambiguous.

Outcome: review — left unchanged

The Langmead case is the clearest illustration that a higher-priority source is preferred but not blindly trusted: its link annotations are structurally valid yet point past the article's own record, so a lower-priority source legitimately wins. The review-article case shows the opposite guard: when a strong source offers two equally complete but different records, the file is referred for review rather than resolved by guesswork.

Crossref requests, retries, and caching

Each distinct DOI is looked up at https://api.crossref.org/v1/works/{doi}. Requests carry a user-supplied contact email as the mailto query parameter and an identifying User-Agent (RePaper/0.1.0 (mailto:…)), as expected for Crossref's polite pool, and are paced sequentially with a default delay of 0.20 s between requests (at most five requests per second). Only requests that actually reach the network are paced; a cache hit incurs no delay.
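A sketch of how such a request and its pacing could be assembled with the standard library. The names build_request and Pacer are hypothetical, and percent-encoding the DOI while leaving slashes intact is an assumption rather than documented RePaper behaviour. Nothing here touches the network.

```python
import time
from urllib.parse import quote, urlencode
from urllib.request import Request

API_ROOT = "https://api.crossref.org/v1/works/"


def build_request(doi: str, email: str) -> Request:
    """Build (but do not send) a Crossref lookup with mailto and User-Agent."""
    # Assumption: the DOI is percent-encoded with "/" left as-is.
    url = f"{API_ROOT}{quote(doi, safe='/')}?{urlencode({'mailto': email})}"
    return Request(url, headers={"User-Agent": f"RePaper/0.1.0 (mailto:{email})"})


class Pacer:
    """Enforce a minimum gap between consecutive network requests."""

    def __init__(self, delay=0.20, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Call immediately before a request that will reach the network."""
        now = self._clock()
        if self._last is not None and now - self._last < self._delay:
            self._sleep(self._delay - (now - self._last))
            now = self._clock()
        self._last = now
```

A cache hit would simply skip the wait() call, which is how only real network requests end up being paced.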

Network-level failures are retried up to three times after the initial attempt. Retryable conditions are the HTTP statuses 408, 429, 500, 502, 503, and 504, plus URL errors and read timeouts. Wait times use exponential backoff from one second with positive jitter, capped at 30 seconds; a numeric Retry-After header takes precedence when supplied. A Crossref 404 is not a network error — it means the DOI is unknown to Crossref, and the candidate is recorded as simply unresolved.
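The wait-time rule can be illustrated as below. The doubling factor, the size of the jitter (up to half the base wait) and applying the cap after adding jitter are all assumptions; the text fixes only the one-second start, positive jitter, the 30-second cap and the precedence of a numeric Retry-After.

```python
import random

RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}
MAX_RETRIES = 3      # retries after the initial attempt
BASE_WAIT = 1.0      # seconds
MAX_WAIT = 30.0      # seconds


def is_retryable(status: int) -> bool:
    """A 404 is deliberately absent: it means the DOI is unknown to Crossref."""
    return status in RETRYABLE_STATUSES


def retry_wait(attempt: int, retry_after=None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # numeric header wins
        except ValueError:
            pass  # non-numeric value: fall back to backoff
    base = BASE_WAIT * (2 ** attempt)             # assumption: doubling
    jitter = random.uniform(0.0, 0.5 * base)      # assumption: up to +50%
    return min(base + jitter, MAX_WAIT)
```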

Successful responses are cached locally for 30 days. The cache key is the SHA-256 digest of the request URL with the mailto parameter removed, so the contact email is never written to disk; entries are stored as JSON and replaced atomically. A platform-appropriate cache directory is used (for example ~/Library/Caches/RePaper/crossref on macOS), and --no-cache bypasses it.
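The cache-key and atomic-write ideas can be sketched with the standard library. The function names and the on-disk entry layout (a stored_at timestamp next to the payload) are assumptions made for illustration; the SHA-256 key over the mailto-free URL, the JSON format and the atomic replacement follow the description above.

```python
import hashlib
import json
import os
import tempfile
import time
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url: str) -> str:
    """SHA-256 of the request URL with the mailto parameter removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "mailto"]
    cleaned = urlunsplit(parts._replace(query=urlencode(query)))
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def store_response(cache_dir: Path, url: str, payload: dict) -> Path:
    """Write one cache entry as JSON, replacing any old entry atomically."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{cache_key(url)}.json"
    # Write to a temporary file in the same directory, then swap it in.
    fd, temp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"stored_at": time.time(), "payload": payload}, handle)
    os.replace(temp_name, target)
    return target
```

Hashing the cleaned URL means two users with different contact emails would derive the same key for the same DOI, and the email itself never appears in a file name or file body.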

Stage 4 — Status and filename construction

The resolution outcome and the proposed name together produce one of four per-file statuses. A file whose PDF cannot be read is an ERROR. Otherwise, if a complete record was accepted, the file is a rename when the proposed name differs from the current one and is already correct when it does not. If no complete record was accepted, the file is an ERROR when resolution failed through a genuine network or API fault, and a REVIEW in every other case.

REVIEW reasons distinguish no DOI candidate found, DOI did not resolve, and record has incomplete Crossref metadata, so the report explains why each file was left alone.
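The status rules reduce to a small decision function. The function name, its parameters and the RENAME and ALREADY_CORRECT labels are hypothetical stand-ins for whatever the tool uses internally.

```python
def classify(readable: bool, accepted: bool, proposed_name, current_name, fault: bool) -> str:
    """Map one file's resolution outcome to a per-file status."""
    if not readable:
        return "ERROR"                # the PDF itself could not be read
    if accepted:
        if proposed_name != current_name:
            return "RENAME"
        return "ALREADY_CORRECT"
    # No complete record was accepted.
    return "ERROR" if fault else "REVIEW"
```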

When a record is accepted, the filename is built from Crossref fields:

Title — the first title entry, with inline HTML handled (superscript and subscript tags removed but their text kept, other tags replaced by spaces).
Year — the first available of published-print, published-online, published, then issued.
Authors — formatted as Family, I., preserving Crossref's separate family field so compound surnames such as de Laat stay intact; an author lacking a family field falls back to a best-effort split.

These fields populate a template. The default is {authors_first} ({year}) {title}, where {authors_first} is the first author with et al. appended when more authors exist. Other fields are available — {authors_all}, {authorN}, {etal}, {publication}, {author_count} — and every template must include {title}.
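A simplified sketch of the field handling and the default template, working on a Crossref-style record dict. The helper names are hypothetical, only the first initial of the given name is used (the text does not say how several given names are treated), and the fallback for authors without a family field is not shown.

```python
import re

YEAR_KEYS = ("published-print", "published-online", "published", "issued")
DEFAULT_TEMPLATE = "{authors_first} ({year}) {title}"


def clean_title(raw: str) -> str:
    """Drop sup/sub tags but keep their text; turn other tags into spaces."""
    text = re.sub(r"</?(?:sup|sub)\b[^>]*>", "", raw, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def pick_year(record: dict):
    """First available year, in the preference order listed above."""
    for key in YEAR_KEYS:
        date_parts = (record.get(key) or {}).get("date-parts") or []
        if date_parts and date_parts[0] and date_parts[0][0]:
            return str(date_parts[0][0])
    return None


def format_author(author: dict) -> str:
    """Format as 'Family, I.' using the separate family field."""
    family = (author.get("family") or "").strip()
    given = (author.get("given") or "").strip()
    return f"{family}, {given[0].upper()}." if given else family


def build_stem(record: dict, template: str = DEFAULT_TEMPLATE) -> str:
    """Fill the template from a record already known to be usable."""
    authors = record["author"]
    first = format_author(authors[0])
    if len(authors) > 1:
        first += " et al."
    return template.format(
        authors_first=first,
        year=pick_year(record),
        title=clean_title(record["title"][0]),
    )
```

With a hand-written, abbreviated record for the Langmead example, build_stem yields the stem of the filename quoted in the abstract.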

Every field is sanitised for cross-platform filesystem use: Unicode ligatures and dash variants are normalised, whitespace is collapsed, characters illegal on Windows (<>:"/\|?*) become underscores, and trailing spaces and dots are removed. The finished name is capped at 180 characters by default (minimum 32, adjustable with --max-filename-length); if it is too long the title is truncated first, and the whole stem only as a last resort.
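The sanitising and length rules might be approximated as follows. Using NFKC normalisation for ligatures, the particular set of dash code points, and stripping trailing dots once on the text passed in are assumptions; fit_stem is a hypothetical helper that takes the text before the title and the title separately so the title can be cut first.

```python
import re
import unicodedata

# Assumption: these dash variants are all mapped to an ASCII hyphen.
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"), "-")
ILLEGAL = re.compile(r'[<>:"/\\|?*]')  # characters illegal on Windows


def sanitise(text: str) -> str:
    """Normalise ligatures and dashes, replace illegal characters, tidy spaces."""
    text = unicodedata.normalize("NFKC", text).translate(DASHES)
    text = ILLEGAL.sub("_", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(" .")


def fit_stem(prefix: str, title: str, max_len: int = 180) -> str:
    """Shorten the title first; cut the whole stem only as a last resort."""
    room = max_len - len(prefix)
    if len(title) <= room:
        return prefix + title
    if room >= 1:
        return (prefix + title[:room]).rstrip(" .")
    return (prefix + title)[:max_len].rstrip(" .")
```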

Batch processing: retries and conflicts

The four stages above run per file across the whole target set, after which two batch-level passes run over the collected results.

First, files that failed only because their metadata request errored are retried as a group, up to three attempts, with a 30-second wait before the second and third. This is separate from the per-request network retry described earlier: it gives a transient Crossref outage time to clear without restarting the whole run. PDF-read failures, such as a password-protected file, are not retried.

Second, a conflict pass downgrades any planned rename whose target name would collide — either two files proposing the same name, or a name that already exists in the folder and is not itself freed by another rename in the same batch. A conflicting rename becomes a REVIEW; a file that is already correctly named is never downgraded merely because another file wants its name.
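The conflict pass can be sketched as a fixed-point computation over the planned renames. The function name and data shapes are hypothetical, and re-checking after each downgrade (because a downgraded rename no longer frees its own name) is an assumption about how the cascade is handled. Files that are already correctly named never appear in the plan, so they cannot be downgraded here.

```python
from collections import Counter


def find_conflicts(plan: dict, existing: set) -> set:
    """Return the source names whose planned rename must become REVIEW.

    plan maps current name -> proposed name (renames only);
    existing is the set of file names currently in the folder.
    """
    conflicts = set()
    while True:
        active = {src: dst for src, dst in plan.items() if src not in conflicts}
        target_counts = Counter(active.values())
        freed = set(active)  # names vacated by renames that are still planned
        newly = {
            src
            for src, dst in active.items()
            if target_counts[dst] > 1 or (dst in existing and dst not in freed)
        }
        if not newly:
            return conflicts
        conflicts |= newly
```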

Stage 5 — Applying renames

By default a run is a non-writing preview: it reports the names it would assign and changes nothing. Renames occur only with --apply, or with --verify, which lists the full plan and prompts for confirmation first. REVIEW files are never modified in any mode.

When renames are applied they run in two phases so that the batch is independent of order. Every source is first moved to a unique hidden temporary name, and only then is each staged file moved to its final destination. Staging every source aside before placing any destination means that renames which reuse a name freed by another rename, swap two names, or differ only in case (which a single move would treat as the same file on a case-insensitive filesystem) all complete without overwriting an unrelated file. A destination already held by an unrelated file is skipped rather than overwritten. Each action ends as done, skipped (missing source or occupied destination), or error, and the final report tallies the outcome.
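The two-phase approach can be sketched with pathlib. The function name, the temporary-name pattern and the choice to move a skipped file back to its original name are assumptions; the sketch also ignores races with other processes writing to the same folder.

```python
import uuid
from pathlib import Path


def apply_renames(folder, plan: dict) -> dict:
    """Rename files inside `folder`; plan maps current name -> new name."""
    folder = Path(folder)
    outcome = {}
    staged = {}
    # Phase 1: move every source aside to a unique hidden temporary name.
    for old_name in plan:
        source = folder / old_name
        if not source.exists():
            outcome[old_name] = "skipped"      # missing source
            continue
        temp = folder / f".repaper-{uuid.uuid4().hex}.tmp"
        try:
            source.rename(temp)
            staged[old_name] = temp
        except OSError:
            outcome[old_name] = "error"
    # Phase 2: place each staged file at its final destination.
    for old_name, temp in staged.items():
        destination = folder / plan[old_name]
        original = folder / old_name
        try:
            if not destination.exists():
                temp.rename(destination)
                outcome[old_name] = "done"
            elif not original.exists():
                temp.rename(original)          # occupied destination: put it back
                outcome[old_name] = "skipped"
            else:
                outcome[old_name] = "error"    # file stays under its temporary name
        except OSError:
            outcome[old_name] = "error"
    return outcome
```

Because both files in a swap are staged before either destination is written, neither rename ever sees the other's original name as occupied.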

Diagnostics

Standard output is a concise count summary (Retitle/Retitled, Review, Error, and Already named when applicable). Progress is written to stderr per file as it is classified. --verbose adds the selected DOI and its source, every collected candidate, and the resolved Crossref title, publication, year, and authors — the provenance needed to audit a DOI-only result without reference to the prior filename.

Limitations and future direction

DOI syntax alone does not establish that a DOI belongs to the paper; the tiered, single-usable-record policy is what guards against acting on the wrong one.
Link annotations may carry several DOI-bearing URLs, and visible text may merge a DOI with adjacent prose or include cited DOIs; conflicting resolutions are reviewed rather than guessed.
PDF text extraction and link annotations may be absent or malformed.
Title-, author-, and layout-based inference is not implemented. It is the natural next stage, and the tiered, evidence-ranked design accommodates it directly: a future strategy enters as an additional lower-priority evidence source, subject to the same requirement to resolve unambiguously or defer to review.
Progressive candidate truncation — trimming an over-captured suffix as a bounded fallback once link and metadata evidence is exhausted — is a possible later refinement.

RePaper resolves metadata through the Crossref REST API and relies on the DOI system; it is not affiliated with or endorsed by either.