RePaper: automated, citation-style renaming of research-paper PDFs

Summary

Research papers downloaded from publisher and repository sites usually arrive with opaque, publisher-assigned filenames such as bioinformatics_35_3_421.pdf, which often have to be retitled by hand before they can be found again. RePaper is a local Python command-line tool that automates this task. It extracts a paper's own Digital Object Identifier (DOI) from the PDF, resolves the corresponding bibliographic record through the Crossref REST API, and proposes a consistent, configurable, citation-style filename. With RePaper, the example above becomes the more descriptive:

Langmead, B. et al. (2019) Scaling read aligners to hundreds of threads on general-purpose processors.pdf

The tool's governing principle is conservatism. RePaper renames a file only when it has unambiguous, resolved evidence of the paper's identity; in every other case it leaves the file untouched and flags it for human review. The central design problem it addresses is therefore not extracting a DOI, since many strings in a PDF may match the DOI pattern, but determining which DOI, if any, actually identifies the paper at hand.

Background

A descriptive, citation-style filename allows a paper to be located, sorted, and cited without opening it, and degrades gracefully across filesystems and reference managers. Publisher-assigned names typically provide none of this, and manual retitling can be slow and inconsistent across a large collection. RePaper aims to produce uniform, human- and machine-readable names automatically, while never silently mislabelling a file — a wrong rename being more costly than no rename, since it hides the error behind a plausible-looking name.

How it works

Pipeline overview

Each PDF passes through five stages. The first four are read-only; only the fifth modifies the filesystem, and only when explicitly requested.

read PDF → collect DOI candidates → resolve via Crossref → decide status and build name → preview / apply

A run targets either a single PDF or a folder. For a folder, RePaper processes the PDF files directly within it, sorted by name, and does not recurse into subfolders.
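The target-selection rule above can be sketched as follows. This is a minimal illustration, not RePaper's own code: the function name target_pdfs is hypothetical, and matching the .pdf extension case-insensitively is an assumption the text does not state.

```python
from pathlib import Path


def target_pdfs(target):
    """Return the PDFs a run would consider: one file, or a folder's direct children."""
    path = Path(target)
    if path.is_file():
        # Single-file run: accept it only if it looks like a PDF (assumed rule).
        return [path] if path.suffix.lower() == ".pdf" else []
    # Folder run: direct children only (iterdir does not recurse), sorted by name.
    return sorted(
        (p for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )
```

Using iterdir rather than a recursive glob is what keeps subfolders out of the run.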

Read. RePaper opens each document and extracts its metadata dictionary together with the text of its opening pages. Only the first pages are read, because a paper's own DOI, when present, almost always appears on the opening page or in the document metadata, whereas later pages are dominated by reference-list DOIs belonging to other works.

Collect candidates. RePaper gathers DOI candidates from three sources and records the provenance of each, because the source is an indicator of reliability. Candidates are ordered by source priority:

Priority	Source	Why it ranks here
0	PDF link annotations	A clickable DOI link embedded by the publisher: structured, deliberate, and resistant to text-extraction corruption.
1	Embedded DOI metadata (document-info and XMP)	A DOI recorded in the file's own metadata: structured and specific to the paper.
2	Visible page text	Available in ordinary PDFs, but vulnerable to layout flattening and to adjacent or cited-reference DOIs.

Matching the DOI form is a precondition for consideration, not evidence that a candidate belongs to this paper: a syntactically valid string may be a cited reference's DOI, a corrupted extraction, or an unrelated DOI carried inside a publisher URL. Candidates are retained even when they later fail to resolve, so that any decision can be reconstructed from the verbose log.
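Candidate collection with provenance might look like the sketch below. The DOI pattern, the Candidate type, and the function name are all assumptions made for illustration; RePaper's actual pattern is not given here, and the inputs are taken as already-extracted strings rather than read from a PDF.

```python
import re
from dataclasses import dataclass

# Assumed DOI shape for illustration only; not RePaper's own pattern.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

# Source priorities from the table above (lower number = stronger source).
PRIORITY = {"link": 0, "metadata": 1, "text": 2}


@dataclass(frozen=True)
class Candidate:
    doi: str
    source: str
    priority: int


def collect_candidates(links, metadata_values, page_text):
    """Gather DOI-shaped strings from each source, recording where each came from."""
    sources = (("link", links), ("metadata", metadata_values), ("text", [page_text]))
    found = []
    for source, chunks in sources:
        for chunk in chunks:
            for match in DOI_RE.finditer(chunk):
                # Trim trailing punctuation that commonly follows a DOI in prose.
                doi = match.group(0).rstrip(".,;)")
                candidate = Candidate(doi, source, PRIORITY[source])
                if candidate not in found:
                    found.append(candidate)
    # Strongest source first; candidates are kept even if they later fail to resolve.
    return sorted(found, key=lambda c: c.priority)
```

Note that a pattern match alone cannot tell a clean DOI from one fused with adjacent prose: both pass this stage, and only resolution separates them.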

Resolve and decide. Candidates are resolved against Crossref and accepted only under the confidence model described in the section below.

Build the name. When a record is accepted, RePaper constructs a filename from the resolved Crossref fields, namely author, year, and title, using a configurable template (default: {authors_first} ({year}) {title}), and sanitises the result for cross-platform filesystem use.

Apply. By default a run is a non-writing preview: it reports the names it would assign and changes nothing. Renames occur only with --apply, or with --verify, which lists the full plan and prompts for confirmation. Files marked for review are never modified in any mode.
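The name-building stage can be illustrated with a short sketch. It assumes the record follows the Crossref works JSON layout (a title list, an author list with family and given keys, and issued date-parts); the "et al." rule and the set of stripped characters are inferred from the example filename in the Summary and are assumptions, as are all function names.

```python
import re

DEFAULT_TEMPLATE = "{authors_first} ({year}) {title}"


def first_author_label(authors):
    """Format 'Family, G.' and append 'et al.' when there are co-authors (assumed rule)."""
    first = authors[0]
    label = first["family"]
    given = first.get("given", "").strip()
    if given:
        label += f", {given[0]}."
    if len(authors) > 1:
        label += " et al."
    return label


def sanitise(name):
    """Replace characters that are unsafe on common filesystems and tidy whitespace."""
    name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', " ", name)
    return re.sub(r"\s+", " ", name).strip(" .")


def build_filename(message, template=DEFAULT_TEMPLATE):
    """Fill the template from a Crossref-style record and add the .pdf extension."""
    fields = {
        "authors_first": first_author_label(message["author"]),
        "year": message["issued"]["date-parts"][0][0],
        "title": message["title"][0],
    }
    return sanitise(template.format(**fields)) + ".pdf"


# Hand-written stand-in for a resolved record, not a real API response.
record = {
    "author": [{"family": "Langmead", "given": "Ben"}, {"family": "Coauthor"}],
    "issued": {"date-parts": [[2019]]},
    "title": ["Scaling read aligners to hundreds of threads on general-purpose processors"],
}
print(build_filename(record))
```

Sanitising after templating, rather than per field, means separators introduced by a custom template are cleaned as well.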

The confidence model

The resolution stage is where RePaper decides whether a candidate can be trusted, and is the tool's principal contribution. Candidates are resolved against Crossref one source-priority tier at a time, strongest tier first. A resolved record is treated as usable only if it contains a title, a publication year, and at least one author, which are the fields required to build a filename. Within a tier, the number of distinct DOIs that resolve to a usable record determines the outcome:

Distinct usable records in the tier	Outcome
Exactly one	Accept that record; lower-priority tiers are not consulted.
Two or more	Conflicting evidence; the file is sent to review.
None	Fall through to the next tier.

The search stops at the first tier yielding exactly one usable record. If every tier is exhausted without one, the file is left for review (or reported as an error if resolution failed through a genuine network or API fault).

The consequence is that source priority is an evidence-ordering policy, not a rule that trusts the first DOI-shaped string. A stronger source is examined first, but is accepted only when exactly one distinct DOI in its tier resolves to a usable record; a stronger source that resolves to nothing usable defers to a weaker one, and a strong source offering two equally complete but different records triggers review rather than a guess.
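The tier-by-tier policy can be expressed compactly. The sketch below is an illustration under stated assumptions: the record keys (title, year, authors), the function names, and the injected resolve callable are hypothetical, and the separate error outcome for network or API faults is left out.

```python
from collections import defaultdict


def is_usable(record):
    """A record is usable only if it has a title, a year, and at least one author."""
    return bool(record.get("title")) and bool(record.get("year")) and bool(record.get("authors"))


def decide(candidates, resolve):
    """candidates: iterable of (priority, doi); resolve: doi -> record dict or None."""
    tiers = defaultdict(list)
    for priority, doi in candidates:
        if doi not in tiers[priority]:  # count distinct DOIs within a tier
            tiers[priority].append(doi)

    for priority in sorted(tiers):  # strongest tier (lowest number) first
        usable = {}
        for doi in tiers[priority]:
            record = resolve(doi)
            if record and is_usable(record):
                usable[doi] = record
        if len(usable) == 1:
            doi, record = next(iter(usable.items()))
            return ("retitle", doi, record)  # lower tiers are never consulted
        if len(usable) > 1:
            return ("review", None, None)  # conflicting evidence
        # No usable record in this tier: fall through to the next one.
    return ("review", None, None)
```

Passing resolve in as a parameter keeps the decision logic testable without network access, since a dictionary lookup can stand in for Crossref.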

Worked examples

Three cases from the test corpus show the policy operating on real files:

Paper	DOI candidates (source)	Resolution
McQueen et al. (1998), Genome Research	`10.1101/gr.8.6.621` (link); `10.1101/gr.8.6.621Access` (visible text)	The link DOI resolves to a complete record and is accepted; the malformed text variant is never needed. → rename
Langmead et al. (2019), Bioinformatics	`10.1093/bioinformatics/bty648` (link, from a URL carrying a `#supplementary-data` fragment); `10.1093/bioinformatics/bty648` (visible text)	The link URL's `#supplementary-data` fragment is removed during normalisation, leaving a DOI that resolves to a complete record; it is accepted from the link tier. → rename
A review article linking a related paper	`10.1016/j.sbi.2018.11.003` and `10.1016/j.sbi.2019.06.006`, both links	Two distinct link DOIs each resolve to a complete record, so the evidence is genuinely ambiguous. → review (left unchanged)

The Langmead case shows normalisation in action: its link URL is structurally valid yet carries a #supplementary-data display fragment that would not resolve verbatim; stripping it recovers the article's own DOI, which the link tier then accepts. The review-article case shows the complementary guard: when a strong source offers two equally complete but different records, the file is referred for review rather than resolved by guesswork.
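Fragment stripping of the kind described for the Langmead case can be sketched with the standard library. The function name, the DOI pattern, and the example URL are assumptions for illustration; the actual publisher URL and RePaper's normalisation rules are not reproduced here.

```python
import re
from urllib.parse import unquote, urldefrag, urlparse

# Assumed DOI shape for illustration only.
DOI_IN_PATH = re.compile(r"10\.\d{4,9}/\S+")


def doi_from_link(url):
    """Drop the URL fragment, then return the DOI-shaped part of the path, if any."""
    without_fragment, _fragment = urldefrag(url)
    # Using only the path component also discards any query string.
    path = unquote(urlparse(without_fragment).path)
    match = DOI_IN_PATH.search(path)
    return match.group(0).rstrip("/") if match else None


# Hypothetical link target carrying a display fragment.
print(doi_from_link("https://doi.org/10.1093/bioinformatics/bty648#supplementary-data"))
```

This handles fragments only. A text candidate fused with adjacent prose, such as the malformed McQueen variant, cannot be repaired this way and is instead filtered out by failing to resolve.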

Evaluation

Corpus

RePaper was evaluated against an assembled corpus of 74 research-paper PDFs: a mix of primary research articles, reviews, and shorter communications, drawn from across biological, bioinformatics, clinical, physics, mathematics, materials, and computational fields. Every file was left under its original publisher- or repository-assigned name. The corpus is disciplinarily broad but modest in size, and is not a random sample of the literature.

Method

Two independent processes examined the same files, and their outputs were cross-tabulated:

RePaper was run in its default, non-writing preview mode, proposing filenames and recording a per-file outcome and reason without altering any file.
An independent DOI scan searched the same files for any DOI-like string across visible text, metadata, and link annotations, plus a raw catch-all over the decompressed file. This scan used three general-purpose third-party tools — Poppler (pdftotext), exiftool, and qpdf — together with a Crossref-recommended DOI pattern rather than a copy of RePaper's own pattern, so that the two processes shared no extraction code.

The independent scan provides corroborating evidence only. The presence of a DOI string does not mean a rename was warranted, since the string may belong to a cited work; nor does the absence of a string prove that the paper has no DOI. In addition, every proposed retitle was verified by inspection against the source PDF, confirming that the first author, year, and title matched the paper and that the filename was well formed.

Results

RePaper retitled 37 files, left 37 for review, and reported no errors.

Outcome	Files
Retitle	37
Review	37
Error	0

Cross-tabulating each outcome against whether the independent scan found any DOI string:

RePaper outcome	DOI string present	No DOI string	Total
Retitle	37	0	37
Review	24	13	37
Total	61	13	74

Two cells confirm the safety model directly. No file was retitled for which the independent toolchain found no DOI string at all (retitle + no string = 0): RePaper does not act without evidence. And for thirteen review files, the independent toolchain, reading the same text, metadata, and annotations, likewise found no DOI string (review + no string = 13), corroborating the decision to leave them alone.

The 24 files RePaper left for review despite a DOI string being present demonstrate that presence is not the same as an actionable identifier. Twenty-three are correct declines: most are arXiv preprints whose only own-identifier is an arXiv DOI that does not resolve to a complete journal record, and the remainder carry only reference-list DOIs from cited works (one with genuinely conflicting candidates). The twenty-fourth is a deliberate limitation: a paper that records its own DOI solely on an appended repository "stamp" page at the end of the file, outside the opening-pages window RePaper reads by design.

On these figures:

Precision: 37/37. Every automatic retitle was manually verified correct. There were no false positives and no file was renamed on ambiguous evidence.
Recall: 37/38 (~97%). Of the thirty-eight files carrying a resolvable own-DOI, RePaper recovered thirty-seven; the single miss is the back-matter-stamp case above.
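The arithmetic behind these two figures, and the consistency of the cross-tabulation, can be checked in a few lines. The counts are copied from the tables above; the variable names are illustrative only.

```python
# Counts copied from the results tables above.
crosstab = {
    "retitle": {"doi_string": 37, "no_string": 0},
    "review": {"doi_string": 24, "no_string": 13},
}
verified_correct = 37    # retitles confirmed correct by manual inspection
resolvable_own_doi = 38  # files carrying a resolvable own-DOI

retitled = sum(crosstab["retitle"].values())
total = sum(sum(row.values()) for row in crosstab.values())

precision = verified_correct / retitled
recall = verified_correct / resolvable_own_doi
print(f"files={total} precision={precision:.1%} recall={recall:.1%}")
```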

Interpretation

Across a disciplinarily broad set, every filename RePaper proposed was correct, and it never acted without a resolved DOI. The independent scan adds a second, corroborating line of evidence: RePaper's review decisions track the genuine absence or non-resolution of an actionable DOI rather than missed identifiers, with one well-understood and deliberate exception. These results characterise behaviour on this specific corpus and should be read as such. Resolution depends on the live Crossref API, so the recall figure reflects the state of Crossref records at the time of the run, and a high-confidence retitle remains a strong proposal rather than a guarantee of bibliographic correctness.

Limitations and future work

DOI syntax alone does not establish that a DOI belongs to the paper; the tiered, single-usable-record policy is what guards against acting on the wrong one.
Link annotations may carry several DOI-bearing URLs, and visible text may merge a DOI with adjacent prose or include cited DOIs; conflicting resolutions are reviewed rather than guessed.
PDF text extraction and link annotations may be absent or malformed in some files.
Title-, author-, and layout-based inference is not yet implemented. It is the natural next stage, and the evidence-ranked design accommodates it directly: a future strategy enters as an additional lower-priority source, subject to the same requirement to resolve unambiguously or defer to review.

Conclusion

RePaper demonstrates that automated, citation-style renaming of research-paper PDFs can be made safe by treating DOI extraction as an evidence problem rather than a pattern-matching one. The tiered, single-usable-record confidence model ensures that the tool acts only on an unambiguous, resolved identifier and otherwise defers to a human, and the evaluation shows this behaviour holding on a broad corpus: 37/37 precision, ~97% recall, and review decisions independently corroborated as genuine non-resolutions rather than missed identifiers. RePaper is therefore best understood as a conservative, auditable aid to bibliographic file management: one that improves the retrieval of a paper collection, not a substitute for verifying a paper's metadata where that matters.

RePaper resolves metadata through the Crossref REST API and relies on the DOI system; it is not affiliated with or endorsed by either. Full installation and usage documentation is available in the project repository.