If you read a lot of arXiv articles, you have probably noticed that arXiv PDFs are named in the YYMM.number.pdf format, where YYMM is the two-digit year and month, followed by a zero-padded sequence number. This makes searching for stored articles difficult. I would much prefer to have the article’s title or author names as part of the filename.

In this post I show how to automate this process so that as soon as an arXiv PDF is saved to a directory, it is renamed to include the title in the filename.

Python rename script

At the heart of this automation is a Python script that accepts the file path as its argument, uses the arxiv Python package to look up the article title, and then renames the file to title_YYMM.number.pdf. The required arxiv package can be installed via pip.

#!/usr/bin/env python3

import arxiv
import os
import sys
import re

def get_valid_filename(s):
    """
    Return the given string converted to a string that can be used for a clean
    filename. Remove leading and trailing spaces; convert other spaces to
    underscores; and remove anything that is not an alphanumeric, dash,
    underscore, or dot.

    >>> get_valid_filename("john's portrait in 2004.jpg")
    'johns_portrait_in_2004.jpg'

    https://github.com/django/django/blob/master/django/utils/text.py
    Copyright (c) Django Software Foundation and individual contributors.
    All rights reserved.
    """
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)

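# Path of the downloaded PDF, passed as the first argument.
path = sys.argv[1]
basename = os.path.basename(path)
dirname = os.path.dirname(path)
# Drop the extension to get the arXiv ID. Note that str.strip('.pdf') would
# strip a set of characters from both ends rather than remove a suffix.
article_id = os.path.splitext(basename)[0]

# arxiv.query() is the interface of older (pre-1.0) releases of the arxiv package.
entry = arxiv.query(id_list=[article_id])
title = entry[0]['title']
title_slug = get_valid_filename(title)

# Prefix the original filename with the cleaned-up title.
new_basename = title_slug + '_' + basename
new_path = os.path.join(dirname, new_basename)

os.rename(path, new_path)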
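
The filename construction can be checked on its own, without querying arXiv or touching any files. The following is a minimal sketch under a few assumptions: the helper name build_new_path is hypothetical and not part of the script, and the path and title are sample values used only for illustration.

```python
import os
import re


def get_valid_filename(s):
    # Same cleaning rule as in the rename script.
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)


def build_new_path(path, title):
    """Hypothetical helper: return the path the file would be renamed to."""
    dirname, basename = os.path.split(path)
    return os.path.join(dirname, get_valid_filename(title) + '_' + basename)


# The article ID is the filename without its extension.
article_id = os.path.splitext('1706.03762.pdf')[0]
print(article_id)

# Sample path and title, for illustration only.
new_path = build_new_path('/tmp/papers/1706.03762.pdf',
                          'Attention Is All You Need')
print(os.path.basename(new_path))
```

Keeping the path logic in a separate function like this makes it easy to try out different naming schemes before letting the script rename real files.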

Automatically invoking the rename script

On Linux we can use incron to automatically invoke the rename script when a new file is added to a directory. incron is similar to cron, but instead of running commands based on time, it runs commands based on filesystem events. It is typically not installed on Linux distros by default but is available in most package repositories. For example, on Debian-based distros it can be installed via:

$ sudo apt-get install incron

Note that installing the package does not necessarily start the daemon, so make sure the service is running:

$ systemctl enable incron.service  # Start incron at boot time
$ systemctl start incron.service  # Start incron now

Similar to cron, incron is driven by a table where each line has the following format:

watched_path event_type command

The table can be edited with incrontab:

$ incrontab -e

Add the following line, replacing the paths to match the directory where you store your PDFs and the location of the rename script:

/PATH/TO/DIR    IN_CREATE    [[ $# =~ [0-9]+\.[0-9]+\.pdf ]] && /PATH/TO/rename_arxiv $@/$#

The IN_CREATE event fires when a new file or directory is created in the watched directory. The regular expression test in the first part of the command ensures we only call the script for files that match the arXiv filename pattern. incron provides some wildcards that can be used in the command section. The above command uses the following:

  • $@ expands to the watched directory path
  • $# expands to the event-related filename
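
To make the filter and the wildcard expansion concrete, here is a rough Python sketch of what happens for each new file. It only mimics the behaviour for illustration; incron performs the substitution itself, and the helper names expand_command and should_rename are hypothetical.

```python
import re

# Same pattern as in the incrontab entry.
ARXIV_PATTERN = re.compile(r'[0-9]+\.[0-9]+\.pdf')


def expand_command(template, watched_path, filename):
    """Hypothetical helper mimicking incron's $@ and $# substitution."""
    return template.replace('$@', watched_path).replace('$#', filename)


def should_rename(filename):
    # The bash test [[ string =~ regex ]] succeeds on a match anywhere
    # in the string, so search() is used rather than fullmatch().
    return ARXIV_PATTERN.search(filename) is not None


template = '/PATH/TO/rename_arxiv $@/$#'
for name in ['1706.03762.pdf', 'notes.pdf']:
    if should_rename(name):
        print(expand_command(template, '/PATH/TO/DIR', name))
```

Only the first filename passes the test, so the rename script would be called once, with the full path of that file as its argument.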

What about macOS?

The rename script works on macOS as well. The incron functionality can be achieved via BSD’s wait_on or the Hazel app:

Hazel renaming arXiv file