If you read a lot of arXiv articles, you have probably noticed that arXiv PDFs are named in the YYMM.number.pdf format, where YYMM specifies the year and month and is followed by a zero-padded sequence number. This makes searching for stored articles difficult. I would much prefer to have the article’s title or author names as part of the filename.
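To make the naming scheme concrete, here is a minimal sketch that splits such a filename into its parts. The helper name parse_arxiv_filename and the regular expression are my own illustration, not part of any arXiv tooling, and the sketch assumes two-digit years always mean 20YY.

```python
import re

# Hypothetical pattern for names like "1706.03762.pdf" (YYMM.number.pdf)
ARXIV_PDF = re.compile(r"^(?P<yy>\d{2})(?P<mm>\d{2})\.(?P<number>\d+)\.pdf$")


def parse_arxiv_filename(name):
    """Return (year, month, number) for an arXiv-style PDF name, or None."""
    match = ARXIV_PDF.match(name)
    if match is None:
        return None
    # Assumption: two-digit years are interpreted as 20YY
    year = 2000 + int(match.group("yy"))
    month = int(match.group("mm"))
    if not 1 <= month <= 12:
        return None
    # Keep the sequence number as a string to preserve the zero padding
    return year, month, match.group("number")


print(parse_arxiv_filename("1706.03762.pdf"))  # (2017, 6, '03762')
print(parse_arxiv_filename("notes.pdf"))       # None
```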
In this post I show how to automate this process so that as soon as an arXiv PDF is saved to a directory, it is renamed to include the title in the filename.
Python rename script
At the heart of this automation is a Python script that accepts the file path as its argument, uses the arXiv Python API package to get the article title, and then renames the file to title_YYMM.number.pdf. The required arxiv package can be installed via
$ pip install arxiv
#!/usr/bin/env python3
import os
import re
import sys

import arxiv


def get_valid_filename(s):
    """
    Return the given string converted to a string that can be used for a clean
    filename. Remove leading and trailing spaces; convert other spaces to
    underscores; and remove anything that is not an alphanumeric, dash,
    underscore, or dot.
    >>> get_valid_filename("john's portrait in 2004.jpg")
    'johns_portrait_in_2004.jpg'

    https://github.com/django/django/blob/master/django/utils/text.py
    Copyright (c) Django Software Foundation and individual contributors.
    All rights reserved.
    """
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)


# Path of the downloaded PDF, passed as the first command-line argument
path = sys.argv[1]
basename = os.path.basename(path)
dirname = os.path.dirname(path)

# "YYMM.number.pdf" -> "YYMM.number"; splitext drops only the extension,
# whereas str.strip('.pdf') strips a set of characters from both ends
article_id = os.path.splitext(basename)[0]

# arxiv.query() (arxiv package 0.x API) returns a list of results
entry = arxiv.query(id_list=[article_id])[0]
title = entry['title']

title_slug = get_valid_filename(title)
new_basename = title_slug + '_' + basename
new_path = os.path.join(dirname, new_basename)
os.rename(path, new_path)
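Newer releases of the arxiv package (1.0 and later, as far as I know) replaced arxiv.query with Search and Client classes. The following is a sketch of the same rename flow under that assumption, not a drop-in replacement I have tested against every release. The function names fetch_title, build_new_path and rename_with_title are my own, and the title in the dry run is made up.

```python
import os
import re


def get_valid_filename(s):
    """Same sanitising rule as in the script above."""
    s = str(s).strip().replace(" ", "_")
    return re.sub(r"(?u)[^-\w.]", "", s)


def fetch_title(article_id):
    """Look up a title via the Search/Client interface (needs network access)."""
    import arxiv  # third-party; imported lazily so the dry run works without it

    client = arxiv.Client()
    search = arxiv.Search(id_list=[article_id])
    result = next(client.results(search))
    return result.title


def build_new_path(path, title):
    """Return the renamed path: <dir>/<title slug>_<original basename>."""
    dirname, basename = os.path.split(path)
    return os.path.join(dirname, get_valid_filename(title) + "_" + basename)


def rename_with_title(path):
    """Rename an arXiv PDF so that its title precedes the original name."""
    article_id = os.path.splitext(os.path.basename(path))[0]
    new_path = build_new_path(path, fetch_title(article_id))
    os.rename(path, new_path)
    return new_path


# Dry run of the naming step with a made-up title; no network or file needed
print(build_new_path("1234.56789.pdf", "An Example: Title, with Punctuation?"))
# An_Example_Title_with_Punctuation_1234.56789.pdf
```

Keeping the name-building step separate from the lookup makes it possible to check the resulting filenames before letting the script rename anything.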
Automatically invoking the rename script
On Linux we can use incron to automatically invoke the rename script when a new file is added to a directory. incron is similar to cron, but instead of running commands based on time, it runs commands based on filesystem events. It is typically not installed on Linux distros by default but can be found in most package managers. For example, in Debian-based distros it can be installed via:
$ sudo apt-get install incron
Note that installing the package does not necessarily start the daemon, so make sure the service is running:
$ systemctl enable incron.service  # Start incron at boot time
$ systemctl start incron.service   # Start incron now
incron is driven by a table where each line has the following format:
watched_path event_type command
The table can be edited with
$ incrontab -e
Add the following line, replacing the paths to match the directory where you store your PDFs and the location of the rename script:
/PATH/TO/DIR IN_CREATE [[ $# =~ [0-9]+\.[0-9]+\.pdf ]] && /PATH/TO/rename_arxiv $@/$#
The IN_CREATE event fires when a new file or directory is created in the watched directory. The regular expression test in the first part of the command ensures we only call the script for files that match the arXiv filename pattern.
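The filename test can be tried outside incron. The sketch below mirrors the bash =~ check in Python using re.search, which, like =~, matches anywhere in the string. The helper name looks_like_arxiv_pdf and the sample filenames are illustrative only.

```python
import re

# Same pattern as in the incrontab line; unanchored, as with bash's =~
ARXIV_NAME = re.compile(r"[0-9]+\.[0-9]+\.pdf")


def looks_like_arxiv_pdf(filename):
    """Return True if the filename would pass the incrontab regex test."""
    return ARXIV_NAME.search(filename) is not None


for name in ["1706.03762.pdf", "thesis.pdf", "1706.03762.pdf.part"]:
    print(name, looks_like_arxiv_pdf(name))
# 1706.03762.pdf True
# thesis.pdf False
# 1706.03762.pdf.part True
```

Because the pattern is unanchored, a name such as 1706.03762.pdf.part also passes. Anchoring the pattern with ^ and $ would tighten the check if that matters in your setup.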
incron provides some wildcards that can be used in the command section. The above command uses the following:
$@ expands to the watched directory path
$# expands to the event-related filename
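To see what the script ends up receiving, the substitution can be imitated in a few lines of Python. This is only an illustration of the two wildcards used here, not incron's actual implementation. The function name expand_wildcards and the example paths are hypothetical, and the sketch ignores quoting and escaping.

```python
import re


def expand_wildcards(command, watched_path, filename):
    """Expand $@ (watched path) and $# (event filename) in a single pass."""
    values = {"$@": watched_path, "$#": filename}
    return re.sub(r"\$[@#]", lambda m: values[m.group(0)], command)


command = "/PATH/TO/rename_arxiv $@/$#"
print(expand_wildcards(command, "/home/user/papers", "1706.03762.pdf"))
# /PATH/TO/rename_arxiv /home/user/papers/1706.03762.pdf
```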
What about macOS?