14

I am subscribed to the arXiv daily digests for my subject area. I find myself, every day, doing the following, when I get to work:

1) Browsing through the arXiv email and right-clicking on the articles of interest to me (to open each of the pages in a different tab in my browser)

2) For each of these pages, downloading the pdf by clicking on the link, re-naming it <article-title> by <author name(s)>.pdf, and saving the pdf in a directory on my PC

[3) the directory is then automatically synched to my tablet during the day]

4) Looking at the articles on my commute home on my tablet.

I am bored of doing (2). It only takes 10-20 seconds per article, depending on how many carriage returns or inappropriate characters I have to remove from the cut-and-pasted article title and author names (I am a mathematician, and math symbols in titles do not cut and paste well). But when I'm interested in 5 articles and one has a lengthy title with carriage returns and symbols in it, my mind wanders and I start wondering whether someone has already automated this. It seems to me that there is no obstruction to doing so in principle, but I would not be capable of doing it myself. Does such an automation exist?

jakebeal
eric
  • Well, computers exist, therefore it can be automated. Whether or not this would take more time and be a greater headache than to just keep doing it manually is another matter. You'd be surprised how many days can sometimes be put into coding up something that saves a matter of seconds. I have no idea if the automation exists. Last time I had a functioning smart phone, there was an app that was supposed to automatically download papers meeting certain criteria, but it wasn't functioning and (clearly) out of date. So...try a google search? – zibadawa timmy Oct 07 '15 at 11:24
  • 8
    Be aware that arXiv has a rate measuring facility they use to lock suspected bots out. And that you have to interact with arXiv personnel to get un-blocked. You don't want to be too successful at automating this procedure. – dmckee --- ex-moderator kitten Oct 07 '15 at 19:09
  • Thanks for the answers. Term has just started here and I'm quite busy but I do intend to come back to this and accept an answer once I've managed to weigh up my options. – eric Oct 13 '15 at 15:17
  • I have used both Google Alerts and Mention for getting alerts about certain topics and changes to certain websites. For downloading multiple files I have used the Firefox extension Down Them All. I'm not putting these in an answer because I haven't tested them with arXiv. – aparente001 Apr 25 '18 at 15:11
  • 2
    @eric Must have been a long term – Azor Ahai -him- Sep 13 '21 at 21:08

9 Answers

9

Here you go!

Take any of the new or recent links from https://arxiv.org/ and substitute it under Settings.

#!/usr/bin/python3
# encoding=utf8

import os
import re
import subprocess
import sys
import urllib.request as urllib2
import urllib.parse
from bs4 import BeautifulSoup

version = 1.0

arguments = {}
arguments['-h, --help'] = 'Print help'
arguments['-v, --version'] = 'Print Version'

# ================== Settings ====================

url = "https://arxiv.org/list/astro-ph/new"

# ================================================

class color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'

if __name__ == "__main__":

    # =============== Argument parser=================

    if any(arg in sys.argv for arg in ['-v', '--version']):
        print(version)
        sys.exit(0)

    if any(arg in sys.argv for arg in ['-h', '--help']):

        name = os.path.basename(sys.argv[0])

        # Display help
        print("This is {program}. Get your daily arXiv-dose.\n".format(program=name))
        print("Usage: ./{program}".format(program=name))
        print("Currently I'm fetching", url, '\n')

        for key in arguments:
            print("\t{:15}: {}".format(key, arguments[key]))

        sys.exit(0)

    # ================================================

    # ============ Generate and fetch url ============

    try:
        req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urllib2.urlopen(req)

    except urllib2.HTTPError:
        print('"{}" not found. Correct spelling?'.format(url))
        sys.exit(1)

    # ================================================

    # ================= Find papers ==================

    soup = BeautifulSoup(html, "lxml")

    articles = {}

    # Get DOI and URL

    papers = soup.find_all("dt")

    for nnn, c in enumerate(papers):

        articles[nnn] = {}

        doi = c.find_all("a", title="Abstract")[0]
        doi = doi.get_text()
        articles[nnn]["doi"] = doi

        link = c.find_all("a", title="Download PDF")[0].get("href")
        articles[nnn]["url"] = 'https://arxiv.org' + link

    # Get Title, Authors and Abstract

    meta = soup.find_all("div", class_="meta")

    for nnn, c in enumerate(meta):

        title = c.find("div", class_="list-title")
        title = title.get_text().replace('Title: ', '')
        articles[nnn]["title"] = title.strip()

        authors = c.find("div", class_="list-authors")
        authors = authors.get_text().replace('Authors:', '').replace('\n', '')
        authors = re.sub(r'[a-zA-Z]+\.+ ', '', authors)
        articles[nnn]["authors"] = authors.strip()

        try:
            abstract = c.find("p", class_="mathjax").get_text().replace('\n', ' ')
        except AttributeError:
            # No abstract on this entry: do not reuse the previous paper's abstract
            abstract = ''

        articles[nnn]["abstract"] = abstract

    # List findings

    for paper in articles:

        print('\n' + color.BOLD + color.UNDERLINE + '{:5}'.format(paper) + color.END,
              articles[paper]["title"])
        print(6 * ' ' + articles[paper]["authors"], '\n')
        print(' ' + articles[paper]["abstract"])

    # Get user input list

    while True:

        download = input('\n' + color.BOLD + 'Download (2 12 ..): ' + color.END)

        try:
            download = [int(i) for i in download.split()]
            break

        except ValueError:
            print('Not a valid list: "{}"'.format(download))

    for file in download:

        url = articles[file]["url"]
        filename = '{}-{}-{}.pdf'.format(articles[file]["title"], articles[file]["authors"], articles[file]["doi"])

        # EXT4 limits filenames to 255 characters

        if len(filename) > 254:

            filename = articles[file]["title"] + '-'

            for author in articles[file]["authors"].split():
                if len(author) + len(filename) + len(articles[file]["doi"]) + 5 < 255:
                    filename += author.strip()

            filename = filename[:-1] + '-' + articles[file]["doi"] + ".pdf"
            print(color.BOLD + 'Warning:' + color.END + 'Too many authors for |filename| < 256.')
            print('Truncating to ', filename)

        # Download
        subprocess.call(["wget", '--quiet', '--show-progress', '--header', "User-Agent: Mozilla/5.0", "--output-document", '{}'.format(filename), url])

This will give you a complete list with title, author and abstract. You can then enter a list of numbers to download as {title}-{authors}-{doi}.pdf.

[...]

120 Flavours in the box of chocolates: chemical abundances of kinematic substructures in the nearby stellar halo Jovan Veljanoski, Amina Helmi

Different subtleties and problems associated with a nonrelativistic limit of the field theory to the Schroedinger theory are discussed. In this paper, we revisit different cases of the nonrelativistic limit of a real and complex scalar field in the level of the Lagrangian and the equation of motion. We develop the nonrelativistic limit of the Dirac equation and action in the way that the nonrelativistic limit of spin-$\frac{1}{2}$ wave functions of particles and antiparticles appear simultaneously. We study the effect of a potential like $U(\phi)\propto \phi^4$ which can be attributed to axion dark matter field in this limit. We develop a formalism for studying the nonrelativistic limit of antiparticles in the quantum mechanics. We discussed the non-local approach for the nonrelativistic limit and its problems.

121 The Masses and Accretion Rates of White Dwarfs in Classical and Recurrent Novae Michael Shara, Dina Prialnik, Yael Hillman, Attay Kovetz

Different subtleties and problems [...]

Download (2 12 ..):
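Titles taken from arXiv often contain line breaks, LaTeX markup, or characters such as "/" that are awkward in filenames. Below is a minimal sketch of a clean-up step that could run before the download. The helper name safe_filename and the replacement rules are assumptions for illustration and are not part of the script above; max_len counts characters, not bytes.

```python
import re
import unicodedata


def safe_filename(title, authors, arxiv_id, max_len=255):
    """Build '<title>-<authors>-<id>.pdf' from cleaned-up parts (hypothetical helper)."""

    def clean(text):
        # Normalise Unicode, then collapse line breaks and runs of whitespace.
        text = unicodedata.normalize("NFKC", text)
        text = re.sub(r"\s+", " ", text).strip()
        # Replace characters that are invalid or awkward in filenames.
        return re.sub(r'[<>:"/\\|?*$]', "_", text)

    suffix = "-" + clean(arxiv_id) + ".pdf"
    stem = clean(title) + "-" + clean(authors)
    # Truncate the stem, not the suffix, so the identifier always survives.
    return stem[: max_len - len(suffix)].rstrip() + suffix
```

Truncating only the title and author part keeps the identifier at the end of the name, so a shortened file can still be traced back to its arXiv entry.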


Installation

  1. Save the script as arxiv in /usr/local/bin
  2. chmod +x /usr/local/bin/arxiv

You should now be able to run it by typing arxiv in a terminal.

Requirements

  • python3
  • python-beautifulsoup4 and python-lxml, since the script uses the "lxml" parser (pip install bs4 lxml if you use pip)
  • wget

Current version of this script is available here.

Mateen Ulhaq
Suuuehgi
  • 1
    Neat! You should add a more user-friendly explanation on how to use it: not everyone is familiar with scripts, just give a couple of words on the need for a terminal and python. – Clément Apr 25 '18 at 14:31
  • 1
    Thank you for the hint! I added a short explanation (to this [now MWE] version). I also rewrote some parts and added a few functions. The setup has now been moved to a separate config file ~/.config/arxiv.conf configurable via the script. You find it under the github link at the end of the post. When I find the time, I'll add it to AUR. – Suuuehgi May 06 '18 at 15:39
7

You can use JabRef for this. More precisely, there are plugins by Christoph Lehner that do the job:

  • arxiv-rss to browse the list of new preprints and import the ones you want;
  • localcopy for downloading the PDF and automatically renaming it (according to a pattern you can define).

This isn't 100% automated (you still need to manually click "download arXiv PDF" once you've imported the entry in your bib file), but this is still much better than doing it all by hand.

4

The following script uses the arxiv Python package to download PDFs with friendly filenames.

Features:

  • Inputs: URLs or .pdf files.
  • Outputs: corrected .pdf files.
  • Corrects .pdf files that have already been downloaded!

Script:

import os, re, sys, arxiv


def fix_title(title: str) -> str:
    # Join titles wrapped over several lines, then replace characters
    # that are invalid in filenames on common operating systems.
    return re.sub(r'[<>:"/\\|?*]', "_", re.sub(r"\s*\n+\s*", " ", title))


def paper_to_filename(paper: arxiv.Result) -> str:
    author_str = str(paper.authors[0]) + " et al." * (len(paper.authors) > 1)
    return f"{author_str} - {fix_title(paper.title)}.pdf"


def parse_line(line: str):
    # Accepts a URL, a filename, or a bare ID, one per line.
    m = re.match(r".*(?P<paper_id>\d{4}\.\d{4,6}(v\d+)?)(\.pdf)?$", line)
    return m.group("paper_id") if m is not None else None


paper_ids = [parse_line(line.strip()) for line in sys.stdin.readlines()]
paper_ids = [x for x in paper_ids if x is not None]
papers = arxiv.Search(id_list=paper_ids).results()

for paper, paper_id in zip(papers, paper_ids):
    src_filename = f"{paper_id}.pdf"
    dst_filename = paper_to_filename(paper)
    if os.path.exists(dst_filename):
        print(f"[TargetExists] {dst_filename}")
    elif os.path.exists(src_filename):
        print(f"[Rename] {src_filename}")
        os.rename(src_filename, dst_filename)
    else:
        print("[Download]")
        paper.download_pdf(filename=dst_filename)
    print(f"file: {dst_filename}")
    print(f"url: {paper.entry_id}")
    print(f"authors: {[str(x) for x in paper.authors]}")
    print(f"title: {paper.title}\n")

Dependencies:

pip install arxiv==1.4.2

Example 1:

Automatically correct all PDF filenames inside a directory:

λ ls
1506.02640.pdf

λ ls *.pdf | python arxiv_downloader.py
[Rename] 1506.02640.pdf
file: Joseph Redmon et al. - You Only Look Once_ Unified, Real-Time Object Detection.pdf
url: http://arxiv.org/abs/1506.02640v5
authors: ['Joseph Redmon', 'Santosh Divvala', 'Ross Girshick', 'Ali Farhadi']
title: You Only Look Once: Unified, Real-Time Object Detection

λ ls
Joseph Redmon et al. - You Only Look Once_ Unified, Real-Time Object Detection.pdf
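If the directory also holds PDFs that did not come from arXiv, it may help to select only the files still named after an arXiv ID before piping them to the script. This is a small sketch under that assumption; the helper name find_unrenamed_pdfs is hypothetical, and the pattern covers new-style identifiers only.

```python
import re
from pathlib import Path

# New-style arXiv identifiers look like 1506.02640 or 1502.03167v3.
ARXIV_PDF = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?\.pdf$")


def find_unrenamed_pdfs(directory):
    """Return sorted names of PDFs in `directory` still named after their arXiv ID."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and ARXIV_PDF.match(p.name)
    )
```

Printing the returned names one per line gives input in the same shape that ls *.pdf produces in the example.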

Example 2:

Manually provide your own list of URLs, filenames, or paper IDs:

λ wget -nc https://arxiv.org/pdf/1506.02640.pdf
λ wget -nc https://arxiv.org/pdf/1502.03167v3.pdf

λ echo "
https://arxiv.org/abs/2002.00157
https://arxiv.org/pdf/1805.11604
1506.02640.pdf
1502.03167v3
" | python arxiv_downloader.py

[Download]
file: Mateen Ulhaq et al. - Shared Mobile-Cloud Inference for Collaborative Intelligence.pdf
url: http://arxiv.org/abs/2002.00157v1
authors: ['Mateen Ulhaq', 'Ivan V. Bajić']
title: Shared Mobile-Cloud Inference for Collaborative Intelligence

[Download]
file: Shibani Santurkar et al. - How Does Batch Normalization Help Optimization?.pdf
url: http://arxiv.org/abs/1805.11604v5
authors: ['Shibani Santurkar', 'Dimitris Tsipras', 'Andrew Ilyas', 'Aleksander Madry']
title: How Does Batch Normalization Help Optimization?

[Rename] 1506.02640.pdf
file: Joseph Redmon et al. - You Only Look Once_ Unified, Real-Time Object Detection.pdf
url: http://arxiv.org/abs/1506.02640v5
authors: ['Joseph Redmon', 'Santosh Divvala', 'Ross Girshick', 'Ali Farhadi']
title: You Only Look Once: Unified, Real-Time Object Detection

[Rename] 1502.03167v3.pdf
file: Sergey Ioffe et al. - Batch Normalization_ Accelerating Deep Network Training by Reducing Internal Covariate Shift.pdf
url: http://arxiv.org/abs/1502.03167v3
authors: ['Sergey Ioffe', 'Christian Szegedy']
title: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Mateen Ulhaq
  • There is a filename with a colon in your example; I'm hoping the actual package doesn't do this... – darij grinberg Sep 13 '21 at 19:19
  • 1
    @darijgrinberg You can replace invalid characters for your operating system with "_" by modifying the fix_title function. I have also updated the script to work with the latest arxiv==1.4.2 release. – Mateen Ulhaq Sep 14 '21 at 00:24
3

IMO, the way to implement this would be as a browser or e-mail client extension. Personally, I have a subscription for e-mail alerts because it lets me select subject areas. Recently I looked into extensions to make that more easily readable, so I searched the Firefox and Thunderbird extensions for things related to arXiv, but my search turned up nothing interesting.

Of course, it could exist e.g. as an extension for some other browser; you know, it is tough to prove nonexistence.

Given the arXiv ID, an "external" solution should be pretty easy (getting the PDF is a simple matter of wget http://arxiv.org/pdf/$ID, and extracting the title and authors from the abs/$ID page should be simple enough as well). However, the question is what you gain by that.
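As a rough illustration of the metadata step, here is a sketch that reads the title and authors from the Atom feed returned by the arXiv API instead of scraping the abs/$ID page. The function names are made up for this example, and fetch_metadata needs network access; only the parsing part is exercised offline.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_entry(atom_xml):
    """Extract (title, authors) from the first entry of an Atom feed."""
    entry = ET.fromstring(atom_xml).find(ATOM + "entry")
    # Titles may be wrapped over several lines; collapse the whitespace.
    title = " ".join(entry.findtext(ATOM + "title").split())
    authors = [a.findtext(ATOM + "name") for a in entry.findall(ATOM + "author")]
    return title, authors


def fetch_metadata(arxiv_id):
    """Query the arXiv API for one identifier (requires network access)."""
    url = "http://export.arxiv.org/api/query?id_list=" + arxiv_id
    with urllib.request.urlopen(url) as response:
        return parse_entry(response.read())
```

With the title and authors in hand, building the filename and fetching the PDF with wget is all that remains.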

Instead, I would recommend Zotero as an alternative type of solution. It lets you save article metadata from arXiv and many other sources in the click of a button. It can also download and archive PDFs automatically, on your PC or on their server (where you get a very limited amount of space for free and can pay to get more).

xebtl
  • 2
    arXiv has an API, so at least use that instead of parsing the HTML. Anyway, as you can see with my answer, there are tools designed for this already... –  Oct 07 '15 at 11:49
  • @NajibIdrissi Thanks for the pointer to the API, that is a good point. – xebtl Oct 07 '15 at 11:53
3

Try Zotero which can save multiple items from certain Web pages (you'd have to view the list in a browser).

WBT
1

This is probably not quite the right fit for your use case, but it should be mentioned: arXiv does have several relevant help pages:

The TL;DR is that there are various interfaces for programmatic access to arXiv, but indiscriminate mass downloads will be blocked.
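Any script that fetches several papers in a row should therefore pause between requests. A minimal sketch of such a throttle follows; the class name Throttle is an assumption for illustration, and the right interval should be taken from arXiv's current guidance rather than from this example.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() immediately before each download, with for example throttle = Throttle(4) to match the 4-second rest used in another answer here, keeps the request rate bounded without slowing down the first request.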

Thomas
1

If you prefer a browser plugin to automatically rename downloaded PDFs while browsing, you may want to install arxiv-utils (I'm the author). The plugin simply adds a Direct Download button in arxiv web pages, and the naming convention is customizable.


J3soon
1

Here is my Python script that searches arXiv for gravitational wave (GW) related papers within the dates given as input. It extracts the title, abstract, arXiv ID and author names of each paper and writes them to a text file. It can be customized for other fields by changing the URL. The default search URL uses:

# Default:
## Search terms - gravitational wave (all fields) or eccentric signal (all fields)
## Search subject - physics - including cross-listed papers
## Search date - <according to input> - submission date (most recent)
## Abstract - show, size - 200

For any other field:

Visit https://arxiv.org/search/advanced and, the first time,

fill in the search form with the required terms, subjects and dates, then

run the search, copy the resulting URL, paste it into the script below, and replace the dates in the URL with {} so that custom dates can be formatted in (as below).

Main Script

#!/path/anaconda3/bin/python
# This script searches arXiv for GW-related papers in the given date range
# and writes a summary of each paper to a text file.
# By- Anik Mandal

from bs4 import BeautifulSoup  # You need to have this library in your conda environment
import requests
import datetime as dt
import os

print("--------------->>>>>|GW NEWS|<<<<<--------------------------------------------------------------------")
print("(This script will search gw papers on arxiv based on input dates)\n")
print("Write a period type [last_day, last_week] (Press <Enter> for custom date search)")
action = input(">>>")

if action == "last_day":
    to_ = dt.date.today()
    from_ = to_ - dt.timedelta(days=1)
    file_name = "arxiv_gwdaily_{}.txt".format(from_)

elif action == "last_week":
    to_ = dt.date.today()
    from_ = to_ - dt.timedelta(days=7)
    file_name = "arxiv_gwweekly_{}_{}.txt".format(from_, to_)

else:
    from_ = input("From date [YYYY-MM-DD] : ").split("-")
    to_ = input("To date [YYYY-MM-DD] : ").split("-")
    from_ = dt.date(int(from_[0]), int(from_[1]), int(from_[2]))
    to_ = dt.date(int(to_[0]), int(to_[1]), int(to_[2]))
    file_name = "arxiv_gw_{}_{}.txt".format(from_, to_)

file = open(file_name, "w")
file.write("--------------->>>>>|GW NEWS|<<<<<--------------------------------------------------------------------\n")
file.write("-----|From : "+str(from_)+"\t To: "+str(to_)+"|--------------------------------------------------------------\n\n\n\n")


def gw_news(from_, to_):
    url = "https://arxiv.org/search/advanced?advanced=1&terms-0-operator=AND&terms-0-term=gravitational+wave&terms-0-field=all&terms-1-operator=OR&terms-1-term=eccentric+signal+&terms-1-field=all&classification-physics=y&classification-physics_archives=all&classification-include_cross_list=include&date-year=&date-filter_by=date_range&date-from_date={}&date-to_date={}&date-date_type=submitted_date&abstracts=show&size=200&order=-announced_date_first".format(from_, to_)

    try:
        source = requests.get(url)
        source.raise_for_status()

        soup = BeautifulSoup(source.text, "html.parser")

        papers = soup.find("ol", class_="breathe-horizontal").find_all("li", class_="arxiv-result")
        file.write("No. of papers : {}\n".format(len(papers)))
        file.write("_______________________________________________________________________________________________________\n")

        i = 1
        for paper in papers:
            title = paper.find("p", class_="title is-5 mathjax").text[8:-10]
            abstract = paper.find("p", class_="abstract mathjax").find("span", class_="abstract-full has-text-grey-dark mathjax").text[:-7]
            id_ = paper.find("p", class_="list-title is-inline-block").a.text
            author = paper.find("p", class_="authors").text[10:].replace("\n", "").replace("    ", "")

            file.write("#"+str(i)+"."+title+"\n")
            file.write(abstract+"\n")
            file.write(id_+"\t\t"+author+"\n \n")
            file.write("_______________________________________________________________________________________________________\n")

            i = i+1

        print("Done! Data stored in /arXiv/{}".format(file_name))

    except Exception:
        # Reached when the request fails or the result list is missing from the page
        print("No new paper during this period!")
        os.remove(file_name)

    return url


url = gw_news(from_, to_)

file.write("\n\n")
file.write("_______________________________________________________________________________________________________\n")
file.write("Site url : "+url+"\n")
file.write("_______________________________________________________________________________________________________\n")
file.close()
print("\nSite url : ", url)

Make sure your PC is connected to the internet while the script is running.

Automation using Crontab

This script can be run automatically every day by setting the action to last_day (hard-coded in place of the input() prompt, since a cron job has no one to answer it) and adding the script to crontab. Here is one way to do it:

Open the crontab window,

$ crontab -e

add a new cronjob and save it.

0 9 * * * /path/arXiv/gwarxiv_search.py

The script will then run automatically every day at 9:00 AM and store the output text file in the respective folder.
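An alternative to editing the script for cron is to take the period from the command line and fall back to a default. The following is a rough sketch, not part of the script above; the function names date_range and parse_period are assumptions.

```python
import argparse
import datetime as dt


def date_range(period, today=None):
    """Return (from_date, to_date) for 'last_day' or 'last_week'."""
    today = today or dt.date.today()
    days = {"last_day": 1, "last_week": 7}[period]
    return today - dt.timedelta(days=days), today


def parse_period(argv=None):
    """Read the period from the command line, defaulting to 'last_day'."""
    parser = argparse.ArgumentParser(description="Choose the search period.")
    parser.add_argument("period", nargs="?", default="last_day",
                        choices=["last_day", "last_week"])
    return parser.parse_args(argv).period
```

With something like this in place, the same file could serve both jobs: the crontab entry would pass last_day or last_week as an argument, and no prompt would be needed.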

Using a Desktop application

For that, you have to write a shell script that runs the search script and opens the most recent output file. Here is example code:

#!/bin/bash

#-----RUN THE SEARCH SCRIPTS---------------------------------------------------------------------------
source /path/anaconda3/etc/profile.d/conda.sh
conda activate <conda_environment>

# Get the current day of the week (0-6 where Sunday is 0)
day_of_week=$(date +%w)

# Check if it's Sunday (day_of_week equals 0)
if [ "$day_of_week" -eq 0 ]; then
    # Execute your desired commands or script here
    python /path/arXiv/gw_weekly/gwweekly.py
fi
python /path/arXiv/gw_daily/gwdaily.py

conda deactivate

#-----OPEN THE MOST RECENT FILE------------------------------------------------------------------------

# Define the paths to the folders
folder1="/path/arXiv/gw_daily"
folder2="/path/arXiv/gw_weekly"

# Find the most recent .txt files in each folder
most_recent_file1=$(ls -t "$folder1"/*.txt | head -n 1)
most_recent_file2=$(ls -t "$folder2"/*.txt | head -n 1)

# Open the most recent .txt files in gedit
gedit "$most_recent_file1" "$most_recent_file2"

#-----END----------------------------------------------------------------------------------------------

Make the shell script executable with:

$ chmod +x /path/gwarxiv_search.sh

Then add a new desktop entry (/home/Desktop/gwarxiv_search.desktop) that runs the above shell script (gwarxiv_search.sh) on click.

[Desktop Entry]
Name=GW NEWS
Comment=GW NEWS on arXiv 
Exec=/path/arXiv/utils/gwarxiv_search.sh
# add an icon of your choice
Icon=/path/arXiv/utils/arxiv_400x400.png
Terminal=false
Type=Application
Categories=Utility;

Also, make this desktop entry executable with:

$ chmod +x /home/Desktop/gwarxiv_search.desktop

Then right-click the application, choose Allow Launching, and it is done!
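If you would rather keep everything in Python, the "find the most recent file" step of the shell script can be sketched as below. The helper name most_recent is hypothetical; it picks the newest file by modification time, which is the same ordering ls -t uses.

```python
from pathlib import Path


def most_recent(folder, pattern="*.txt"):
    """Return the newest file in `folder` matching `pattern`, or None if there is none."""
    files = [p for p in Path(folder).glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)
```

Unlike the ls pipeline, this returns None for an empty folder instead of an empty string, which makes the "nothing new today" case easy to check.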

1

FWIW, my Python (2) code that downloads arXiv preprints (PDF and source) given a text file containing hyperlinks (usually, a copypasted arXiv digest from an email):

# batch-download arXiv preprints linked in a text file
# (meant for the emails that come from arXiv).
# Written for use in Cygwin or Linux; not sure how it
# behaves on a normal Windows python.
#
# Syntax:
# - "python arxdown.py mail.txt [folder]":
#   Downloads all arXiv preprints hyperlinked in "mail.txt"
#   into folder [folder].
# - "python arxdown.py https://arxiv.org/abs/1308.0047 [folder]":
#   Downloads https://arxiv.org/abs/1308.0047 into folder [folder].
# If [folder] is not specified, a default one is used.
# Even if the arXiv hyperlink comes with a version number,
# the script downloads the newest version by default; this
# behavior can be disabled with the "-u" switch.

import os
import urllib
import urllib2
import re
import time
import sys
import socket
import shutil
import string
import lxml.html
from unidecode import unidecode
import itertools

defaultpath = "/home/arxiv" # The path into which the downloads should go if no folder was specified.

arxivprefix = "http://arxiv.org" # Replace by one of the mirrors ( https://arxiv.org/help/mirrors ) if the main site is slow/down.

resting_time = 4 # time (in seconds) to wait between downloads; too small a number seems to get me banned.

args = sys.argv

if "-u" in args:
    # use version numbers provided
    newest = False
    args.remove("-u")
else:
    newest = True

if len(args) > 1:
    # args[1] may be either a file containing URLs, or a URL itself.
    try:
        # Is it a file?
        mail = open(args[1])
        proper_mail = True
    except IOError:
        # Nah.
        mail = [args[1]]
        proper_mail = False
    if len(args) > 2:
        # Whatever remains better be a path.
        tempdirname = args[2]
    else:
        tempdirname = defaultpath
else:
    print "no mail text or hyperlink given"
    sys.exit()

# create temporary folder for downloading, if not already existing.
try:
    os.mkdir(tempdirname)
except OSError:
    pass
os.chdir(tempdirname)

for line in mail:
    if "://arxiv.org/abs/" in line:
        # Which preprint to download?
        for arxid in line.split("://arxiv.org/abs/")[1:]:
            arxid = arxid.split(" ")[0].split("v")
            if len(arxid) > 1:
                arxid, vernum = arxid[:2]
            else:
                arxid = arxid[0]
                vernum = False
            arxid = arxid.strip()
            response = urllib2.urlopen(arxivprefix + "/abs/" + arxid)
            html = response.read().split("\n")
            # Which version to download?
            if (not newest) and vernum:
                vernum = "".join(itertools.takewhile(str.isdigit, vernum))
            else:
                for htmlline in html:
                    if "tablecell arxividv" in htmlline:
                        vernum = htmlline.split(arxid + "v")[1]
                        vernum = vernum.split("\"")[0]
                        break
            arxidv = arxid + "v" + vernum
            print "\n attacking ", arxidv
            # Build filename for the downloads.
            # I am being heavily conservative here; all kinds of
            # harmless symbols get kicked out.
            author_surnames = []
            valid_letters = string.ascii_lowercase + " -1234567890"
            for htmlline in html:
                if "citation_author" in htmlline:
                    auname = htmlline.split("citation_author\" content=\"")[1]
                    auname = auname.split(",")[0].lower()
                    auname = lxml.html.fromstring(auname).text_content()
                    auname = "".join([i for i in unidecode(unicode(auname.lower())) if i in valid_letters])
                    author_surnames.append(auname)
            author_list = "".join([author + " " for author in author_surnames])[:-1]
            print "authors: ", author_list
            for htmlline in html:
                if "citation_title" in htmlline:
                    title = htmlline.split("citation_title\" content=\"")[1]
                    title = title.split("\"")[0].lower()
                    title = lxml.html.fromstring(title).text_content()
                    title = "".join([i for i in unidecode(unicode(title.lower())) if i in valid_letters])[:75]
                    break
            arxidv_name = arxidv
            if "/" in arxidv_name:
                # This is some special-casing needed for old-style
                # arXiv IDs (such as math/0112073), since the slash
                # would confuse the file system.
                arxidv_name = arxidv_name.split("/")[1]
            resulting_filename = author_list + " - " + title + " - " + arxidv_name
            print "downloading as: ", resulting_filename
            # Downloading. The "while readsize" loop is meant to protect
            # against some temporary failures that haven't been occurring
            # lately.
            # Beware: It is stupid and might create an endless loop.
            readsize = 0
            while readsize == 0:
                urllib.urlretrieve(arxivprefix + "/pdf/" + arxidv, resulting_filename + ".pdf")
                readsize = os.stat(resulting_filename + ".pdf").st_size
                if readsize > 4500:
                    break
                testopen = open(resulting_filename + ".pdf")
                for line in testopen:
                    if "may take a little time" in line:
                        time.sleep(4)
                        readsize = 0
                        print "retrying..."
                        break
                else:
                    readsize = 6666
                testopen.close()
            urllib.urlretrieve(arxivprefix + "/e-print/" + arxidv, resulting_filename + ".tar.gz")
            print "\n resting..."
            time.sleep(resting_time)

if proper_mail:
    mail.close()
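For readers on Python 3, the link-extraction step of the script above could be sketched as follows. This is an illustrative rewrite under my own assumptions about the ID formats, not a port of the whole script; the names ABS_LINK and extract_ids are hypothetical, and keep_version plays the role of the "-u" switch.

```python
import re

# Matches new-style (1308.0047, 1506.02640v5) and old-style (math/0112073) IDs.
ABS_LINK = re.compile(
    r"arxiv\.org/abs/((?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?)"
)


def extract_ids(text, keep_version=False):
    """Return the arXiv IDs linked in `text`, in order, without duplicates."""
    seen, ids = set(), []
    for match in ABS_LINK.finditer(text):
        arxiv_id = match.group(1)
        if not keep_version:
            # Drop a trailing version suffix so the newest version is fetched.
            arxiv_id = re.sub(r"v\d+$", "", arxiv_id)
        if arxiv_id not in seen:
            seen.add(arxiv_id)
            ids.append(arxiv_id)
    return ids
```

Removing duplicates matters for digest emails, where the same abstract link can appear more than once, and it also avoids needless requests to arXiv.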
darij grinberg