How to automatically link LaTeX .bib references to a set of pdf full-text articles?

Question

I have a huge collection of PDFs of research papers. Many of these have valuable annotations. I also have a huge .bib file containing citations for these and many other works. Is there reference manager software into which I could import both the .bib file and the collection of PDFs, so that the entries in the .bib file are automatically linked to the corresponding PDFs? I would then like to use that tool to access my PDFs. I think this was a feature request for Mendeley a long time ago: http://feedback.mendeley.com/forums/4941-general/suggestions/80946-automatically-find-pdfs-link-them-to-imported-me . As far as I know, it has not been implemented. I tried Qiqqa (http://www.qiqqa.com/), but had no luck.

I think that this question is more relevant to Academia than to the TeX forum. I suspect the people at the TeX forum would consider it off-topic (see http://tex.stackexchange.com/questions/121961/free-reference-manager-and-pdf-organizer-for-windows#comment376604_121961). My original question did not have LaTeX in the title. LaTeX/.bib is not the only way to manage references, and it is easy to convert between reference formats. — Abhishek Anand, Mar 07 '14 at 00:00
Have you checked the Papers app? It may have what you are looking for. — seteropere, Mar 07 '14 at 00:02
If you are good with sqlite, you could import your bib files into Zotero and attach the files to your references by editing the relevant parts of the sqlite database. — , Mar 07 '14 at 00:30
@seteropere I've tried Papers a couple of times, but the last time I checked, its BibTeX support was weak at best. — Matthew G., Mar 07 '14 at 01:16
http://jabref.sourceforge.net/help/ExternalFiles.php says that you need to name each PDF after its BibTeX key. I want the tool to read inside the PDF file (title and authors) to do the matching. — Abhishek Anand, Mar 07 '14 at 01:43
We used some home-grown PHP scripts for this in my previous group. — gerrit, Mar 07 '14 at 16:40
Have you tried CiteULike? I have a feeling it may have what you are looking for. — dearN, Mar 09 '14 at 22:25
Although I'm sympathetic to this trouble, I think you're asking for a feature that likely doesn't exist. What assumptions can the tool make about your annotated PDF files? Is there some kind of key indicating where the Title/Author/DOI/etc. is stored in them? Probably not, because these are not standardized for PDFs or for academic articles. Yes, a search tool might find these strings, but it would have to be a pretty fuzzy data miner to find all the matches perfectly. Some PDF files use encodings that make searching virtually impossible. My advice: JabRef lets you link to them relative to the .bib file. — Fuhrmanator, Mar 16 '14 at 00:05
Also, I'd suggest the feature to the JabRef team, as I think with modern PDF it could be feasible, although probably not perfect. — Fuhrmanator, Mar 16 '14 at 00:07
@Fuhrmanator: PDF does have metadata. A random sampling of the PDF files I have in my reference library suggests perhaps at least 60 or 70% have the title of the article included. — Willie Wong, Apr 03 '14 at 11:15
@WillieWong The title is a start. But that's the easy part. My experiment with EndNote http://academia.stackexchange.com/a/18204/3859 shows that the metadata isn't sufficient or consistently used by research publishers (at least IEEE) to build a precise .bib file as the OP is asking. — Fuhrmanator, Apr 03 '14 at 15:00
@Fuhrmanator: sorry, I was just trying to contradict your statement "probably not" when it comes to whether there is some kind of key stored in the PDF file. And indeed, even given a .bib file that contains all the paper titles and a list of PDFs that store the paper titles in metadata, it is still a nontrivial task (one that would most likely involve some human intervention for ambiguous cases) to automatically match entries. But my understanding of the question is that a tool that does something is better than a tool that does nothing at all. — Willie Wong, Apr 03 '14 at 15:12
@Fuhrmanator the OP does not want to build a .bib file from scratch from the pdf metadata, they want to match an existing pdf with an existing bibtex entry. This is an easier task and leaves room for some heuristics. — Federico Poloni, Apr 13 '14 at 16:43

Abhishek Anand · Answer 1 · 2014-05-22T01:17:31.727

Here is one nearly automatic way to do it using Zotero (https://www.zotero.org/):

1) Import the PDFs into Zotero. One way is to select multiple PDFs and drag them into a collection (in the left-hand pane) of Zotero.

2) Select the PDF items (CTRL-click on Windows for multiple selections), right-click and select "Retrieve metadata from PDF". Note that this step searches online databases for missing information and seems fairly robust.

3) Import the .bib file into Zotero.

4) Go to the duplicates collection in the LHS panel and merge all the duplicates.

Issues:

1) In step 4, there may be false negatives if the automatically retrieved metadata (in step 2) is too different from the corresponding entry in the .bib file (step 3)

2) Step 2 might fail on old scanned PDFs.

score 4 · Answer 2 · answered Mar 07 '14 at 01:18


One option is BibDesk (OS X), which can track links between files and their associated citations.

Personally, I am not a fan of what it does to the .bib file, but it could suit your purpose.


Matthew G.


Thanks. I upvoted this answer, but I would have preferred a tool that works on Windows/Linux/Android. Do you think I could use BibDesk to do the association and then export the library to some other (cross-platform) tool in such a way that the new tool would import the file associations too? – Abhishek Anand Mar 07 '14 at 15:06

@Abhishek IIRC BibDesk stores file-system references as new fields inside citations, so yes. It's a pretty tight coupling, though. I'm really not the best person to ask; I have largely given up on automated citation tools until I can get around to building my dream tool :P – Matthew G. Mar 07 '14 at 15:18

score 1 · Answer 3 · answered Mar 15 '14 at 23:15

Try Tellico

A collection manager for Linux which "provides default templates for books, bibliographies, videos, music, video games, coins, stamps, trading cards, comic books, and wines."

The reference manual states the following:

"If Tellico was compiled with exempi or poppler support, metadata from PDF files can be imported. Metadata may include title, author, and date information, as well as bibliographic identifiers which are then used to update other information."

Is that useful? If so, you can check the Tellico site for reviews. It is available for the following:

Debian
Ubuntu
Gentoo
FreeBSD
openSUSE
PC-BSD
Fink (Mac OS X)
Fedora
Linux Mint
Pardus
ArchLinux

Fuhrmanator · Answer 4 · 2014-04-03T15:01:45.850

EndNote X7 has this feature, known as "PDF auto import."

I tried it and it got 0/3 of my sample PDFs correct. All three were IEEE conference papers originally downloaded from IEEE Xplore.

One of the three articles came closer to a correct reference (the other two were useless). That article, however, had PDF metadata visible in Acrobat Reader (Title, Author, Subject). EndNote got the page numbers (somehow) and the DOI right, but failed on the conference name and the reference type (it mistakenly treated the paper as a journal article).
