download pubmed PDFs using pubmed2pdf in python

Wooki · Oct-10-2020, 09:11 AM

Hello,
Can anyone help me with these problems?
I want to download papers from PubMed and save them as PDFs.
1. I downloaded the paper with PubMed ID 28019091 using the pubmed2pdf package. The command I ran (in the Windows 10 terminal) was:
python -m pubmed2pdf pdf --pmids="28019091"
The download appeared to succeed, but the PDF file can't be opened. This paper has a free PDF version on the PubMed website. I downloaded two other papers with the same command and could not open those PDFs either. So what's wrong with my command? I copied it from https://pypi.org/project/pubmed2pdf/
2. How can I download all of the chosen papers together using pubmed2pdf? The only relevant command I found at the link above is $ python3 -m pubmed2pdf pdf --pmidsfile="/my/path/to/the/file", but I don't know how to create a file containing a batch of PubMed IDs, or what type of file it should be.
I'd greatly appreciate any help.

ndc85430 · Oct-10-2020, 11:31 AM

For 1, can you give any more info? What happens when you run the command? Is there any output? What happens when you try to open the PDF?

For 2, you might just want to write a PowerShell script that iterates over the IDs and downloads them. I'm not a Windows user, but I'd do the equivalent on Linux and am assuming the shell on Windows has scripting capabilities.
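
The same loop can also be written in Python rather than PowerShell. Here is a minimal sketch, assuming the IDs are kept in a plain text file with one PMID per line. The file layout and the helper names are made up for this example and are not part of pubmed2pdf; the only pubmed2pdf call used is the one from the question.

```python
import subprocess
import sys
from pathlib import Path


def read_pmids(path):
    """Return the non-empty, stripped lines of a text file (one PMID per line)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_command(pmid):
    """Build the pubmed2pdf call from the question for a single PMID."""
    # sys.executable is the Python interpreter running this script.
    return [sys.executable, "-m", "pubmed2pdf", "pdf", f"--pmids={pmid}"]


def download_all(path):
    """Run pubmed2pdf once per PMID and report failures without stopping."""
    for pmid in read_pmids(path):
        result = subprocess.run(build_command(pmid))
        if result.returncode != 0:
            print(f"pubmed2pdf exited with an error for PMID {pmid}")
```

Calling download_all("pmids.txt") (a hypothetical file name) would then run the command once for each ID in that file. Note that a zero exit code only means the command finished, not that the saved file is a valid PDF.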

jefsummers · Oct-10-2020, 11:44 PM

Familiar with PubMed, but not with this app. How would it get the list of selected articles?
What is the size of the downloaded pdf?

Wooki · Oct-12-2020, 12:55 AM

(Oct-10-2020, 11:31 AM)ndc85430 Wrote: For 1, can you give any more info? What happens when you run the command? Is there any output? What happens when you try to open the PDF?

Yes, everything seemed to go well when I ran the command. Here is the full output:
C:\Users\lenovo>python -m pubmed2pdf pdf --pmids="28019091"
2020-10-12 08:08:46,450 - INFO - pubmed2pdf.utils - Trying to fetch pmid 28019091
Done downloading. All downloaded can be found in C:\Users\lenovo\pubmed2pdf

The PDF file was in the default folder, but when I tried to open it in Acrobat Reader, a pop-up window appeared: 'Acrobat Reader failed to open "28019091.pdf" because this kind of file is not supported or the file is corrupted (e.g. the file was sent as an email attachment but not decoded correctly)'. Following jefsummers's prompt, I noticed that the downloaded PDF is only 30 KB, whereas the copy of 28019091 downloaded directly from PubMed is 408 KB.

(Oct-10-2020, 11:31 AM)ndc85430 Wrote: For 2, you might just want to write a PowerShell script that iterates over the IDs and downloads them. I'm not a Windows user, but I'd do the equivalent on Linux and am assuming the shell on Windows has scripting capabilities.

That sounds difficult for me, but I'll try it.

jefsummers · Oct-12-2020, 09:12 PM

Played with this some, including the code that it was cloned from. PubMed has changed its interface recently. What you get in your 30 KB file is HTML and JavaScript rather than a PDF, and it is not the article.
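
For anyone who wants to check their own downloads: a PDF file normally begins with the bytes %PDF-, so a saved web page is easy to spot. A small sketch follows; the function names are invented for this example, and the check only looks at the first five bytes, so it does not prove the rest of the file is intact.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature b'%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def report_bad_downloads(folder):
    """Return the sorted names of .pdf files in a folder that lack the PDF signature."""
    return sorted(
        p.name for p in Path(folder).glob("*.pdf") if not looks_like_pdf(p)
    )
```

Running report_bad_downloads on the pubmed2pdf output folder would list the files that are really HTML (or anything else) saved under a .pdf name.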

Sorry, this isn't the way to do it.

J

Wooki · Oct-13-2020, 02:35 AM

(Oct-12-2020, 09:12 PM)jefsummers Wrote: Played with this some, including the code that it was cloned from. PubMed has changed its interface recently. What you get in your 30 KB file is HTML and JavaScript rather than a PDF, and it is not the article.

Sorry, this isn't the way to do it.

J

Thank you so much for your answer. By the way, how do you get those PDFs? Can you share some methods? Thanks in advance.

jefsummers · Oct-13-2020, 04:05 PM

I click on the link on the page for the article. Locate the article by search (or if you have it, by PMID) and if the full text is available there will be a link on the right side of the page.

For example, on the PubMed page, search for Coronavirus Covid-19. The top article (today anyway) is "recent trends". Beside the PMID it says Free Article. Click the title of the article and you get the abstract and related articles. On the right side it says "Free Full Text". Click that and you get the full article as a web page, and at the bottom of that page is a link to get the PDF. OK, a lot of steps, but it works for the articles that are available free.

A word of explanation for anyone else who may be reading this and is curious. The (US) National Institutes of Health runs the National Library of Medicine, which indexes the medical literature (not all of it, but pretty much all that is significant). PubMed is a search engine designed to search that index. You can just search by word, or you can use special headings to limit the search, such as hydroxychloroquine with a MeSH (Medical Subject Headings) subheading of therapeutic use.

Wooki · Oct-19-2020, 05:50 AM

(Oct-13-2020, 04:05 PM)jefsummers Wrote: I click on the link on the page for the article. Locate the article by search (or if you have it, by PMID) and if the full text is available there will be a link on the right side of the page.

For example, on the PubMed page, search for Coronavirus Covid-19. The top article (today anyway) is "recent trends". Beside the PMID it says Free Article. Click the title of the article and you get the abstract and related articles. On the right side it says "Free Full Text". Click that and you get the full article as a web page, and at the bottom of that page is a link to get the PDF. OK, a lot of steps, but it works for the articles that are available free.

A word of explanation for anyone else who may be reading this and is curious. The (US) National Institutes of Health runs the National Library of Medicine, which indexes the medical literature (not all of it, but pretty much all that is significant). PubMed is a search engine designed to search that index. You can just search by word, or you can use special headings to limit the search, such as hydroxychloroquine with a MeSH (Medical Subject Headings) subheading of therapeutic use.

Thank you, jefsummers.

I know this method, but it is inefficient when there are many articles to download, so I want to do the downloading with Python.

jefsummers · Oct-19-2020, 03:06 PM

Since the articles are stored on the publishers' sites rather than at PubMed (NLM), you will need to scrape the full-text address from the PubMed page and then follow that link to the publisher (Elsevier, for example).
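
As a possible starting point for finding the publisher address without scraping the PubMed page itself, NCBI's E-utilities elink endpoint with cmd=prlinks is documented to return the primary LinkOut provider link for a PMID. This is a rough sketch only: the function names are invented for this example, the parsing assumes the XML response carries the links in Url elements, and the publisher page it leads to may still need a subscription or further scraping before you reach the PDF.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_prlinks_url(pmid):
    """Build an elink request asking for the primary LinkOut provider of a PMID."""
    params = {"dbfrom": "pubmed", "id": str(pmid), "cmd": "prlinks"}
    return EUTILS_ELINK + "?" + urllib.parse.urlencode(params)


def extract_urls(xml_data):
    """Collect the text of every <Url> element in an XML response (str or bytes)."""
    root = ET.fromstring(xml_data)
    return [el.text.strip() for el in root.iter("Url") if el.text and el.text.strip()]


def publisher_links(pmid):
    """Fetch the elink response for one PMID and return any URLs found in it."""
    with urllib.request.urlopen(build_prlinks_url(pmid), timeout=30) as response:
        return extract_urls(response.read())
```

publisher_links("28019091") would make one network request. If you loop over many PMIDs, pause between calls, since NCBI asks clients without an API key to stay at or below three requests per second.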
