Retrieve URL's from a pdf - Printable Version

Retrieve URL's from a pdf - AjayBachu - May-13-2020

Hi Team,

I have a requirement to fetch URLs from a PDF. The PDF contains a URL that is wrapped across multiple lines, like the one below:

https://www.flipkart.com/dell‐vostro‐14‐3000‐core‐i5‐8th‐gen‐8‐gb‐1‐tb‐hdd‐linux‐2‐gb‐graphics‐vos‐3480‐
laptop/p/itmf1a0a2f37df6d?pid=COMFHTHRPGSA9DZZ&lid=LSTCOMFHTHRPGSA9DZZNGCK4U&marketplace=FLIPKART&srno=s_1_9&otracker
=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&otracker1=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&fm=SEARCH&iid=76edd
cb3‐cab7‐4918‐bdb0‐
d9f7bf73e149.COMFHTHRPGSA9DZZ.SEARCH&ppt=sp&ppn=sp&ssid=cu8ijgssu80000001588912967680&qH=40034c3bbcbbd998

I used this regex:
url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[.$-_@.&+]|[!*,\n]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
output: https://www.flipkart.com/dell

I also tried urlextract. In both cases at most the first line of the URL comes back in the output.

output: https://www.flipkart.com/dell‐vostro‐14‐3000‐core‐i5‐8th‐gen‐8‐gb‐1‐tb‐hdd‐linux‐2‐gb‐graphics‐vos‐3480‐

Most of the time I get only one line. Can you help me here? The URL above will be mixed in with other data.
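
One possible explanation for the truncated matches, assuming the hyphens in the extracted text are the Unicode hyphen U+2010 (as they appear to be in the sample above) rather than the ASCII "-": the character classes in the regex only cover ASCII, so the match ends at "dell". The sketch below normalizes those hyphens and then rejoins wrapped lines. The names `extract_wrapped_urls`, `DASHES` and `WRAPPED_URL` are hypothetical, and the rule "a continuation line is a line with no whitespace in it" is a heuristic assumption, not something the thread confirms.

```python
import re

# Unicode hyphens that PDF text extraction may emit instead of ASCII "-".
# U+2010 is HYPHEN, U+2011 is NON-BREAKING HYPHEN.
DASHES = str.maketrans({"\u2010": "-", "\u2011": "-"})

# A URL start, followed by any number of continuation lines that contain
# no whitespace at all (heuristic for a URL wrapped over several lines).
WRAPPED_URL = re.compile(r"https?://\S+(?:\n\S+(?=\n|\Z))*")


def extract_wrapped_urls(text):
    """Return URLs from text, rejoining URLs that were wrapped across lines."""
    text = text.replace("\r\n", "\n").translate(DASHES)
    # Remove the line breaks inside each match to get the URL back in one piece.
    return [m.group(0).replace("\n", "") for m in WRAPPED_URL.finditer(text)]
```

A blank line or a line containing spaces ends the URL. A single-word line directly after the URL would be absorbed by mistake, so the result still needs a sanity check on real data.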

RE: Retrieve URL's from a pdf - keuninkske - May-13-2020

hello,

I don't know whether your PDF has other text between the URLs.

If it doesn't, you can put all the text from the PDF on one line
and then break the line up every time you find 'http' or 'https'.
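
A minimal sketch of that suggestion, assuming the extracted text holds nothing but wrapped URLs. The function name `split_on_http` is hypothetical.

```python
def split_on_http(text):
    """Join all lines, then cut the result at every occurrence of 'http'."""
    # Collapse line breaks so each wrapped URL becomes one run of characters.
    one_line = "".join(text.splitlines())
    # Whatever comes before the first "http" is not a URL, so drop it.
    pieces = one_line.split("http")[1:]
    # str.split removes the separator, so put it back on each piece.
    return ["http" + piece for piece in pieces]
```

Since 'https' also starts with 'http', a single split covers both schemes. A URL that contains "http" again further along, for example in a redirect parameter, would be cut in two by this approach.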

RE: Retrieve URL's from a pdf - AjayBachu - May-14-2020

Yes, the PDF has other text data as well.

Thanks for your reply. Let me know if you have any other way to fetch URLs that are embedded in a PDF along with other text.

Extract Multiline URL's from pdf/text file - AjayBachu - May-15-2020

I have a file with a list of URLs along with other data.

We need to extract the URLs.

I tried URLExtract, but it does not fetch more than one line.

RE: Retrieve URL's from a pdf - AjayBachu - May-15-2020

How do I split the line when we find http or https?

RE: Retrieve URL's from a pdf - keuninkske - May-17-2020

have you googled your latest question already?

string.split('http')

RE: Retrieve URL's from a pdf - AjayBachu - May-18-2020

If I do it like this, 'http' gets removed from the result.

string.split('http')
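
This is how str.split behaves: the separator is consumed and does not appear in the pieces. One way to keep it, sketched here with a made-up sample line, is re.split with a lookahead, which splits at the position just before the scheme without consuming anything.

```python
import re

# Made-up sample; the real input would be the text extracted from the PDF.
line = "http://example.com/a https://example.org/b"

# str.split consumes the separator, so "http" is missing from each piece.
plain = line.split("http")

# A lookahead matches the position before the scheme without consuming it,
# so each piece keeps its http:// or https:// prefix.
# Splitting on an empty match like this needs Python 3.7 or later.
kept = re.split(r"(?=https?://)", line)

# Discard pieces that are not URLs and trim surrounding whitespace.
urls = [piece.strip() for piece in kept if piece.startswith("http")]
```

Here `urls` ends up as the two complete URLs, while `plain` holds the same pieces with the leading 'http' stripped off.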

RE: Retrieve URL's from a pdf - keuninkske - May-18-2020

Easy to add back: you know exactly what was removed.

While searching and learning this,
you will find your way through Python.
At first it will be like this one, not perfect, but that's only the beginning of a learning curve.

If you ask every easy question without searching yourself,
or being too precise,
and not being solution-minded,
you will never move forward on the learning curve.

RE: Retrieve URL's from a pdf - AjayBachu - May-19-2020

Thanks for your advice, but I am not posting here without doing prior work on my end.

I am posting here after doing my own research.
You can reply if you have a solution; otherwise, no issues.