Python Forum
Retrieve URLs from a PDF
#1
Hi Team,

I have a requirement to fetch URLs from a PDF. In the PDF I have a URL that spans multiple lines, like the one below:

https://www.flipkart.com/dell‐vostro‐14‐...s‐vos‐3480
laptop/p/itmf1a0a2f37df6d?pid=COMFHTHRPGSA9DZZ&lid=LSTCOMFHTHRPGSA9DZZNGCK4U&marketplace=FLIPKART&srno=s_1_9&otracker
=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&otracker1=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&fm=SEARCH&iid=76edd
cb3‐cab7‐4918‐bdb0‐
d9f7bf73e149.COMFHTHRPGSA9DZZ.SEARCH&ppt=sp&ppn=sp&ssid=cu8ijgssu80000001588912967680&qH=40034c3bbcbbd998

I used a regex:
url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[.$-_@.&+]|[!*\(\),\n]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
output: https://www.flipkart.com/dell

I used urlextract as well; in both cases only the first line of the URL appears in the output.

output: https://www.flipkart.com/dell‐vostro‐14‐...s‐vos‐3480

Most of the time I am getting only one line. Can you help me here? The URL above is clubbed together with the other data.
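
For reference, a likely diagnosis: the character classes in that regex only cover ASCII, and the pasted link contains non-ASCII hyphens, so the match stops right after "dell"; a broader pattern gets the whole first line but still stops at the line break. A minimal, hedged sketch with a made-up string standing in for the extracted PDF text:

import re

# Made-up stand-in for text pulled out of a PDF: the URL contains a
# non-ASCII hyphen (\u2010) and wraps onto a second line.
text = "https://www.flipkart.com/dell\u2010vostro\u201014-laptop\nmore text"

# A permissive pattern: anything that is not whitespace counts as part of the URL.
urls = re.findall(r"https?://\S+", text)
print(urls)  # the whole first line is matched now, but the match still ends at the newline

Joining the wrapped lines back together before matching, as suggested in the replies below, is what recovers the full URL.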
#2
Hello,

I don't know if your PDF has other text between the URLs.

Otherwise, you can take all the text from the PDF as one line and then break up the line every time you find 'http' or 'https'.
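
A bare-bones sketch of that suggestion, with a made-up string standing in for the extracted PDF text. The split uses a lookahead (Python 3.7+) so the 'http'/'https' prefix stays attached to each chunk; note that once the newlines are removed, this only works cleanly when a space or blank line separates the end of the URL from whatever text follows it.

import re

# Made-up stand-in for the text extracted from the PDF: the URL is wrapped
# across two lines and is surrounded by other text.
pdf_text = (
    "some description text\n"
    "https://www.example.com/item?id=1&ref=\n"
    "abc123 \n"
    "more unrelated text\n"
)

# 1) Put everything on one line so the wrapped URL is joined back together.
one_line = pdf_text.replace("\n", "")

# 2) Break the line just before every http/https; the lookahead keeps the prefix.
chunks = re.split(r"(?=https?://)", one_line)

# chunks[0] is whatever came before the first URL; every later chunk starts
# with a URL, so cut it off at the first whitespace.
urls = [chunk.split()[0] for chunk in chunks[1:]]
print(urls)  # ['https://www.example.com/item?id=1&ref=abc123']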
#3
Yes, the PDF has other text data as well.

Thanks for your reply. Let me know if you have any other way to fetch URLs embedded in a PDF along with the other text.
#4
I have a file with a list of URLs along with other data.

We need to extract the URLs.

I tried URLExtract, but it is not fetching more than one line.
#5
How do we split the line when we find 'http' or 'https'?
#6
Have you googled your latest question already?

string.split('http')
#7
If I do it like this, 'http' gets removed from the result.

string.split('http')
#8
Easy to add back....... you know exactly what was removed.

While searching and learning things like this
you will find your way through Python.
At first it will be like this one, not perfect, but that is only the beginning of a learning curve.

Asking every easy question without searching yourself,
or being too precise,
and not being solution-minded,
you will never move forward on the learning curve.
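
For the record, a tiny sketch of what that looks like (the sample string is invented): str.split('http') drops the separator, so it just has to be put back on each piece.

string = "intro text https://example.com/a?x=1 middle http://example.com/b end"

# split() drops the 'http' separator, so glue it back onto every piece after the first.
pieces = string.split("http")
urls = ["http" + piece.split()[0] for piece in pieces[1:]]
print(urls)  # ['https://example.com/a?x=1', 'http://example.com/b']

One caveat: if the word 'http' happens to appear inside ordinary text, the split happens there too, so a regex split with a lookahead (re.split(r'(?=https?://)', string)) is a slightly safer variant.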
#9
Thanks for your advice, but I did not come here without first doing the exercise on my end.

I am posting here after doing my own research.
You can reply if you have a solution; otherwise, no issues.

