May-13-2020, 10:16 AM
Hi Team,
I need to extract URLs from a PDF. Some URLs in the PDF wrap across multiple lines, like the one below:
https://www.flipkart.com/dell‐vostro‐14‐...s‐vos‐3480‐
laptop/p/itmf1a0a2f37df6d?pid=COMFHTHRPGSA9DZZ&lid=LSTCOMFHTHRPGSA9DZZNGCK4U&marketplace=FLIPKART&srno=s_1_9&otracker
=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&otracker1=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&fm=SEARCH&iid=76edd
cb3‐cab7‐4918‐bdb0‐
d9f7bf73e149.COMFHTHRPGSA9DZZ.SEARCH&ppt=sp&ppn=sp&ssid=cu8ijgssu80000001588912967680&qH=40034c3bbcbbd998
I tried a regex:
url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[.$-_@.&+]|[!*\(\),\n]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
output: https://www.flipkart.com/dell
I also tried urlextract. In both cases only the first line of the URL comes back.
output: https://www.flipkart.com/dell‐vostro‐14‐...s‐vos‐3480‐
Most of the time I get only the first line. Can you help me here? The URL above will be mixed in with other text in the data.
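To make the goal concrete, here is a rough sketch of the kind of result I am after. It is not a tested solution for my PDF. It rests on two assumptions: the hyphens in the extracted text appear to be Unicode hyphens (such as U+2010) rather than ASCII "-", which may be why the regex stops after "dell"; and a wrapped URL runs to the end of its line, with every continuation line being non-empty and free of whitespace. The names `extract_wrapped_urls`, `HYPHENS`, and `URL_START` are hypothetical, and the sample text uses a made-up example.com URL.

```python
import re

# Unicode hyphen/dash variants that PDF text extraction may emit
# instead of ASCII "-" (assumption: these are the ones in the data).
HYPHENS = str.maketrans({"\u2010": "-", "\u2011": "-", "\u2012": "-", "\u2013": "-"})

# A URL start followed by any run of non-whitespace characters.
URL_START = re.compile(r"https?://\S+")


def extract_wrapped_urls(text):
    """Return URLs found in text, rejoining URLs wrapped across lines.

    Heuristic only: a URL is treated as wrapped when it reaches the end
    of its line, and following lines are appended while they are
    non-empty, contain no whitespace, and do not start a new URL.
    """
    lines = [line.strip() for line in text.translate(HYPHENS).splitlines()]
    urls = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        matches = list(URL_START.finditer(line))
        if not matches:
            continue
        found = [m.group(0) for m in matches]
        # Only a URL that runs to the end of its line can have been wrapped.
        if matches[-1].end() == len(line):
            while (
                i < len(lines)
                and lines[i]
                and not any(ch.isspace() for ch in lines[i])
                and not URL_START.match(lines[i])
            ):
                found[-1] += lines[i]
                i += 1
        urls.extend(found)
    return urls


if __name__ == "__main__":
    # Made-up sample: a URL wrapped over three lines with U+2010 hyphens.
    sample = "see\nhttps://example.com/a\u2010b\u2010\nc/d?x=1&y\n=2\nnext line here\n"
    for url in extract_wrapped_urls(sample):
        print(url)
```

The obvious weakness is that a one-word line directly after a URL would be glued onto it, so the stopping rule would need tuning against the real PDF text.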