Retrieve URL's from a pdf - Python Forum (https://python-forum.io), General Coding Help
Retrieve URL's from a pdf - AjayBachu - May-13-2020

Hi Team,

I have a requirement to fetch URLs from a PDF. The PDF contains a URL that wraps across multiple lines, like the one below:

https://www.flipkart.com/dell‐vostro‐14‐3000‐core‐i5‐8th‐gen‐8‐gb‐1‐tb‐hdd‐linux‐2‐gb‐graphics‐vos‐3480‐
laptop/p/itmf1a0a2f37df6d?pid=COMFHTHRPGSA9DZZ&lid=LSTCOMFHTHRPGSA9DZZNGCK4U&marketplace=FLIPKART&srno=s_1_9&otracker
=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&otracker1=AS_QueryStore_HistoryAutoSuggest_1_3_na_na_na&fm=SEARCH&iid=76edd
cb3‐cab7‐4918‐bdb0‐
d9f7bf73e149.COMFHTHRPGSA9DZZ.SEARCH&ppt=sp&ppn=sp&ssid=cu8ijgssu80000001588912967680&qH=40034c3bbcbbd998

I used a regex:

url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[.$-_@.&+]|[!*\(\),\n]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)

Output: https://www.flipkart.com/dell

I also tried urlextract. In both cases only the first line of the URL comes back in the output.

Output: https://www.flipkart.com/dell‐vostro‐14‐3000‐core‐i5‐8th‐gen‐8‐gb‐1‐tb‐hdd‐linux‐2‐gb‐graphics‐vos‐3480‐

Most of the time I get only one line. Can you help me here? The URL above will be mixed in with other data.

RE: Retrieve URL's from a pdf - keuninkske - May-13-2020

Hello, I don't know if your PDF has other text between the URLs. If it doesn't, you can take all the text from the PDF as one line and then break up the line every time you find 'http' or 'https'.

RE: Retrieve URL's from a pdf - AjayBachu - May-14-2020

Yes, the PDF has other text data as well. Thanks for your reply. Let me know if you have any other way to fetch URLs embedded in a PDF along with other text.

Extract Multiline URL's from pdf/text file - AjayBachu - May-15-2020

I have a file with a list of URLs along with other data. We need to extract the URLs. I tried URLExtract, but it does not fetch more than one line.

RE: Retrieve URL's from a pdf - AjayBachu - May-15-2020

How do I split the line when we find http or https?

RE: Retrieve URL's from a pdf - keuninkske - May-17-2020

Have you googled your latest question already?
string.split('http')

RE: Retrieve URL's from a pdf - AjayBachu - May-18-2020

If I do it like this, 'http' gets removed from the result.

string.split('http')

RE: Retrieve URL's from a pdf - keuninkske - May-18-2020

It's easy to add back; you know exactly what was removed. By searching and learning this way you will find your way through Python. At first your code will be like this one, not perfect, but that's only the beginning of a learning curve. If you ask every easy question without searching yourself, or without being precise and solution-minded, you will never move forward on the learning curve.

RE: Retrieve URL's from a pdf - AjayBachu - May-19-2020

Thanks for your advice, but I did not come here without first doing the work on my end. I am posting here after doing my own research. You can reply if you have a solution; otherwise, no issues.
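
For reference, here is a minimal sketch of the two ideas raised in this thread. It is not a tested solution for any particular PDF. The URL as pasted in the first post appears to contain Unicode hyphens (U+2010) rather than ASCII '-', which would explain why the regex stopped at "dell", so the sketch normalizes those first. The function names (normalize_hyphens, split_keep_http, extract_wrapped_urls) are hypothetical, and the code assumes the PDF text has already been extracted into a string with its line breaks intact. It also assumes that a wrapped URL continues only onto lines that contain no spaces and that it ends at a blank line or at the first line of ordinary text; a one-word line directly after a URL would be wrongly glued on.

```python
import re

# Unicode dash characters that PDF extraction may emit instead of ASCII '-'
# (assumption: these four cover the cases seen in the thread).
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013"), "-")


def normalize_hyphens(text):
    """Replace common Unicode hyphens/dashes with the ASCII hyphen."""
    return text.translate(_DASHES)


def split_keep_http(text):
    """Split before every 'http://' or 'https://' so each chunk keeps its scheme.

    Splitting on a zero-width lookahead needs Python 3.7 or newer.
    Each chunk still carries whatever text followed the URL.
    """
    chunks = re.split(r"(?=https?://)", text)
    return [chunk for chunk in chunks if chunk.startswith("http")]


# A URL start, followed by any directly following lines that consist of a
# single unbroken token (heuristic for "this line is a continuation").
_WRAPPED_URL = re.compile(
    r"https?://\S+(?:[ \t]*\n[ \t]*\S+[ \t]*(?=\n|\Z))*"
)


def extract_wrapped_urls(text):
    """Return URLs from text, re-joining URLs that wrap across lines."""
    text = normalize_hyphens(text)
    # Drop the line breaks (and any stray spaces) inside each match.
    return [re.sub(r"\s+", "", m.group(0)) for m in _WRAPPED_URL.finditer(text)]


if __name__ == "__main__":
    sample = (
        "See https://example.com/a\u2010b\u2010\n"
        "c/d?x=1&y\n"
        "=2\n"
        "\n"
        "Other text here"
    )
    print(extract_wrapped_urls(sample))
    # prints ['https://example.com/a-b-c/d?x=1&y=2']
```

split_keep_http answers the narrower question of keeping 'http' when splitting, while extract_wrapped_urls is the heuristic for text where URLs are mixed with other content. The continuation rule will need adjusting for PDFs whose layout differs from the assumptions above.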