(Feb-08-2019, 02:34 AM)s_o_what Wrote: No idea what this does.......
It pulls the URL addresses out of the links in the HTML source code.
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)
When dealing with HTML/XML, it is always better to use a parser.
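A quick sketch of why the regex is fragile: it looks for anything shaped like "something.something", so it can also pick up text that is not a link at all. The sample string below is made up for illustration and is not from the original thread.

```python
import re

# Same pattern as above, written as a raw string.
pattern = r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+'

# Made-up sample text: one real link plus a version number.
sample = 'Python 3.7 is out, see <a href="https://python-forum.io/">the forum</a>'

urls = re.findall(pattern, sample)
print(urls)  # ['3.7', 'https://python-forum.io/']
```

The version number "3.7" is returned next to the real URL, because the pattern has no idea what is an href attribute and what is ordinary text.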
Example:
urls.txt:
<a href="https://python-forum.io/" target="_blank">Visit Python Forum</a>
<li class="tier-2" role="treeitem"><a href="http://docs.python.org/devguide/">Developer Guide</a></li>
<a href="ftp://theftpserver.com/files/acounts.pdf">Download file</a>
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('urls.txt', encoding='utf-8'), 'lxml')
links = soup.find_all('a', href=True)
for url in links:
    print(url.get('href'))
Output:
https://python-forum.io/
http://docs.python.org/devguide/
ftp://theftpserver.com/files/acounts.pdf
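If BeautifulSoup and lxml are not installed, the same idea can be sketched with the standard library's html.parser. This is a swapped-in alternative, not the approach used above. The class name LinkCollector is a hypothetical helper, and the HTML is inlined (same three lines as urls.txt) so the snippet runs on its own.

```python
from html.parser import HTMLParser

# Same three lines as urls.txt, inlined so the sketch is self-contained.
html = '''<a href="https://python-forum.io/" target="_blank">Visit Python Forum</a>
<li class="tier-2" role="treeitem"><a href="http://docs.python.org/devguide/">Developer Guide</a></li>
<a href="ftp://theftpserver.com/files/acounts.pdf">Download file</a>'''


class LinkCollector(HTMLParser):  # hypothetical helper name
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)
for url in collector.links:
    print(url)
```

This should print the same three addresses as the BeautifulSoup version. For anything beyond simple link extraction, BeautifulSoup is usually the more comfortable choice.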
Web-Scraping part-1