Problem with for loop - Python Forum (https://python-forum.io) - Thread: /thread-11445.html
Problem with for loop - Blue Dog - Jul-09-2018

Here we go. I am using a text file with URLs in it, one per line. I am using a for loop to go through the list one line at a time, and I feed each line into some email scraping code. I get through about the first URL, then an error. Here is my code:

    import urllib.request, re

    url_files = open("test.txt", "r")
    lines = url_files.readlines()

    for line in lines:
        text = urllib.request.urlopen(line).read().decode('utf-8')
        regex = re.compile(r'[\w.-]+@[\w.-]+')
        email = re.findall(regex, text)
        print(email)
    url_files.close

Here is the error. You can see the emails from the first URL, then the error:

    ['[email protected]', '[email protected]', '[email protected]']
    Traceback (most recent call last):
      File "D:\Users\computer\Desktop\python use file\read_fil_and_use_for_loop.py", line 7, in <module>
        text = urllib.request.urlopen(line).read().decode('utf-8')
      File "D:\Python37\lib\urllib\request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "D:\Python37\lib\urllib\request.py", line 510, in open
        req = Request(fullurl, data)
      File "D:\Python37\lib\urllib\request.py", line 328, in __init__
        self.full_url = url
      File "D:\Python37\lib\urllib\request.py", line 354, in full_url
        self._parse()
      File "D:\Python37\lib\urllib\request.py", line 383, in _parse
        raise ValueError("unknown url type: %r" % self.full_url)
    ValueError: unknown url type: ''

I have just gotten into using for loops, so I am not very good at them yet. Hope someone can help.
renny

P.S.
The list:

    http://www.cityofalbionmi.gov/
    http://www.revize.com
    http://www.cityofalbionmi.gov/government/mayor_and_city_council/index.php
    http://www.cityofalbionmi.gov/government/city_boards_commissions_and_committees/index.php
    http://www.cityofalbionmi.gov/residents/city_services/index.php
    http://www.cityofalbionmi.gov/government/state_and_county_government/indedepartments/index.php
    http://www.cityofalbionmi.gov/departments/administration.php
    http://www.cityofalbionmi.gov/departments/assessing/index.php
    http://www.cityofalbionmi.gov/departments/city_attorney/index.php
    http://www.cityofalbionmi.gov/departments/city_clerk/index.php
    http://www.cityofalbionmi.gov/departments/city_manager/index.php
    http://www.cityofalbionmi.gov/departments/code_enforcement_and_building_safety/index.php
    http://www.cityofalbionmi.gov/departments/finance_and_treasury/index.php
    http://www.cityofalbionmi.gov/departments/human_resources/index.php
    http://www.cityofalbionmi.gov/departments/planning_and_zoning/index.php
    http://www.cityofalbionmi.gov/departments/public_safety/index.php
    http://www.cityofalbionmi.gov/departments/public_services/index.php
    http://www.cityofalbionmi.gov/departments/recreation/index.php
    http://www.cityofalbionmi.gov/departments/street_department/index.php
    http://www.cityofalbionmi.gov/departments/water_and_sewer.php
    http://www.cityofalbionmi.gov/residents/index.php
    http://albionmich.com/calendar/index.html

RE: Problem with for loop - buran - Jul-09-2018

Can you run:

    import urllib.request, re

    with open("test.txt", "r") as url_file:
        for line in url_file:
            line = line.strip()
            print(line)
            text = urllib.request.urlopen(line).read().decode('utf-8')

This way we shall see which URL creates the problem, i.e. whether you get to the end of the list. But it looks like you have empty lines at the end of the URLs file... Note the changes I have made, which you should keep. Also, parsing HTML with regex is a big NO.
HTML should not be parsed with regex; read "You can't parse [X]HTML with regex".

RE: Problem with for loop - Blue Dog - Jul-09-2018

Thanks, what does line.strip do?

RE: Problem with for loop - buran - Jul-09-2018

It removes whitespace from both ends of the string. In your case it's just '\n'. However, I think it's the blank lines at the end that cause the problem.

RE: Problem with for loop - Blue Dog - Jul-09-2018

OK, I did read "HTML with regex", a lot of reading. What should I use? I also see that I am going to need some type of code that will bypass bad URLs. I think try and except will do it. I have not used try yet, but I see I am going to have to learn it now. Thank you for your time and effort.

I had a bad URL; when I removed it, the script ran the list to the end. I did remove the blank line too.

RE: Problem with for loop - buran - Jul-09-2018

By the way, no need to use strip(); it also works with '\n' at the end.
You should use parsers like BeautifulSoup or lxml to parse HTML. Also, you may want to look at Requests.

RE: Problem with for loop - Blue Dog - Jul-09-2018

OK, I use BeautifulSoup, lxml and Requests a lot for scraping. Thank you. You have been a great help.
renny

RE: Problem with for loop - nilamo - Jul-09-2018

(Jul-09-2018, 03:21 PM)Blue Dog Wrote: url_files.close

FYI, to call a function, you need parentheses: url_files.close(). As it is, you never close your file.
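
The points raised so far (blank lines in the file, bad URLs, and closing the file) can be combined in one place. What follows is only a minimal sketch using the standard library, not code posted by anyone in the thread: the helper names `read_urls` and `fetch_all` and the `timeout` default are assumptions made for illustration. It skips blank lines, which is what produced `unknown url type: ''`, wraps each fetch in its own try/except so one bad URL does not stop the loop, and relies on `with` to close the file.

```python
import urllib.request


def read_urls(path):
    """Return the non-blank lines of the file at `path`, stripped of whitespace."""
    # The with block closes the file automatically, so no explicit close() is needed.
    with open(path, "r") as url_file:
        return [line.strip() for line in url_file if line.strip()]


def fetch_all(urls, timeout=10):
    """Map each URL to its decoded page text, or to None when the fetch fails."""
    pages = {}
    for url in urls:
        # try/except sits inside the loop so that a failure skips one URL only.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                pages[url] = response.read().decode("utf-8", errors="replace")
        except (ValueError, OSError) as exc:
            # ValueError covers "unknown url type"; OSError covers URLError,
            # HTTPError and timeouts. Other exception types are not caught here.
            print(f"skipping {url!r}: {exc}")
            pages[url] = None
    return pages
```

A call such as `pages = fetch_all(read_urls("test.txt"))` would then give one entry per URL, with `None` marking the ones that could not be fetched. Naming the exception types, rather than using a bare `except`, keeps unrelated bugs visible.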

RE: Problem with for loop - Blue Dog - Jul-10-2018

Thank you nilamo, I will fix that as soon as I get off from here. Now, buran, here is my first try at not using regex.

[Image: kali.png]

Can you tell me what this error is and how to fix it? Here is the old email scraper:

    import urllib.request, re

    try:
        with open("test.txt", "r") as url_file:
            for line in url_file:
                line = line.strip()
                print(line)
                text = urllib.request.urlopen(line).read().decode('utf-8')
                regex = re.compile(r'[\w.-]+@[\w.-]+')
                email = re.findall(regex, text)
                print(email)
    except:
        pass
    url_file.close()

I am working on a new one that does not use regex. If anyone wants to use this code, you are welcome to.
renny

RE: Problem with for loop - buran - Jul-10-2018

(Jul-10-2018, 01:31 AM)Blue Dog Wrote: Now, buran, here is my first try at not using regex.

Well, clearly you still use regex in that one...
Please always copy/paste code, traceback, output, etc. and use proper tags. Don't post images of code, traceback, etc.
Now to the error: sys.argv[1] means you expect the user to pass something (the URL) when they run the script, e.g.

    $ python new_email_graber.py www.mysite.com

sys.argv[0] is the script name.
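
The screenshot is not available here, so the exact error is unknown. buran's reply suggests the script was started without a URL, in which case `sys.argv[1]` does not exist and indexing it raises `IndexError`. A minimal sketch of guarding against that, where the helper name `get_url` and the printed messages are assumptions made for illustration:

```python
import sys


def get_url(argv):
    """Return the first command-line argument, or None if none was given."""
    # argv[0] is the script name; argv[1] exists only when the user passed one.
    if len(argv) < 2:
        return None
    return argv[1]


if __name__ == "__main__":
    url = get_url(sys.argv)
    if url is None:
        # Explain how to run the script instead of failing on sys.argv[1].
        print(f"usage: python {sys.argv[0]} URL")
    else:
        print(f"would scrape: {url}")
```

Checking the length of `sys.argv` first turns a missing argument into a usage message rather than a traceback.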