Python Forum
Problem with for loop - Printable Version




Problem with for loop - Blue Dog - Jul-09-2018

Here we go. I am using a text file with URLs in it, one per line, and a for loop to go through the list one line at a time. I then feed each line from that loop into some email-scraping
code. I get through about the first URL and then hit an error.
Here is my code:

import urllib.request, re

url_files = open("test.txt", "r")
lines = url_files.readlines()
for line in lines:
      
        text = urllib.request.urlopen(line).read().decode('utf-8')
        regex = re.compile(r'[\w.-]+@[\w.-]+')
        email = re.findall(regex, text)
        print(email)

url_files.close
Here is the error. You can see the emails from the first URL, then the traceback:

['[email protected]', '[email protected]', '[email protected]']
Traceback (most recent call last):
  File "D:\Users\computer\Desktop\python use file\read_fil_and_use_for_loop.py", line 7, in <module>
    text = urllib.request.urlopen(line).read().decode('utf-8')
  File "D:\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python37\lib\urllib\request.py", line 510, in open
    req = Request(fullurl, data)
  File "D:\Python37\lib\urllib\request.py", line 328, in __init__
    self.full_url = url
  File "D:\Python37\lib\urllib\request.py", line 354, in full_url
    self._parse()
  File "D:\Python37\lib\urllib\request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''

I have only just started using for loops, so I am not very good at them yet.
I hope someone can help.
renny
P.S. Here is the list:

http://www.cityofalbionmi.gov/
http://www.revize.com
http://www.cityofalbionmi.gov/government/mayor_and_city_council/index.php
http://www.cityofalbionmi.gov/government/city_boards_commissions_and_committees/index.php
http://www.cityofalbionmi.gov/residents/city_services/index.php
http://www.cityofalbionmi.gov/government/state_and_county_government/indedepartments/index.php
http://www.cityofalbionmi.gov/departments/administration.php
http://www.cityofalbionmi.gov/departments/assessing/index.php
http://www.cityofalbionmi.gov/departments/city_attorney/index.php
http://www.cityofalbionmi.gov/departments/city_clerk/index.php
http://www.cityofalbionmi.gov/departments/city_manager/index.php
http://www.cityofalbionmi.gov/departments/code_enforcement_and_building_safety/index.php
http://www.cityofalbionmi.gov/departments/finance_and_treasury/index.php
http://www.cityofalbionmi.gov/departments/human_resources/index.php
http://www.cityofalbionmi.gov/departments/planning_and_zoning/index.php
http://www.cityofalbionmi.gov/departments/public_safety/index.php
http://www.cityofalbionmi.gov/departments/public_services/index.php
http://www.cityofalbionmi.gov/departments/recreation/index.php
http://www.cityofalbionmi.gov/departments/street_department/index.php
http://www.cityofalbionmi.gov/departments/water_and_sewer.php
http://www.cityofalbionmi.gov/residents/index.php
http://albionmich.com/calendar/index.html


RE: Problem with for loop - buran - Jul-09-2018

Can you run this:
import urllib.request, re
 
with open("test.txt", "r") as url_file:
    for line in url_file:
        line = line.strip()
        print(line)
        text = urllib.request.urlopen(line).read().decode('utf-8')
This way we shall see which URL creates the problem, i.e. whether you get to the end of the list. But it looks like you have empty lines at the end of the URLs file...

Note the changes I have made, which you should keep.
Also, parsing HTML with regex is a big NO. HTML should not be parsed with regex; read "You can't parse [X]HTML with regex".


RE: Problem with for loop - Blue Dog - Jul-09-2018

Thanks. What does line.strip() do?


RE: Problem with for loop - buran - Jul-09-2018

It removes whitespace from both ends of the string. In your case that is just the '\n'.
However, I think it's the blank lines at the end that cause the problem.
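
A minimal sketch of both points, assuming the lines arrive exactly as a file would yield them (each with its trailing newline). The helper name clean_urls is hypothetical and used only for illustration:

```python
def clean_urls(lines):
    """Return the non-empty, whitespace-stripped entries of `lines`."""
    urls = []
    for line in lines:
        url = line.strip()   # removes leading/trailing whitespace, including '\n'
        if url:              # a blank line strips down to '', which is falsy
            urls.append(url)
    return urls


# Lines as they would come out of a file: each keeps its trailing newline.
sample = [
    "http://www.revize.com\n",
    "\n",
    "http://albionmich.com/calendar/index.html\n",
    "   \n",
]
print(clean_urls(sample))
# → ['http://www.revize.com', 'http://albionmich.com/calendar/index.html']
```

A blank line becomes the empty string after strip(), which matches the '' shown in the ValueError above, so filtering those out before calling urlopen() avoids that particular error.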


RE: Problem with for loop - Blue Dog - Jul-09-2018

OK, I did read about parsing HTML with regex; it was a lot of reading. What should I use instead?
I also see that I am going to need some code that will skip
bad URLs. I think try and except will do it. I have not used try yet, but I see I am going to have to learn it now.
Thank you for your time and effort.

I had a bad URL; when I removed it, the script ran through the list to the end.

I removed the blank line too.
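
One way the try/except idea could look, as a rough sketch rather than a finished scraper. The function name fetch_all and its fetch parameter are hypothetical; the parameter exists only so the loop can be tried without a network connection. The key point is that the try sits inside the loop, so one bad URL does not stop the rest:

```python
import urllib.request


def fetch_all(urls, fetch=None):
    """Fetch each URL; return (pages, failures) instead of stopping at the first error."""
    if fetch is None:
        # Default fetcher: download the page and decode it as UTF-8.
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8")

    pages, failures = {}, {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (ValueError, OSError) as exc:
            # ValueError covers malformed or empty URLs and decoding errors;
            # OSError covers urllib's URLError/HTTPError and timeouts.
            # Record the problem and move on to the next URL.
            failures[url] = exc
    return pages, failures
```

Called as pages, failures = fetch_all(list_of_urls), it leaves the good pages in one dictionary and the exceptions for the bad URLs in the other, so they can be inspected afterwards rather than silently ignored.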


RE: Problem with for loop - buran - Jul-09-2018

By the way, there is no need to use strip(); it also works with the '\n' at the end.
You should use a parser like BeautifulSoup or lxml to parse HTML.
Also, you may want to look at Requests.
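
To give a feel for the parser approach without installing anything, here is a small sketch that uses the standard library's html.parser as a stand-in; BeautifulSoup or lxml would be the more usual choice. The class name MailtoCollector and the example.org address are made up for illustration, and the sketch only finds addresses in mailto: links, not ones written as plain text:

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any "?subject=..." query part.
                address = value[len("mailto:"):].split("?", 1)[0]
                self.emails.append(address)


html = (
    '<p>Contact <a href="mailto:clerk@example.org?subject=Hi">the clerk</a> '
    'or <a href="/about">read more</a>.</p>'
)
collector = MailtoCollector()
collector.feed(html)
print(collector.emails)
# → ['clerk@example.org']
```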


RE: Problem with for loop - Blue Dog - Jul-09-2018

OK, I use BeautifulSoup, lxml and Requests a lot for scraping.
Thank you
You have been a great help
renny


RE: Problem with for loop - nilamo - Jul-09-2018

(Jul-09-2018, 03:21 PM)Blue Dog Wrote: url_files.close

FYI, to call a function you need parentheses: url_files.close(). As it is, you never close your file.
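
A small sketch of the difference, using a throwaway temporary file so that it runs on its own (the variable names are only for illustration):

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

f = open(path, "r")
f.close                      # only looks up the method; nothing is called
still_open = not f.closed
f.close()                    # the parentheses actually call it
now_closed = f.closed
print(still_open, now_closed)
# → True True

# A with block closes the file automatically, even if an error occurs inside it.
with open(path, "r") as g:
    pass
print(g.closed)
# → True

os.remove(path)
```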


RE: Problem with for loop - Blue Dog - Jul-10-2018

Thank you nilamo, I will fix that as soon as I get off here.
Now, buran, here is my first try at not using regex.

[Image: kali.png]


Can you tell me what this error is and how to fix it?

Here is the old email scraper:

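import re
import urllib.request

# Compile the pattern once, outside the loop.
regex = re.compile(r'[\w.-]+@[\w.-]+')

with open("test.txt", "r") as url_file:
    for line in url_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        print(line)
        try:
            text = urllib.request.urlopen(line).read().decode('utf-8')
        except (ValueError, OSError) as err:
            # Bad or unreachable URL: report it and carry on with the next one.
            print("skipped:", err)
            continue
        print(regex.findall(text))
# No url_file.close() is needed: the with block closes the file.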
        
I am working on a new one that does not use regex. If anyone wants to use this code, you are welcome to.
renny


RE: Problem with for loop - buran - Jul-10-2018

(Jul-10-2018, 01:31 AM)Blue Dog Wrote: Now, buran, here is my first try at not using regex.
Well, clearly you still use regex in that one...
Please always copy/paste code, tracebacks, output, etc. and use proper tags. Don't post images of code or tracebacks.

Now to the error: sys.argv[1] means you expect the user to pass something (the URL) when they run the script, e.g.
$ python new_email_graber.py www.mysite.com
sys.argv[0] is the script name.
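
A short sketch of how the script could guard against a missing argument. Indexing sys.argv[1] when nothing was passed raises IndexError, which is presumably what the screenshot showed. The helper name url_from_args is hypothetical:

```python
import sys


def url_from_args(argv):
    """Return the first command-line argument, or None if none was given."""
    # argv[0] is the script name; argv[1] exists only if the user passed something.
    if len(argv) < 2:
        return None
    return argv[1]


# Simulated argument lists, as sys.argv would look in each case.
print(url_from_args(["new_email_graber.py"]))
# → None
print(url_from_args(["new_email_graber.py", "www.mysite.com"]))
# → www.mysite.com

# In the real script, check before using the value:
url = url_from_args(sys.argv)
if url is None:
    print("usage: python new_email_graber.py <url>")
```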