Python Forum

Full Version: Problem with for loop
Here we go. I am using a text file with URLs in it, one per line. I am using a for loop to go through the list one line at a time, then feeding each line into an email-scraping code. I get through about the first URL, then an error.
Here is my code:

import urllib.request, re

url_files = open("test.txt", "r")
lines = url_files.readlines()
for line in lines:
    text = urllib.request.urlopen(line).read().decode('utf-8')
    regex = re.compile(r'[\w.-]+@[\w.-]+')
    email = re.findall(regex, text)
    print(email)

url_files.close
Here is the error; you can see the emails from the first URL, then the traceback:

['[email protected]', '[email protected]', '[email protected]']
Traceback (most recent call last):
File "D:\Users\computer\Desktop\python use file\read_fil_and_use_for_loop.py", line 7, in <module>
text = urllib.request.urlopen(line).read().decode('utf-8')
File "D:\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "D:\Python37\lib\urllib\request.py", line 510, in open
req = Request(fullurl, data)
File "D:\Python37\lib\urllib\request.py", line 328, in __init__
self.full_url = url
File "D:\Python37\lib\urllib\request.py", line 354, in full_url
self._parse()
File "D:\Python37\lib\urllib\request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''

I have just gotten into using for loops, so I am not very good at them yet.
Hope someone can help.
renny
ps. the list

http://www.cityofalbionmi.gov/
http://www.revize.com
http://www.cityofalbionmi.gov/government.../index.php
http://www.cityofalbionmi.gov/government.../index.php
http://www.cityofalbionmi.gov/residents/.../index.php
http://www.cityofalbionmi.gov/government.../index.php
http://www.cityofalbionmi.gov/department...ration.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department..._sewer.php
http://www.cityofalbionmi.gov/residents/index.php
http://albionmich.com/calendar/index.html
Can you run
import urllib.request, re
 
with open("test.txt", "r") as url_file:
    for line in url_file:
        line = line.strip()
        print(line)
        text = urllib.request.urlopen(line).read().decode('utf-8')
This way we shall see which URL creates a problem, i.e. whether you get to the end of the list. But it looks like you have empty lines at the end of the URLs file...

Note the changes I have made, which you should keep.
Also, parsing HTML with regex is a big NO. HTML should not be parsed with regex; read You can't parse [X]HTML with regex.
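The ValueError in the traceback can be reproduced without any network access, which shows it is a parsing problem, not a connection problem: urlopen() rejects an empty URL string before it ever tries to connect. A minimal demonstration:

```python
import urllib.request

# An empty line read from the file becomes an empty URL string,
# which urlopen() rejects immediately while parsing the URL.
try:
    urllib.request.urlopen('')
except ValueError as err:
    print(err)  # reports an unknown url type
```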
Thanks, what does line.strip() do?
Removes whitespace from both ends of the string. In your case it's just '\n'.
However, I think it's the blank lines at the end that cause the problem.
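A quick illustration of what strip() does to a line read from a file:

```python
line = 'http://www.example.com\n'  # readlines() keeps the trailing newline
print(repr(line))                  # repr() makes the '\n' visible
print(repr(line.strip()))          # whitespace removed from both ends

blank = '   \n'                    # a blank line strips down to nothing
print(repr(blank.strip()))
```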
OK, I did the reading on HTML and regex, a lot of reading. What should I use?
I also see that I am going to need some code that will bypass a
bad URL. I think try and except will do it. I have not used try yet, but I see I am going to have to learn it now.
Thank you for your time and effort.
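As a sketch of the try/except idea (fetch_text is a hypothetical helper name, not anything from the thread's code), a bad or empty URL can be skipped instead of crashing the loop:

```python
import urllib.request
import urllib.error

def fetch_text(url):
    """Return the page text, or None if the URL is bad or unreachable."""
    try:
        return urllib.request.urlopen(url).read().decode('utf-8')
    except (ValueError, urllib.error.URLError):
        return None  # skip this URL instead of stopping the program

# An empty string is a bad URL, so it is skipped rather than raising:
print(fetch_text(''))
```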

I had a bad URL; when I removed it, the list ran to the end.

I did remove the blank line too.
By the way, there is no need to use strip(); it also works with '\n' at the end.
You should use parsers like BeautifulSoup or lxml to parse HTML.
Also, you may want to look at Requests.
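Since BeautifulSoup, lxml and Requests are third-party packages, here is a minimal sketch of the same idea using only the stdlib html.parser: instead of regexing the raw HTML, walk the parsed tags and pull addresses out of mailto: links. The sample HTML and the address in it are made up for illustration:

```python
from html.parser import HTMLParser

class MailtoParser(HTMLParser):
    """Collect email addresses from mailto: links in an HTML page."""
    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.startswith('mailto:'):
                    self.emails.append(value[len('mailto:'):])

html = '<p>Contact <a href="mailto:clerk@example.com">the clerk</a></p>'
parser = MailtoParser()
parser.feed(html)
print(parser.emails)
```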
OK, I will use BeautifulSoup, lxml and Requests a lot for scraping.
Thank you
You have been a great help
renny
(Jul-09-2018, 03:21 PM)Blue Dog Wrote: [ -> ]url_files.close

FYI, to call a function you need parentheses: url_files.close(). As it is, you never close your file.
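The difference is easy to see with any file-like object; naming a method only refers to it, while the parentheses actually call it:

```python
import io

f = io.StringIO("some data")  # an in-memory file-like object
f.close           # just a bound-method object; nothing happens
print(f.closed)   # False - the "file" is still open
f.close()         # the parentheses make the call
print(f.closed)   # True
```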
Thank you nilamo, I will fix that as soon as I get off from here.
Now, buran, here is my first try at not using regex.

[Image: kali.png]


Can you tell me what this error is and how to fix it?

Here is the old email scraper:

import urllib.request, urllib.error, re

with open("test.txt", "r") as url_file:
    for line in url_file:
        line = line.strip()
        print(line)
        try:
            text = urllib.request.urlopen(line).read().decode('utf-8')
        except (ValueError, urllib.error.URLError):
            continue  # skip a bad URL instead of stopping the whole loop
        regex = re.compile(r'[\w.-]+@[\w.-]+')
        email = re.findall(regex, text)
        print(email)
I am working on a new one not using regex. If anyone wants to use this code, you are welcome to.
renny
(Jul-10-2018, 01:31 AM)Blue Dog Wrote: [ -> ]Now, buran, here is my first try at not using regex.
Well, clearly you still use regex in that one...
Please, always copy/paste code, traceback, output, etc. and use proper tags. Don't post images of code, traceback, etc.

Now to the error: sys.argv[1] means you expect the user to pass something (the URL) when they run the script, e.g.
$ python new_email_graber.py www.mysite.com. sys.argv[0] is the script name.
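A minimal sketch of reading the URL from the command line, with a fallback default so the script still runs when no argument is given (the default URL here is just a placeholder):

```python
import sys

# sys.argv[0] is the script name; sys.argv[1] is the first user argument.
# Fall back to a placeholder URL when the user passes nothing.
url = sys.argv[1] if len(sys.argv) > 1 else "http://www.example.com"
print(url)
```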