Python Forum
Problem with for loop
#1
Here we go. I am using a text file with URLs in it, one per line, and a for loop to go through the list one line at a time. I then feed each line into an email-scraping script.
I get the emails from the first URL, then an error.
Here is my code:

import urllib.request, re

url_files = open("test.txt", "r")
lines = url_files.readlines()
for line in lines:
      
        text = urllib.request.urlopen(line).read().decode('utf-8')
        regex = re.compile(r'[\w.-]+@[\w.-]+')
        email = re.findall(regex, text)
        print(email)

url_files.close
Here is the output. You can see the emails from the first URL, then the error:

['[email protected]', '[email protected]', '[email protected]']
Traceback (most recent call last):
  File "D:\Users\computer\Desktop\python use file\read_fil_and_use_for_loop.py", line 7, in <module>
    text = urllib.request.urlopen(line).read().decode('utf-8')
  File "D:\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python37\lib\urllib\request.py", line 510, in open
    req = Request(fullurl, data)
  File "D:\Python37\lib\urllib\request.py", line 328, in __init__
    self.full_url = url
  File "D:\Python37\lib\urllib\request.py", line 354, in full_url
    self._parse()
  File "D:\Python37\lib\urllib\request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''

I have only just started using for loops, so I am not very good with them yet.
I hope someone can help.
renny
P.S. Here is the list:

http://www.cityofalbionmi.gov/
http://www.revize.com
http://www.cityofalbionmi.gov/government.../index.php
http://www.cityofalbionmi.gov/government.../index.php
http://www.cityofalbionmi.gov/residents/.../index.php
http://www.cityofalbionmi.gov/government.../index.php
http://www.cityofalbionmi.gov/department...ration.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department.../index.php
http://www.cityofalbionmi.gov/department..._sewer.php
http://www.cityofalbionmi.gov/residents/index.php
http://albionmich.com/calendar/index.html
Reply
#2
Can you run this?
import urllib.request, re
 
with open("test.txt", "r") as url_file:
    for line in url_file:
        line = line.strip()
        print(line)
        text = urllib.request.urlopen(line).read().decode('utf-8')
This way we shall see which URL causes the problem, i.e. whether you get to the end of the list. It looks like you have empty lines at the end of the URLs file...

Note the changes I have made, which you should keep.
Also, parsing HTML with regex is a big NO. Read "You can't parse [X]HTML with regex".
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#3
Thanks, what does line.strip() do?
Reply
#4
It removes whitespace from both ends of the string. In your case that is just the trailing '\n'.
However, I think it's the blank lines at the end that cause the problem.
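
As a small illustration of the point above (the two sample lines are made up for the example, they are not read from test.txt): a line that holds only a newline strips down to the empty string, which is consistent with the '' shown at the end of the traceback in post #1.

```python
# Each line read from a text file keeps its trailing newline.
lines = ["http://www.revize.com\n", "\n"]

for line in lines:
    cleaned = line.strip()
    # repr() makes the otherwise invisible newline visible
    print(repr(line), "->", repr(cleaned))

# A blank line strips down to the empty string, which is falsy,
# so a check like `if not cleaned: continue` is enough to skip it.
blank_is_empty = "\n".strip() == ""
```

The empty string is what should never reach urlopen(), so checking for it right after strip() is one simple way to ignore stray blank lines in the file.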

Reply
#5
OK, I read the post about parsing HTML with regex; it was a lot of reading. What should I use instead?
I also see that I am going to need some code that will skip bad URLs. I think try and except will do it. I have not used try yet, but I see I am going to have to learn it now.
Thank you for your time and effort.

I had a bad URL; when I removed it, the script ran through the list to the end.

I removed the blank line too.
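
For the try/except idea, here is a minimal sketch rather than a finished solution. The function name fetch_pages, the opener parameter and the 10-second timeout are all assumptions made for this example; the exceptions caught are the ones urlopen() is known to raise for malformed or unreachable URLs, and the list may not be exhaustive.

```python
import urllib.error
import urllib.request


def fetch_pages(lines, opener=urllib.request.urlopen):
    """Yield (url, text) for every URL that can be fetched; skip the rest.

    `opener` is a hypothetical hook so the loop can be tried without
    network access; by default it is urllib.request.urlopen.
    """
    for line in lines:
        url = line.strip()
        if not url:
            continue  # blank line: nothing to fetch
        try:
            with opener(url, timeout=10) as response:
                # errors="replace" avoids a crash on pages that are not UTF-8
                text = response.read().decode("utf-8", errors="replace")
        except (ValueError, urllib.error.URLError) as exc:
            # ValueError: malformed URL such as '' or 'not a url'
            # URLError (and its subclass HTTPError): unreachable host, 404, ...
            print(f"skipping {url!r}: {exc}")
            continue
        yield url, text
```

It could be used as `for url, text in fetch_pages(open("test.txt")): ...`. The important detail is that the try sits inside the loop, so one bad URL is reported and skipped while the remaining lines are still processed.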
Reply
#6
By the way, there is no need to use strip(); it also works with '\n' at the end.
You should use parsers like BeautifulSoup or lxml to parse HTML.
Also, you may want to look at Requests.
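
To show the parser approach without any extra installs, here is a rough sketch that uses the standard library's html.parser instead of BeautifulSoup or lxml; it is a stand-in for those libraries, not their API. The class name MailtoCollector, the helper extract_mailto and the sample HTML are invented for the example, and it only finds addresses inside mailto: links, not addresses written as plain text.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part
                address = value[len("mailto:"):].split("?", 1)[0]
                self.emails.append(address)


def extract_mailto(html):
    """Return the mailto: addresses found in an HTML string."""
    parser = MailtoCollector()
    parser.feed(html)
    return parser.emails


# Made-up sample markup, just to exercise the parser
sample = (
    '<p><a href="mailto:clerk@example.org?subject=Hi">Clerk</a> '
    '<a href="/index.php">Home</a></p>'
)
print(extract_mailto(sample))  # ['clerk@example.org']
```

The same idea carries over to BeautifulSoup or lxml: let the parser find the a tags and read their href attributes, rather than matching raw HTML text with a pattern.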

Reply
#7
OK, I use BeautifulSoup, lxml and Requests a lot for scraping.
Thank you
You have been a great help
renny
Reply
#8
(Jul-09-2018, 03:21 PM)Blue Dog Wrote: url_files.close

FYI, to call a function you need parentheses: url_files.close(). As it is, you never close your file.
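
A quick way to see the difference for yourself. This sketch uses a temporary file so that it does not depend on test.txt; the variable names still_open and now_closed are only there for the demonstration.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on test.txt
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

url_files = open(path, "r")
url_files.close            # only looks up the method; nothing is called
still_open = not url_files.closed

url_files.close()          # the parentheses actually call it
now_closed = url_files.closed

os.remove(path)
print(still_open, now_closed)  # True True
```

With a with block, as in post #2, the file is closed automatically and the explicit close() call is not needed at all.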
Reply
#9
Thank you nilamo, I will fix that as soon as I get off from here.
Now, buran here is my first try at not using rex.

[Image: kali.png]


Can you tell me what this error is and how to fix it?

Here is the old email scraper:

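import urllib.request, re

# Compile the pattern once, outside the loop
regex = re.compile(r'[\w.-]+@[\w.-]+')

with open("test.txt", "r") as url_file:
    for line in url_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        print(line)
        try:
            text = urllib.request.urlopen(line).read().decode('utf-8')
        except Exception as error:
            # Skip a bad URL instead of silently stopping the whole run
            print("skipped:", error)
            continue
        email = regex.findall(text)
        print(email)
# No url_file.close() needed: the with block closes the file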
        
I am working on a new one that does not use regex. If anyone wants to use this code, you are welcome to.
renny
Reply
#10
(Jul-10-2018, 01:31 AM)Blue Dog Wrote: Now, buran here is my first try at not using rex.
Well, clearly you still use RegEx in that one....
Please always copy/paste code, tracebacks, output, etc. and use the proper tags. Don't post images of code, tracebacks, etc.

Now to the error: sys.argv[1] means you expect the user to pass something (the URL) when they run the script, e.g.
$ python new_email_graber.py www.mysite.com
sys.argv[0] is the script name.
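
Since the screenshot is not reproduced here, the following is only a sketch of the usual guard, on the assumption that the script failed because it was started without an argument. The helper name get_url is invented for the example.

```python
import sys


def get_url(argv):
    """Return the URL from the command line, or None if it is missing."""
    # argv[0] is the script name; argv[1] is the first user-supplied argument
    if len(argv) < 2:
        return None
    return argv[1]


if __name__ == "__main__":
    url = get_url(sys.argv)
    if url is None:
        # Explain what is expected instead of failing on sys.argv[1]
        print(f"usage: python {sys.argv[0]} <url>")
    else:
        print("would scrape:", url)
```

Checking len(sys.argv) first means that running the script with no argument prints a usage hint rather than raising an exception.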

Reply

