Python Forum

Full Version: urlparse to urllib.parse - the script stopped working
Dear community,

The following code ran like a charm under Python 2.x:


import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></
a><br/><small>(.*?)</small>', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'name':name, 'cname':capname }

    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data
Under Python 3, I got back the following:


    
    IndentationError: Missing parentheses in call to 'print'
>>> 
>>> import urllib
>>> import urllib.parse
>>> import re
>>> 
>>> url = "http://search.cpan.org/author/?W"
>>> html = urllib.urlopen(url).read()
Traceback (innermost last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'urlopen'
>>> for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></
  File "<stdin>", line 1
    for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></
                                                                         ^
SyntaxError: EOL while scanning string literal
>>> a><br/><small>(.*?)</small>', html):
  File "<stdin>", line 1
    a><br/><small>(.*?)</small>', html):
      ^
SyntaxError: invalid syntax
>>>     alk = urlparse.urljoin(url, lk)
  File "<stdin>", line 1
    alk = urlparse.urljoin(url, lk)
    ^
IndentationError: unexpected indent
>>> 
>>>     data = { 'url':alk, 'name':name, 'cname':capname }
  File "<stdin>", line 1
    data = { 'url':alk, 'name':name, 'cname':capname }
    ^
IndentationError: unexpected indent
>>> 
>>>     phtml = urllib.urlopen(alk).read()
  File "<stdin>", line 1
    phtml = urllib.urlopen(alk).read()
    ^
IndentationError: unexpected indent
>>>     memail = re.search('<a href="mailto:(.*?)">', phtml)
  File "<stdin>", line 1
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    ^
IndentationError: unexpected indent
>>>     if memail:
  File "<stdin>", line 1
    if memail:
    ^
IndentationError: unexpected indent
>>>         data['email'] = memail.group(1)
  File "<stdin>", line 1
    data['email'] = memail.group(1)
    ^
IndentationError: unexpected indent
>>> 
>>>     print data
  File "<stdin>", line 1
    print data
    ^
IndentationError: Missing parentheses in call to 'print'
>>> 
Okay, first of all I apparently have to install the urllib.parse module,
but I guess there are some other errors still waiting for me ...
In Python 3, print is a function rather than a statement, so on line 18 you have to enclose data in parentheses: print(data)
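As a small illustration of that change, here is a dictionary printed the Python 3 way. The values are placeholders for this sketch, not real scraper output.

```python
# Python 3: print is an ordinary function, so its arguments go in parentheses.
data = {'url': 'http://search.cpan.org/~example/', 'name': 'Example Author'}  # placeholder values

print(data)

# Because print is a function, it also accepts keyword arguments such as sep.
print(data['name'], data['url'], sep=' -> ')
```

If the same file still has to run under Python 2.6 or 2.7, putting from __future__ import print_function at the very top makes the function form work there as well.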
>>> import urllib
>>> import urllib.parse
>>> import re
>>> 
>>> url = "http://search.cpan.org/author/?W"
>>> html = urllib.urlopen(url).read()
Traceback (innermost last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'urlopen'
In Python 3, urlopen is in the urllib.request module (this affects line 6):

html = urllib.request.urlopen(url).read()
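The urlparse import needs the same treatment. In Python 3 its functions live in urllib.parse, which is part of the standard library, so nothing has to be installed. A minimal sketch of the urljoin call from the script, with a sample relative link standing in for what the regex captures:

```python
# urllib.parse ships with Python 3; no installation is needed.
from urllib.parse import urljoin

url = "http://search.cpan.org/author/?W"
lk = "/~wizeazz/"  # sample relative link of the kind the regex captures

# Python 2 spelling: urlparse.urljoin(url, lk)
alk = urljoin(url, lk)
print(alk)  # http://search.cpan.org/~wizeazz/
```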
Hello to both of you,

many thanks. I got the following results:

>>> >>> import urllib
  File "<stdin>", line 1
    >>> import urllib
     ^
SyntaxError: invalid syntax
>>> >>> import urllib.parse
>>> >>> import re
>>> >>> 
>>> >>> url = "http://search.cpan.org/author/?W"
>>> >>> html = urllib.urlopen(url).read()
Traceback (innermost last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'urlopen'
>>> Traceback (innermost last):
  File "<stdin>", line 1
    Traceback (innermost last):
                            ^
SyntaxError: invalid syntax
>>>   File "<stdin>", line 1, in <module>
  File "<stdin>", line 1
    File "<stdin>", line 1, in <module>
    ^
IndentationError: unexpected indent
>>> AttributeError: 'module' object has no attribute 'urlopen'
  File "<stdin>", line 1
    AttributeError: 'module' object has no attribute 'urlopen'
                  ^
SyntaxError: invalid syntax
>>> 
Read my previous comment. You're using urllib.urlopen(), but in Python 3 it's urllib.request.urlopen().

So the correct code would look like this:
>>> import urllib.request
>>> url = "http://search.cpan.org/author/?W"
>>> html = urllib.request.urlopen(url).read()
Secondly, do not copy and paste the whole script into the interpreter at once: you'll lose the indentation and get errors. Paste only the code (not the >>> prompts or the traceback output), one line at a time, or save it as a .py file and run that.
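Putting the replies together, a Python 3 port of the original script could look roughly like the sketch below. The function names fetch, parse_authors, parse_email and main are made up for this example, and decoding as UTF-8 is an assumption about the site's encoding. One change is needed beyond the renamed modules: urlopen(...).read() returns bytes in Python 3, so the HTML has to be decoded before a str regex can be applied to it.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

AUTHOR_RE = re.compile(r'<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>')
EMAIL_RE = re.compile(r'<a href="mailto:(.*?)">')


def fetch(url):
    """Download a page and return it as text (hypothetical helper, assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_authors(base_url, html):
    """Return one dict per author link found in the listing page."""
    return [
        {"url": urljoin(base_url, lk), "name": name, "cname": capname}
        for lk, capname, name in AUTHOR_RE.findall(html)
    ]


def parse_email(html):
    """Return the first mailto address in the page, or None if there is none."""
    match = EMAIL_RE.search(html)
    return match.group(1) if match else None


def main():
    """Run the scraper against the live site (needs network access)."""
    url = "http://search.cpan.org/author/?W"
    for data in parse_authors(url, fetch(url)):
        email = parse_email(fetch(data["url"]))
        if email:
            data["email"] = email
        print(data)


# Offline check with a hand-made snippet shaped like the markup the regex expects
# (an assumption for illustration, not a capture of the real page).
sample = '<a href="/~wizeazz/"><b>WIZEAZZ</b></a><br/><small>P. Verbaarschott</small>'
authors = parse_authors("http://search.cpan.org/author/?W", sample)
print(authors)
```

Saved as a .py file, this only runs the offline check. Adding a call to main() at the bottom would run it against the site itself; that part is not exercised here.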
Hello dear all,

many thanks for the hints, they were very helpful. With the example above I want to dive into two real-world programming topics:


1. parsing
2. storing (in a database)

With the following fix of the code from my first post I had luck in a Python 2.x environment.

Note: since Python 3.4.x is installed on my Linux box, I needed a quick Python 2.x testbed and found one here: https://www.tutorialspoint.com/execute_p...online.php

import urllib
import urlparse
import re
 
url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
    alk = urlparse.urljoin(url, lk)
 
    data = { 'url':alk, 'name':name, 'cname':capname }
 
    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:data['email'] = memail.group(1)
 
    print data
The result looks like this:

{'url': 'http://search.cpan.org/~wizeazz/', 'cname': 'WIZEAZZ', 'name': 'P. Verbaarschott', 'email': 'razor_mail%40yahoo.com'}
{'url': 'http://search.cpan.org/~wjblack/', 'cname': 'WJBLACK', 'name': 'William J. Black', 'email': 'bj%40wjblack.com'}
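One detail in that output: the addresses come back percent-encoded, with %40 standing in for @. If the plain address is wanted before storing it, the standard library can decode it. A small sketch using the Python 3 module name (in Python 2 the same function is urllib.unquote):

```python
from urllib.parse import unquote

raw = 'bj%40wjblack.com'  # value taken from the output above
email = unquote(raw)
print(email)  # bj@wjblack.com
```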
As I said above, I want to store the results in a database, using peewee as the database abstraction layer.

By the way, this is a separate question that has nothing to do with parsing the retrieved data.
I need to do this at the weekend, and I guess I should use the following approach:


import json

from peewee import Model, MySQLDatabase, TextField

db = MySQLDatabase('mydb', user='john', passwd='mypass')

class User(Model):
    name = TextField()
    name2 = TextField()
    email_address = TextField()
    url = TextField()

    class Meta:
        database = db  # this model uses the mydb database

User.create_table()  # ensure the table is created

# json.load() needs an open file; 'data.json' is a placeholder for your JSON data file
with open('data.json') as f:
    data = json.load(f)

for entry in data:  # assuming your data is an array of JSON objects
    # The keys must match those in your JSON file.
    # create() already inserts the row, so a separate save() is not needed.
    User.create(name=entry["name"], name2=entry["name2"],
                email_address=entry["email-adress"], url=entry["url"])
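Before pointing this at MySQL, the storage step can be tried without a server or extra packages. The sketch below is not peewee: it swaps in the standard library's sqlite3 module with an in-memory database, purely to show how the scraped dictionaries (keys url, name, cname and an optional email) could map onto the four columns. The table name and the second sample row are assumptions made for this example.

```python
import sqlite3

# Rows shaped like the scraper output; the second one is a made-up entry without an email.
rows = [
    {'url': 'http://search.cpan.org/~wizeazz/', 'cname': 'WIZEAZZ',
     'name': 'P. Verbaarschott', 'email': 'razor_mail%40yahoo.com'},
    {'url': 'http://search.cpan.org/~example/', 'cname': 'EXAMPLE',
     'name': 'Example Author'},
]

conn = sqlite3.connect(':memory:')  # throwaway database, gone when the script exits
conn.execute(
    'CREATE TABLE user (name TEXT, name2 TEXT, email_address TEXT, url TEXT)'
)

for entry in rows:
    # Named placeholders let sqlite3 handle the quoting; .get() yields None (NULL)
    # when the author page had no mailto link.
    conn.execute(
        'INSERT INTO user (name, name2, email_address, url) '
        'VALUES (:name, :cname, :email, :url)',
        {'name': entry['name'], 'cname': entry['cname'],
         'email': entry.get('email'), 'url': entry['url']},
    )
conn.commit()

stored = conn.execute('SELECT name2, email_address FROM user ORDER BY name2').fetchall()
print(stored)
```

The same mapping (cname to name2, email to email_address) would carry over to the User.create() call in the peewee version.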
Again, many thanks for your continued help.

greetings apollo ;)