Thread: urlparse to urllib.parse - the script stopped working (Python Forum, https://python-forum.io, General Coding Help, /thread-5852.html)

urlparse to urllib.parse - the script stopped working - apollo - Oct-24-2017

Dear community,

The following code ran like a charm under Python 2.x:

    import urllib
    import urlparse
    import re

    url = "http://search.cpan.org/author/?W"
    html = urllib.urlopen(url).read()
    for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></
    a><br/><small>(.*?)</small>', html):
        alk = urlparse.urljoin(url, lk)

        data = { 'url':alk, 'name':name, 'cname':capname }

        phtml = urllib.urlopen(alk).read()
        memail = re.search('<a href="mailto:(.*?)">', phtml)
        if memail:
            data['email'] = memail.group(1)

        print data

After switching to urllib.parse I got back the following:

    IndentationError: Missing parentheses in call to 'print'
    >>>
    >>> import urllib
    >>> import urllib.parse
    >>> import re
    >>>
    >>> url = "http://search.cpan.org/author/?W"
    >>> html = urllib.urlopen(url).read()
    Traceback (innermost last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'urlopen'
    >>> for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></
      File "<stdin>", line 1
        for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></
                                                                              ^
    SyntaxError: EOL while scanning string literal
    >>> a><br/><small>(.*?)</small>', html):
      File "<stdin>", line 1
        a><br/><small>(.*?)</small>', html):
         ^
    SyntaxError: invalid syntax
    >>>     alk = urlparse.urljoin(url, lk)
      File "<stdin>", line 1
        alk = urlparse.urljoin(url, lk)
        ^
    IndentationError: unexpected indent
    >>>
    >>>     data = { 'url':alk, 'name':name, 'cname':capname }
      File "<stdin>", line 1
        data = { 'url':alk, 'name':name, 'cname':capname }
        ^
    IndentationError: unexpected indent
    >>>
    >>>     phtml = urllib.urlopen(alk).read()
      File "<stdin>", line 1
        phtml = urllib.urlopen(alk).read()
        ^
    IndentationError: unexpected indent
    >>>     memail = re.search('<a href="mailto:(.*?)">', phtml)
      File "<stdin>", line 1
        memail = re.search('<a href="mailto:(.*?)">', phtml)
        ^
    IndentationError: unexpected indent
    >>>     if memail:
      File "<stdin>", line 1
        if memail:
        ^
    IndentationError: unexpected indent
    >>>         data['email'] = memail.group(1)
      File "<stdin>", line 1
        data['email'] = memail.group(1)
        ^
    IndentationError: unexpected indent
    >>>
    >>>     print data
      File "<stdin>", line 1
        print data
        ^
    IndentationError: Missing parentheses in call to 'print'
    >>>

Okay, first of all I have to install the urllib.parse module, but I guess there are some other errors waiting at the fence ...

RE: urlparse to urllib.parse - the script stopped working - wavic - Oct-24-2017

In Python 3, print is a function, not a statement, so on line 18 you have to enclose data in parentheses:

    print(data)

RE: urlparse to urllib.parse - the script stopped working - hbknjr - Oct-25-2017

    >>> import urllib
    >>> import urllib.parse
    >>> import re
    >>>
    >>> url = "http://search.cpan.org/author/?W"
    >>> html = urllib.urlopen(url).read()
    Traceback (innermost last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'urlopen'

In Python 3, urlopen is in the urllib.request module (on line 6):

    html = urllib.request.urlopen(url).read()

RE: urlparse to urllib.parse - the script stopped working - apollo - Oct-25-2017

Hello both,

many thanks. I got the following results:

    >>> import urllib
     ^
    SyntaxError: invalid syntax
    >>>
    >>> import urllib.parse
    >>>
    >>> import re
    >>>
    >>>
    >>>
    >>> url = "http://search.cpan.org/author/?W"
    >>>
    >>> html = urllib.urlopen(url).read()
    Traceback (innermost last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'urlopen'
    >>> Traceback (innermost last):
      File "<stdin>", line 1
        Traceback (innermost last):
                            ^
    SyntaxError: invalid syntax
    >>>   File "<stdin>", line 1, in <module>
      File "<stdin>", line 1
        File "<stdin>", line 1, in <module>
        ^
    IndentationError: unexpected indent
    >>> AttributeError: 'module' object has no attribute 'urlopen'
      File "<stdin>", line 1
        AttributeError: 'module' object has no attribute 'urlopen'
                               ^
    SyntaxError: invalid syntax
    >>>
    >>> >>> import urllib
      File "<stdin>", line 1
        >>> import urllib
         ^
    SyntaxError: invalid syntax
    >>>
    >>> import urllib.parse
    >>>
    >>> import re
    >>>
    >>>
    >>>
    >>> url = "http://search.cpan.org/author/?W"
    >>>
    >>> html = urllib.urlopen(url).read()
    Traceback (innermost last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'urlopen'
    >>> Traceback (innermost last):
      File "<stdin>", line 1
        Traceback (innermost last):
                            ^
    SyntaxError: invalid syntax
    >>>   File "<stdin>", line 1, in <module>
      File "<stdin>", line 1
        File "<stdin>", line 1, in <module>
        ^
    IndentationError: unexpected indent
    >>> AttributeError: 'module' object has no attribute 'urlopen'

RE: urlparse to urllib.parse - the script stopped working - hbknjr - Oct-25-2017

Read my previous comment. You're using urllib.urlopen(), but in Python 3 it's urllib.request.urlopen. So the correct code would look like:

    >>> import urllib.request
    >>> url = "http://search.cpan.org/author/?W"
    >>> html = urllib.request.urlopen(url).read()

Secondly, do not copy and paste the whole code into the interpreter at once: you'll lose indentation and get errors. Copy one line at a time or run it from a .py file.

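Putting the replies above together, a Python 3 version of the thread-start script might look like the following sketch. It applies the three changes named so far (urllib.request.urlopen, urllib.parse.urljoin, print as a function) plus one more that Python 3 requires: read() returns bytes, so the page is decoded before a str regex is run over it. The helper names (fetch, parse_authors, find_email, main) and the UTF-8 decoding are assumptions for illustration, not part of the original post, and the CPAN page may no longer serve this markup.

```python
import re
import urllib.parse
import urllib.request

# URL taken from the original post; the site layout may have changed since.
START_URL = "http://search.cpan.org/author/?W"

# Same patterns as in the original script, with the regex kept on one line.
AUTHOR_RE = re.compile(r'<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>')
MAILTO_RE = re.compile(r'<a href="mailto:(.*?)">')


def fetch(url):
    """Download a page and return it as str (read() yields bytes in Python 3)."""
    with urllib.request.urlopen(url) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


def parse_authors(base_url, html):
    """Yield one dict per author link found in the listing page."""
    for link, capname, name in AUTHOR_RE.findall(html):
        yield {
            "url": urllib.parse.urljoin(base_url, link),
            "name": name,
            "cname": capname,
        }


def find_email(html):
    """Return the first mailto: address in the page, or None."""
    match = MAILTO_RE.search(html)
    return match.group(1) if match else None


def main():
    """Scrape the listing, then each author page, and print the results."""
    for data in parse_authors(START_URL, fetch(START_URL)):
        email = find_email(fetch(data["url"]))
        if email:
            data["email"] = email
        print(data)
```

Saved to a .py file, the scrape runs by calling main(). Keeping the parsing in functions that take a string means they can be tried on a saved HTML snippet without touching the network.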
RE: urlparse to urllib.parse - the script stopped working - apollo - Oct-26-2017

Hello all,

many thanks for the hints, very supportive. With the example above I want to dive into real-world programming topics:

1. parsing
2. storing (in a database)

With the following fix of the thread-start posting I had luck in the Python 2.x environment. Note: since my Linux box has Python 3.4.x installed, I needed a quick test on a 2.x testbed. I found one here: https://www.tutorialspoint.com/execute_python_online.php

    import urllib
    import urlparse
    import re

    url = "http://search.cpan.org/author/?W"
    html = urllib.urlopen(url).read()
    for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
        alk = urlparse.urljoin(url, lk)

        data = { 'url':alk, 'name':name, 'cname':capname }

        phtml = urllib.urlopen(alk).read()
        memail = re.search('<a href="mailto:(.*?)">', phtml)
        if memail:
            data['email'] = memail.group(1)

        print data

The result looks like the following:

    {'url': 'http://search.cpan.org/~wizeazz/', 'cname': 'WIZEAZZ', 'name': 'P. Verbaarschott', 'email': 'razor_mail%40yahoo.com'}
    {'url': 'http://search.cpan.org/~wjblack/', 'cname': 'WJBLACK', 'name': 'William J. Black', 'email': 'bj%40wjblack.com'}

As I said above, I want to store the results in a database, using peewee as the database abstraction layer. By the way, this is another question (it has nothing to do with parsing the retrieved data). I need to do this at the weekend, and I guess I should do it with the following approach:

    from peewee import *
    import json

    db = MySQLDatabase('mydb', user='john', passwd='mypass')

    class User(Model):
        name = TextField()
        name2 = TextField()
        email_address = TextField()
        url = TextField()

        class Meta:
            database = db  # this model uses the mydb database

    User.create_table()  # ensure table is created

    # json.load() needs an open file object; 'data.json' is only a
    # placeholder name: your JSON data file here
    with open('data.json') as f:
        data = json.load(f)

    for entry in data:  # assuming your data is an array of JSON objects
        user = User.create(name=entry["name"], name2=entry["name2"],
                           email_address=entry["email-adress"], url=entry["url"])
        user.save()

Again, many thanks for your continued help.

Greetings,
apollo ;)
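
One gap between the two halves of this post: the scraper produces dictionaries with the keys url, name, cname and email, while the peewee draft reads name, name2, email-adress and url, and the scraped e-mail values are percent-encoded (%40 in place of @). A minimal standard-library sketch of how the two could be bridged is shown below. The mapping of cname to name2, the helper names and the JSON file layout are assumptions for illustration; the thread does not specify them.

```python
import json
from urllib.parse import unquote


def to_user_row(record):
    """Map one scraped dict onto the field names of the draft User model.

    Assumption: 'cname' is stored in 'name2'; the original post does not say.
    """
    return {
        "name": record["name"],
        "name2": record["cname"],
        # The scraped mailto: value is percent-encoded ('%40' stands for '@').
        "email_address": unquote(record.get("email", "")),
        "url": record["url"],
    }


def dump_rows(records, path):
    """Write the mapped rows to a JSON file as an array of objects."""
    rows = [to_user_row(record) for record in records]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, indent=2)
    return rows


def load_rows(path):
    """Read the rows back; json.load() takes an open file object."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Because the keys of each row match the model's field names, a row loaded this way could be passed to the draft model as User.create(**row).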