Python Forum
html parser
#1
Hello - I'm working on the book "Web Scraping with Python" by Ryan Mitchell (2015).

I finally just decided to pick one and jump in, so here I am. I've got the basics (I think). There is still much to learn, I'm sure. Here's my current issue...

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsOb = BeautifulSoup(html.read())
print(bsObj.h1)

This is the error I get...

Warning (from warnings module):
File "C:\Users\Admin\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 181
markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:

BeautifulSoup(YOUR_MARKUP})

to this:

BeautifulSoup(YOUR_MARKUP, "html.parser")

Traceback (most recent call last):
File "C:\Python\Web Scraping pg 8.py", line 5, in <module>
print(bsObj.h1)
NameError: name 'bsObj' is not defined


from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsOb = BeautifulSoup(html.read, html.parser)
print(bsObj.h1)

This is the error I get...

Traceback (most recent call last):
File "C:\Python\Web_Scraping_pg_8.py", line 4, in <module>
bsOb = BeautifulSoup(html.read, html.parser)
AttributeError: 'HTTPResponse' object has no attribute 'parser'


I think the reason I'm having the issue is because of the age of the book. Any help I can get would be most appreciated!

Thank you!!
#2
(Mar-16-2018, 06:13 PM)tjnichols Wrote: I think the reason I'm having the issue is because of the age of the book. Any help I can get would be most appreciated!
Have a look at the more up-to-date Web-Scraping part-1 tutorial.
#3
(Mar-16-2018, 06:13 PM)tjnichols Wrote: bsOb = BeautifulSoup(html.read())
print(bsObj.h1)

You define a variable named bsOb, but try to use one named bsObj. Those are not the same thing.

(Mar-16-2018, 06:13 PM)tjnichols Wrote: BeautifulSoup(YOUR_MARKUP, "html.parser")
#snip
bsOb = BeautifulSoup(html.read, html.parser)
The message is very literal. html.parser (without quotes) isn't a thing that exists anywhere, whereas the string "html.parser" always exists, because a string literal is just a string.
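
Putting both fixes together, a minimal corrected sketch of that first script (one consistent variable name, html.read() called with parentheses, and the parser passed as the string "html.parser") would be something like:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page, then parse it with the built-in "html.parser"
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)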
#4
Note the quotes around "html.parser".
#5
you should also use the requests module to get the html
#6
nilamo Wrote:
(Mar-16-2018, 06:13 PM)tjnichols Wrote: bsOb = BeautifulSoup(html.read())
print(bsObj.h1)

You define a variable named bsOb, but try to use one named bsObj. Those are not the same thing.

I don't understand why these aren't the same - is it simply because I typed it wrong? Or are they two separate things that do two separate things?

(Mar-16-2018, 06:13 PM)tjnichols Wrote: BeautifulSoup(YOUR_MARKUP, "html.parser")
#snip
bsOb = BeautifulSoup(html.read, html.parser)
The message is very literal. html.parser (without quotes) isn't a thing that exists anywhere, whereas the string "html.parser" always exists, because a string literal is just a string.

Is this something I should always use? If so, why? What does it do for me?

I truly appreciate your time and patience with me and your ability to break things down so I can understand them! Thank you!

(Mar-16-2018, 07:38 PM)metulburr Wrote: you should also use the requests module to get the html
Can you tell me how using the requests module helps me get the HTML? I appreciate your help!
#7
(Mar-17-2018, 05:08 PM)tjnichols Wrote: Can you tell me how using the requests module helps me get the HTML? I appreciate your help!
Because it's better and easier to use than urllib in every way, e.g. you get the correct encoding back and the security handling is up to date.
Here is your script using Requests; if you look at the link I gave, you'll see Requests used together with BeautifulSoup and lxml.
import requests
from bs4 import BeautifulSoup

url = 'http://www.pythonscraping.com/pages/page1.html'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)
print(soup.find('h1').text)
Output:
A Useful Page
An Interesting Title
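
As a side note on the encoding and status points, Requests exposes both directly on the response object; a small sketch (just the standard Requests attributes, nothing specific to this thread):

import requests

url = 'http://www.pythonscraping.com/pages/page1.html'
resp = requests.get(url)
resp.raise_for_status()                   # raise an HTTPError for 4xx/5xx responses
print(resp.status_code)                   # HTTP status code, e.g. 200
print(resp.encoding)                      # encoding Requests detected from the response headers
print(resp.headers.get('Content-Type'))   # the raw Content-Type header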
#8
(Mar-17-2018, 05:55 PM)snippsat Wrote: Because it's better and easier to use than urllib in every way, e.g. you get the correct encoding back and the security handling is up to date.

Thank you! That makes sense! Let me try it!
#9
(Mar-17-2018, 06:28 PM)tjnichols Wrote: Thank you! That makes sense! Let me try it!

Hey snippsat - I tried your code. Here is what I got...

import request
from bs4 import BeautifulSoup
url = 'http://www.pythonscraping.com/pages/page1.html'
url_get = requests.get(url) soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)
print(soup.find('h1').text)

The error...
SyntaxError: multiple statements found while compiling a single statement

I understand things may be different, as in more secure, etc.; what I need to understand is why I'm having these issues with what I've done.

I appreciate your help and I would like to understand your way of doing things like you've shown above. Can you give me a link on the "import requests" so I can read up on that? Also, can you point me to where I can find more information on the "urllib" you talked about?

Thank you!

Tonya
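
Comparing the retyped snippet above with snippsat's original, two small differences stand out: the requests.get(...) and BeautifulSoup(...) calls ended up on one line, which by itself is enough to cause a SyntaxError, and "import request" is missing the trailing "s", which would fail next. A corrected sketch, assuming those were the only copy errors:

import requests
from bs4 import BeautifulSoup

url = 'http://www.pythonscraping.com/pages/page1.html'
# One statement per line; the module is named "requests", not "request"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)
print(soup.find('h1').text)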
#10
My apologies snippsat! I should have looked at the links sooner. I appreciate your help!

Thank you!