Update 1-4-2018
- All tested Python 3.6.4
- Link to part-2(also updated with new stuff)
- All code can be copied to run
- Added lxml example
Libraries used: Requests, lxml, BeautifulSoup.
pip install beautifulsoup4 requests lxml
These are better maintained and more up to date than what's available in the standard library.
Let's jump straight into it and break it down later.
```python
import requests
from bs4 import BeautifulSoup

url = 'http://CNN.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.find('title').text)
```
Output: CNN - Breaking News, U.S., World, Weather, Entertainment & Video News
The code gets the source code from CNN, finds the <title> tag, and returns its text. It uses html.parser, which is in the standard library. With BeautifulSoup you can plug in a faster parser like lxml (it needs to be installed). Change line 6 to:

```python
soup = BeautifulSoup(response.content, 'lxml')
```
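As a quick sketch of the difference, the same document can be parsed with either backend and the BeautifulSoup API stays the same (this assumes lxml is installed):

```python
from bs4 import BeautifulSoup

# A small document parsed with both backends; only the parser
# (and its speed/leniency) differs, the BeautifulSoup calls are identical.
html = '<html><head><title>Demo</title></head><body></body></html>'

for parser in ('html.parser', 'lxml'):
    soup = BeautifulSoup(html, parser)
    print(parser, '->', soup.find('title').text)
```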
Using BeautifulSoup's new CSS selectors
```python
from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.select('head > title')[0].text)
```
Output: Welcome to Python.org
lxml
Using lxml alone, which supports XPath syntax.
```python
from lxml import html
import requests

url = 'https://www.python.org/'
response = requests.get(url)
tree = html.fromstring(response.content)
lxml_soup = tree.xpath('/html/head/title/text()')[0]
print(lxml_soup)
```
Output: Welcome to Python.org
Using lxml's CSS selectors
pip install cssselect
```python
from lxml import html
import requests

url = 'https://www.python.org/'
response = requests.get(url)
tree = html.fromstring(response.content)
lxml_soup = tree.cssselect('head > title')
print(lxml_soup[0].text)
```
Output: Welcome to Python.org
Breaking it down into smaller parts where we can see the HTML.
find() finds the first <title> tag; find_all() finds all <title> tags and returns a list.

```python
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <head>
    <title>My Site</title>
  </head>
  <body>
    <title>First chapter</title>
    <p>Page1</p>
    <p>Page2</p>
  </body>
</html>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('title'))
print(soup.find_all('title'))
```
Output:
<title>My Site</title>
[<title>My Site</title>, <title>First chapter</title>]
Using .text to get clean text out of a tag.

```python
print(soup.find('title').text)
print(soup.find_all('title')[1].text)
```
Output:
My Site
First chapter
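Related to .text, BeautifulSoup's get_text() collects the text of a tag and all its descendants; the separator and strip parameters are handy for cleanup. A small sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<body><p> Page1 </p><p> Page2 </p></body>'
soup = BeautifulSoup(html, 'html.parser')

# get_text() joins the text of all descendants; strip=True trims each
# piece and drops whitespace-only ones.
print(soup.get_text(separator=', ', strip=True))  # Page1, Page2
```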
Example of getting info from links.
```python
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<body>
  <div id='images'>
    <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
    <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
    <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
  </div>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')
print([link.get('href') for link in soup.find_all('a')])
```

Here we are getting the href attribute inside the <a> tags.

Output: ['image1.html', 'image2.html', 'image3.html']
Here we are getting the src attribute inside the <img> tags.

```python
print([link.get('src') for link in soup.find_all('img')])
```
Output: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg']
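The two lookups can also be combined, pairing each link's visible text with its href. A sketch on the same kind of made-up snippet:

```python
from bs4 import BeautifulSoup

html = '''\
<div id='images'>
  <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
  <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Pair each link's text with its href attribute.
pairs = [(a.text.strip(), a.get('href')) for a in soup.find_all('a')]
print(pairs)
```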
Using CSS selectors (new in BeautifulSoup 4)
```python
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<body>
  <div id='images'>
    <a href='image1.html'>My image 1 <br /><img src='image1_thumb.jpg' /></a>
  </div>
  <div>
    <p class="car">
      <a class="bmw" href="Link to bmw"></a>
      <a class="opel" href="Link to opel"></a>
    </p>
    <p class="end">
      all cars are great
    </p>
  </div>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')

# All <a> tags that are inside a <p>
print(soup.select("p > a"))
```
Output: [<a class="bmw" href="Link to bmw"></a>, <a class="opel" href="Link to opel"></a>]
---

```python
# Select href attributes that begin with "image"
print(soup.select('a[href^="image"]'))
```
Output: [<a href="image1.html">My image 1 <br/><img src="image1_thumb.jpg"/></a>]
---

```python
# Select by id
print(soup.select('#images > a > img')[0].get('src'))
```
Output: image1_thumb.jpg
---

```python
# Select by class
print(soup.select('.end')[0].text.strip())
```
Output: all cars are great
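Besides the begins-with ^= form used above, attribute selectors also support ends-with $= and contains *=. A sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '''\
<div>
  <a href="image1.html">one</a>
  <a href="photo.png">two</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.select('a[href^="image"]')[0].text)  # begins with "image"
print(soup.select('a[href$=".png"]')[0].text)   # ends with ".png"
print(soup.select('a[href*="oto"]')[0].text)    # contains "oto"
```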
Using lxml XPath
```python
from lxml import etree

# Simulate a web page
html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <div id="hero">Superman</div>
    <div class="superfoo">
      <p>Hulk</p>
    </div>
  </body>
</html>'''

tree = etree.fromstring(html)
lxml_soup = tree.xpath("//div[@class='superfoo']")
print(etree.tostring(lxml_soup[0], encoding='unicode', pretty_print=True).strip())
```
Output:
<div class="superfoo">
  <p>Hulk</p>
</div>
Getting text:

```python
print(lxml_soup[0].find('p').text)
```
Output: Hulk
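Alternatively, ending the XPath in text() returns the text nodes directly, so the extra .find('p').text step isn't needed:

```python
from lxml import etree

html = '''\
<html>
  <body>
    <div class="superfoo">
      <p>Hulk</p>
    </div>
  </body>
</html>'''

tree = etree.fromstring(html)

# text() selects the text nodes themselves, so xpath() returns strings.
print(tree.xpath("//div[@class='superfoo']/p/text()"))  # ['Hulk']
```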
Tools for inspecting websites
Chrome DevTools and Firefox Developer Tools are good tools for inspecting all aspects of a website.
Use the Inspect and Console panels in a browser, e.g. Chrome.

Console: execute $x("some_xpath") or $$("css-selectors") in the Console panel, which will both evaluate and validate.

Inspect: press Ctrl + F to enable DOM searching in the panel, then type in an XPath or CSS selector to evaluate it. If there are matched elements, they will be highlighted in the DOM.
XPath, CSS, DOM and Selenium cheat sheet
5 Best XPath Cheat Sheets and Quick References
I may add some to part-1.