Web-Scraping part-1
#1
Update 1-4-2018
  • All tested with Python 3.6.4
  • Link to part-2 (also updated with new stuff)
  • All code can be copied to run
  • Added lxml example

Libraries used: Requests, lxml, BeautifulSoup.
pip install beautifulsoup4 requests lxml
These are better and more up to date than what's available in the standard library.

Jump straight into it and break it down later.
import requests
from bs4 import BeautifulSoup

url = 'http://CNN.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.find('title').text)
Output:
CNN - Breaking News, U.S., World, Weather, Entertainment & Video News
The code gets the source code from CNN, finds the <title> tag, and returns its text.
It uses html.parser, which is in the standard library.
With BeautifulSoup you can plug in a faster parser like lxml (needs to be installed).
Change line 6 to:
soup = BeautifulSoup(response.content, 'lxml')
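
It is also a good idea to check that the request succeeded before parsing. A minimal sketch of the same CNN example, with requests' built-in raise_for_status() added:
import requests
from bs4 import BeautifulSoup

url = 'http://CNN.com'
response = requests.get(url)
# Raises requests.HTTPError if the server answered with a 4xx/5xx status
response.raise_for_status()
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)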

Using BeautifulSoup's new CSS selectors
from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.select('head > title')[0].text)
Output:
Welcome to Python.org
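
BeautifulSoup also has select_one(), which returns the first match directly instead of a list. A small sketch of the same lookup:
from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# select_one() returns the first matching element (or None if no match)
print(soup.select_one('head > title').text)
Output:
Welcome to Python.org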

lxml
Using lxml alone, which supports XPath syntax.
from lxml import html
import requests

url = 'https://www.python.org/'
response = requests.get(url)
tree = html.fromstring(response.content)
lxml_soup = tree.xpath('/html/head/title/text()')[0]
print(lxml_soup)
Output:
Welcome to Python.org
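
The absolute path /html/head/title can be replaced with a relative one; // searches the whole tree instead of walking from the root. A small sketch:
from lxml import html
import requests

url = 'https://www.python.org/'
response = requests.get(url)
tree = html.fromstring(response.content)
# //title matches the <title> tag anywhere in the document
print(tree.xpath('//title/text()')[0])
Output:
Welcome to Python.org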

Using lxml CSS selector
pip install cssselect
from lxml import html
import requests

url = 'https://www.python.org/'
response = requests.get(url)
tree = html.fromstring(response.content)
lxml_soup = tree.cssselect('head > title')
print(lxml_soup[0].text)
Output:
Welcome to Python.org
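
Under the hood cssselect just translates CSS selectors to XPath. A small sketch showing the translation (the exact XPath string may vary between cssselect versions):
from lxml.cssselect import CSSSelector

sel = CSSSelector('head > title')
# A CSSSelector object exposes the original CSS and the XPath it became
print(sel.css)   # head > title
print(sel.path)  # e.g. descendant-or-self::head/title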

Break it down into smaller parts where we can see the HTML.

find() finds the first <title> tag.
find_all() finds all <title> tags and returns a list.
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <head>
     <title>My Site</title>
  </head>
  <body>
     <title>First chapter</title>
     <p>Page1</p>
     <p>Page2</p>
  </body>
</html>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('title'))
print(soup.find_all('title'))
Output:
<title>My Site</title>
[<title>My Site</title>, <title>First chapter</title>]
Using .text to get clean info out of a tag.
print(soup.find('title').text)
print(soup.find_all('title')[1].text)
Output:
My Site
First chapter
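
There is also get_text(), which gathers the text of a tag and everything inside it. A small self-contained sketch:
from bs4 import BeautifulSoup

html = '<p>Page1 <b>bold text</b></p>'
soup = BeautifulSoup(html, 'html.parser')
# get_text() returns the tag's own text plus the text of all child tags
print(soup.find('p').get_text())
Output:
Page1 bold text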

Example of getting info from links.
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<body>
  <div id='images'>
    <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
    <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
    <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
  </div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')
print([link.get('href') for link in soup.find_all('a')])
Here we are getting the href attribute inside the <a> tags.
Output:
['image1.html', 'image2.html', 'image3.html']
Here we are getting the src attribute inside the <img> tags.
print([link.get('src') for link in soup.find_all('img')])
Output:
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg']
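
To pair each link's text with its address, loop over the tags. A small sketch reusing the soup object from the example above:
# Pair the text of each <a> tag with its href attribute
for link in soup.find_all('a'):
    print(link.text.strip(), '-->', link.get('href'))
Output:
Name: My image 1 --> image1.html
Name: My image 2 --> image2.html
Name: My image 3 --> image3.html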


Using CSS selectors, which are new in BeautifulSoup 4
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<body>
 <div id='images'>
   <a href='image1.html'>My image 1 <br /><img src='image1_thumb.jpg' /></a>
 </div>
 <div>
   <p class="car">
     <a class="bmw" href="Link to bmw"></a>
     <a class="opel" href="Link to opel"></a>
   </p>
   <p class="end">
     all cars are great
   </p>
 </div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')
# All <a> tags that are inside a <p>
print(soup.select("p > a"))
Output:
[<a class="bmw" href="Link to bmw"></a>, <a class="ople" href="Link to opel"></a>]
---
# Select href attribute that begins with image
print(soup.select('a[href^="image"]')) 
Output:
[<a href="image1.html">Name: My image 1 <br/><img src="image1_thumb.jpg"/></a>]
---
# Select by id
print(soup.select('#images > a > img')[0].get('src')) 
Output:
image1_thumb.jpg
---
# Select by class
print(soup.select('.end')[0].text.strip()) 
Output:
all cars are great
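---
Selectors can also be combined, e.g. tag plus class. A small sketch reusing the soup object from above:
# Select <a> tags with class bmw inside a <p> with class car
print(soup.select('p.car > a.bmw'))
Output:
[<a class="bmw" href="Link to bmw"></a>]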

Using lxml XPath
from lxml import etree

# Simulate a web page
html = '''\
<html>
 <head>
   <title>foo</title>
 </head>
 <body>
   <div id="hero">Superman</div>
   <div class="superfoo">
     <p>Hulk</p>
   </div>
 </body>
</html>'''

tree = etree.fromstring(html)
lxml_soup = tree.xpath("//div[@class='superfoo']")
print(etree.tostring(lxml_soup[0], encoding='unicode', pretty_print=True).strip())
Output:
<div class="superfoo">      <p>Hulk</p> </div>
Getting text:
print(lxml_soup[0].find('p').text)
Output:
Hulk
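
lxml elements also have text_content(), which gathers all text under an element, children included. A small sketch reusing lxml_soup from above:
# text_content() returns the element's text plus all descendants' text
print(lxml_soup[0].text_content().strip())
Output:
Hulk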

Tools for inspecting web sites
Chrome DevTools and Firefox Developer Tools are good tools for inspecting all aspects of a website.

Use Inspect and the Console in the browser, e.g. Chrome.
Execute $x("some_xpath") or $$("css-selectors") in the Console panel; both will evaluate and validate the expression.

In the Inspect (Elements) panel, press Ctrl + F to enable DOM searching.
Type in XPath or CSS selectors to evaluate.
If there are matching elements, they will be highlighted in the DOM.
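For example, $x("//head/title") evaluates an XPath expression and returns the matched elements, while $$("head > title") does the same with a CSS selector.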

XPath, CSS, DOM and Selenium cheat sheet
5 Best XPath Cheat Sheets and Quick References
 
I may add more to part-1 later.
#2
Bump part-2 is up.
#3
Bump part-1 is updated.

