Web-Scraping part-1 - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: General (https://python-forum.io/forum-1.html)
+--- Forum: Tutorials (https://python-forum.io/forum-4.html)
+---- Forum: Web Scraping (https://python-forum.io/forum-43.html)
+---- Thread: Web-Scraping part-1 (/thread-144.html)
Web-Scraping part-1 - snippsat - Sep-23-2016

Update 1-4-2018
Libraries used: Requests, lxml, BeautifulSoup.

```
pip install beautifulsoup4 requests lxml
```

These are better and more actively maintained than what's available in the standard library. Jump straight into it and break it down later.

```python
import requests
from bs4 import BeautifulSoup

url = 'http://CNN.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.find('title').text)
```

The code gets the source code from CNN, finds the `<title>` tag, and returns its text. Here it uses html.parser, which is in the standard library. With BeautifulSoup you can plug in a faster parser like lxml (needs to be installed). Change line 6 to:

```python
soup = BeautifulSoup(response.content, 'lxml')
```

Using BeautifulSoup's new CSS selectors:

```python
from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.select('head > title')[0].text)
```
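The parsing calls above work on any HTML string, not just a live response, so they can be tried without a network request. A minimal offline sketch (the HTML string here is made up for illustration):

```python
from bs4 import BeautifulSoup

# A small local HTML string stands in for a downloaded page, so the
# parsing step can be tried without touching the network.
html = "<html><head><title>My test page</title></head><body></body></html>"

soup = BeautifulSoup(html, "html.parser")
print(soup.find("title").text)              # My test page
print(soup.select("head > title")[0].text)  # same result via a CSS selector
```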
lxml

Using lxml alone, which can use XPath syntax.

```python
from lxml import html
import requests

url = 'https://www.python.org/'
response = requests.get(url)
tree = html.fromstring(response.content)
lxml_soup = tree.xpath('/html/head/title/text()')[0]
print(lxml_soup)
```
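`tree.xpath()` always returns a list, which is why the `[0]` index is needed. A small offline sketch (made-up HTML) showing that `/text()` returns strings directly, while a plain element path returns elements whose `.text` must be read:

```python
from lxml import html

# Made-up local page; html.fromstring() accepts a plain string too
doc = html.fromstring("<html><head><title>Local title</title></head><body></body></html>")

texts = doc.xpath("/html/head/title/text()")  # list of strings
elems = doc.xpath("/html/head/title")         # list of elements
print(texts[0])       # Local title
print(elems[0].text)  # Local title
```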
Using lxml CSS selectors:

```
pip install cssselect
```

```python
from lxml import html
import requests

url = 'https://www.python.org/'
response = requests.get(url)
tree = html.fromstring(response.content)
lxml_soup = tree.cssselect('head > title')
print(lxml_soup[0].text)
```
Break it down in smaller parts where we can see the HTML.

find() finds the first `<title>` tag.
find_all() finds all `<title>` tags and returns a list.

```python
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <head>
    <title>My Site</title>
  </head>
  <body>
    <title>First chapter</title>
    <p>Page1</p>
    <p>Page2</p>
  </body>
</html>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('title'))
print(soup.find_all('title'))
```

Using .text to get clean text out of the tag:

```python
print(soup.find('title').text)
print(soup.find_all('title')[1].text)
```
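Since find_all() returns a list, a list comprehension is a common way to pull the text out of every match at once. A small self-contained sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<body><p>Page1</p><p>Page2</p></body>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so a comprehension collects every match's text
print([p.text for p in soup.find_all("p")])  # ['Page1', 'Page2']
```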
Example of getting info from links.

```python
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<body>
  <div id='images'>
    <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
    <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
    <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
  </div>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')
print([link.get('href') for link in soup.find_all('a')])
```

Here we are getting the href attribute inside the `<a>` tags. The same way, we can get the src attribute inside the `<img>` tags:

```python
print([link.get('src') for link in soup.find_all('img')])
```
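A tag also supports dictionary-style access (`tag['href']`), but .get() is safer when an attribute may be missing: it returns None instead of raising a KeyError. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<body><a href='image1.html'>one</a><a>no link here</a></body>"
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")
print(first["href"])       # image1.html  (dictionary-style access)
print(second.get("href"))  # None -- .get() instead of a KeyError
```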
Using CSS selectors; this is new in BeautifulSoup 4.

```python
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<body>
  <div id='images'>
    <a href='image1.html'>My image 1 <br /><img src='image1_thumb.jpg' /></a>
  </div>
  <div>
    <p class="car">
      <a class="bmw" href="Link to bmw"></a>
      <a class="opel" href="Link to opel"></a>
    </p>
    <p class="end"> all cars are great </p>
  </div>
'''

soup = BeautifulSoup(html, 'html.parser')

# All <a> tags that are inside a <p>
print(soup.select("p > a"))

# Select href attributes that begin with "image"
print(soup.select('a[href^="image"]'))

# Select by id
print(soup.select('#images > a > img')[0].get('src'))

# Select by class
print(soup.select('.end')[0].text.strip())
```
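When only the first match is wanted, newer BeautifulSoup versions also have select_one(), which returns a single tag (or None if nothing matches) instead of a list:

```python
from bs4 import BeautifulSoup

html = '<p class="end"> all cars are great </p>'
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match directly, no [0] indexing needed
print(soup.select_one(".end").text.strip())  # all cars are great
print(soup.select_one(".missing"))           # None when nothing matches
```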
Using lxml XPath:

```python
from lxml import etree

# Simulate a web page
html = '''\
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <div id="hero">Superman</div>
    <div class="superfoo">
      <p>Hulk</p>
    </div>
  </body>
</html>'''

tree = etree.fromstring(html)
lxml_soup = tree.xpath("//div[@class='superfoo']")
print(etree.tostring(lxml_soup[0], encoding='unicode', pretty_print=True).strip())
```

Getting the text:

```python
print(lxml_soup[0].find('p').text)
```
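The text can also be pulled out in the XPath expression itself with text(), skipping the extra find() step. The same simulated-page idea, rebuilt here so the snippet is self-contained:

```python
from lxml import etree

html = '''\
<html>
  <body>
    <div class="superfoo">
      <p>Hulk</p>
    </div>
  </body>
</html>'''

tree = etree.fromstring(html)
# text() in the expression returns the strings directly, no element step needed
print(tree.xpath("//div[@class='superfoo']/p/text()")[0])  # Hulk
```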
Tools for inspecting web sites

Chrome DevTools and Firefox Developer Tools are good tools for inspecting all aspects of a web site. Use Inspect and the Console in the browser, e.g. Chrome. Execute $x("some_xpath") or $$("css-selectors") in the Console panel; both will evaluate and validate. In the Inspect panel, press Ctrl + F to enable DOM searching, then type in XPath or CSS selectors to evaluate. If there are matched elements, they will be highlighted in the DOM.

XPath, CSS, DOM and Selenium cheat sheet
5 Best XPath Cheat Sheets and Quick References

I may add some to part-1.

RE: Web-Scraping part-1 - snippsat - Oct-30-2016

Bump, part-2 is up.

RE: Web-Scraping part-1 - snippsat - Jun-08-2017

Bump, part-1 is updated.