Web-scraping part-2
#1
In part 2 we do some practice and look at how to scrape pages that use JavaScript.
We start by doing some stuff with xkcd.
1. Scrape the title text and image link, then download the image
2. Loop over pages and get images
3. Speed it up a lot with concurrent.futures

1. Scrape the title text and image link, then download the image

As mentioned in part-1, it's fine to use the Chrome dev tools or FireBug to inspect the page.
[Image: inspecting the xkcd page for the title (#ctitle) and the comic image link]
So we have all the info needed, and can write this code.
Use the CSS selector select('#ctitle') for the title text and find() for the image link.
The image download is done with urlretrieve().
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import webbrowser
import os

url = 'http://xkcd.com/1/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
text = soup.select_one('#ctitle').text
link = soup.find('div', id='comic').find('img').get('src')
# The src link is protocol-relative (//imgs.xkcd.com/...); add the scheme
link = link.replace('//', 'http://', 1)

# Image title and link
print('{}\n{}'.format(text, link))

# Download image
img_name = os.path.basename(link)
urlretrieve(link, img_name)

# Open image in browser new tab
webbrowser.open_new_tab(link)
Output:
Barrel - Part 1 http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg

2. Loop over pages and get images

xkcd has a simple page structure: xkcd.com/1/, xkcd.com/2/ ... etc.
So we can loop over the pages and get the images; just set start and stop.
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import os

def image_down(start_img, stop_img):
    for numb in range(start_img, stop_img):
        url = 'http://xkcd.com/{}/'.format(numb)
        url_get = requests.get(url)
        soup = BeautifulSoup(url_get.content, 'html.parser')
        link = soup.find('div', id='comic').find('img').get('src')
        link = link.replace('//', 'http://', 1)
        img_name = os.path.basename(link)
        try:
            urlretrieve(link, img_name)
        except Exception:
            # Just want images don't care about errors
            pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 20
    image_down(start_img, stop_img)

3. Speed it up a lot with concurrent.futures 

concurrent.futures has a minimalistic API for threading and multiprocessing.
Only one word has to change to switch between ThreadPoolExecutor (threading) and ProcessPoolExecutor (multiprocessing).

Downloading 200 images (start_img=1, stop_img=200) takes ca. 1 minute 10 seconds with the code above.
We'll press that down to about 10 seconds for 200 images,
by making all the links and running 20 parallel tasks with ProcessPoolExecutor (multiprocessing).
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import concurrent.futures
import os

def image_down(url):
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    link = soup.find('div', id='comic').find('img').get('src')
    link = link.replace('//', 'http://', 1)
    img_name = os.path.basename(link)
    try:
        urlretrieve(link, img_name)
    except Exception:
        # Just want images don't care about errors
        pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)
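For comparison, here is the one-word switch mentioned above: the same main block with ThreadPoolExecutor (threading) instead of ProcessPoolExecutor.
# Drop-in replacement for the main block above; threads instead of processes
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for numb in range(start_img, stop_img):
        url = 'http://xkcd.com/{}/'.format(numb)
        executor.submit(image_down, url)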

666. God dammit JavaScript wall
JavaScript is used all over the web because of its unique position: it runs in the browser (client side).
This can make parsing more difficult,
because Requests/urllib cannot see what is executed/rendered by JavaScript.

There are several ways to overcome this; here's a demo of Selenium and PhantomJS.
This is a strong duo with a lot of power.
PhantomJS is used through Selenium when we don't want to load a browser window.

Example with How Secure Is My Password?
The site gives real-time feedback, so we'll enter the password 123hello through Selenium.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#browser = webdriver.PhantomJS()
browser = webdriver.Firefox()

url = 'https://howsecureismypassword.net/'
browser.get(url)
inputElement = browser.find_elements_by_class_name("password-input")[0]
inputElement.send_keys("123hello")
inputElement.send_keys(Keys.RETURN)
time.sleep(3)

# Give source code to BeautifulSoup before closing the browser
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()

# Crack time info
top_text = soup.select_one('.result__text.result__before')
crack_time = soup.select_one('.result__text.result__time')
bottom_text = soup.select_one('.result__text.result__after')
print(top_text.text)
print(crack_time.text)
print(bottom_text.text)
Output:
It would take a computer about 1 minute to crack your password
To not load a browser, only change:
browser = webdriver.PhantomJS()
#browser = webdriver.Firefox()
The PhantomJS executable needs to be in the same folder as the script,
or give it a path, e.g. PhantomJS(executable_path='C:/phantom/phantomjs').
#2
Are you the one that used lxml a lot? It would be nice to see a side-by-side comparison of scraping with BS and lxml.
#3
Quote: Are you the one that used lxml a lot?
I used it more on its own before; I still use it, but now mostly as a parser through BeautifulSoup(url_get.content, 'lxml').
That way BS gets the speed of the lxml parser.
I use BeautifulSoup(url_get.content, 'html.parser') in the tutorial, because then there's no need to install lxml.
#4
I was thinking more of the xpath method, something like an alternative to BeautifulSoup, side by side.

from lxml import etree
from bs4 import BeautifulSoup

html = '<html><head><title>foo</title></head><body><div class="other"></div><div class="name"><p>foo</p></div><div class="name"><ul><li>bar</li></ul></div></body></html>'

# lxml with an XPath expression
tree = etree.fromstring(html)
for elem in tree.xpath("//div[@class='name']"):
    print(etree.tostring(elem, pretty_print=True))

# BeautifulSoup equivalent
soup = BeautifulSoup(html, 'lxml')
for elem in soup.find_all('div', {'class': 'name'}):
    print(elem.prettify())

#5
(Jan-29-2017, 01:35 AM)metulburr Wrote: I was thinking more of xpath method, something like an alternative to BeautifulSoup side by side
Yeah, I can make a comparison; it's been a while since I used lxml.

So I want to use Python 3.6 and a fresh install of BS and lxml.
A virtual environment is the choice for that.
Here's the install (a sketch below):
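The install itself is just venv plus pip; a sketch, assuming Windows (the environment name scrape_env here is just an example):
# hypothetical commands; scrape_env is an example name
python -m venv scrape_env
scrape_env\Scripts\activate
pip install beautifulsoup4 lxml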

Use the BS class_ call instead of the dict call; BS always returns Unicode.
For lxml, use encoding='unicode' to get Unicode when using pretty print.
Note that getting the text (Hulk) is similar for both: soup.find('p').text. A sketch of both follows.
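A minimal sketch of both, assuming a small sample document with Hulk in a <p> tag (the original snippet isn't shown):
from bs4 import BeautifulSoup
from lxml import etree

# hypothetical sample document, just for illustration
html = '<html><body><div class="name"><p>Hulk</p></div></body></html>'

# BeautifulSoup: class_ keyword instead of the dict call; always returns Unicode
soup = BeautifulSoup(html, 'lxml')
print(soup.find('div', class_='name').prettify())
print(soup.find('p').text)  # Hulk

# lxml: encoding='unicode' makes tostring() return str (Unicode)
tree = etree.fromstring(html)
elem = tree.xpath("//div[@class='name']")[0]
print(etree.tostring(elem, pretty_print=True, encoding='unicode'))
print(elem.findtext('p'))  # Hulk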


CSS selector
Both BS and lxml (which also has XPath) support CSS selectors.
The CSS selector package needs to be installed for lxml: pip install cssselect-1.0.1-py2.py3-none-any.whl.

Here's a quick tutorial in this Pen.
See that I change the color on the text;
now we use the same method to scrape that content (sketch below).

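A minimal sketch with both, assuming the Pen's markup looks something like the sample below (the real Pen isn't reproduced here):
from bs4 import BeautifulSoup
from lxml import html as lxml_html

# hypothetical stand-in for the Pen's markup
html_doc = '<div id="content"><p class="red">Some colored text</p></div>'

# BeautifulSoup CSS selector
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select_one('p.red').text)

# lxml CSS selector (needs the cssselect package)
root = lxml_html.fromstring(html_doc)
print(root.cssselect('p.red')[0].text)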
#6
So here we'll look at getting source from the web with BeautifulSoup and lxml.
For both BS and lxml (which also has its own method) it's advisable to use Requests.
So I install Requests into my virtual environment:
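Just the usual pip command (assuming the same environment as above):
pip install requests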

We'll use python.org as the example.
We are getting the head title tag, which is <title>Welcome to Python.org</title>.
As mentioned before in part-1, use the Developer Tools in Chrome and Firefox (earlier FireBug) to navigate/inspect a web site.

So, using the methods above with XPath /html/head/title and the CSS selector head > title,
we get the head title tag (sketches below).

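A minimal sketch of all three approaches (the original snippets aren't shown; assumes python.org is reachable):
import requests
from bs4 import BeautifulSoup
from lxml import html as lxml_html

url_get = requests.get('https://www.python.org/')

# BeautifulSoup CSS selector
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup.select_one('head > title').text)

# lxml XPath
root = lxml_html.fromstring(url_get.content)
print(root.xpath('/html/head/title/text()')[0])

# lxml CSS selector (needs cssselect)
print(root.cssselect('head > title')[0].text)
Each prints: Welcome to Python.org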
