 Web-scraping part-2
#1
Update 1-4-2018
  • All code tested with Python 3.6.4
  • Added more Selenium stuff and headless mode setup
  • Added a final project which plays songs on SoundCloud

In part 2 we do some practice and look at how to scrape pages that use JavaScript.
Scrape and download:

Start by doing some stuff with xkcd.
[Image: xkcd comic]


Use the CSS selector select('#ctitle') for the comic title and find() for the image link.
import requests
from bs4 import BeautifulSoup
import webbrowser
import os

url = 'http://xkcd.com/1/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select_one('#ctitle').text
link = soup.find('div', id='comic').find('img').get('src')
link = link.replace('//', 'http://')

# Image title and link
print('{}\n{}'.format(text, link))

# Download image
img_name = os.path.basename(link)
img = requests.get(link)
with open(img_name, 'wb') as f_out:
    f_out.write(img.content)

# Open image in browser or default image viewer
webbrowser.open_new_tab(img_name)
Output:
Barrel - Part 1 http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg

Loop over pages and get images:

xkcd has a simple page structure: xkcd.com/1/, xkcd.com/2/, etc.
So we can loop over the pages and get the images; set start and stop.
import requests
from bs4 import BeautifulSoup
import os

def image_down(start_img, stop_img):
    for numb in range(start_img, stop_img):
        url = 'http://xkcd.com/{}/'.format(numb)
        url_get = requests.get(url)
        soup = BeautifulSoup(url_get.content, 'html.parser')
        link = soup.find('div', id='comic').find('img').get('src')
        link = link.replace('//', 'http://')
        img_name = os.path.basename(link)
        try:
            img = requests.get(link)
            with open(img_name, 'wb') as f_out:
                f_out.write(img.content)
        except Exception:
            # Just want the images, don't care about errors
            pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 20
    image_down(start_img, stop_img)

Speed it up a lot with concurrent.futures:

concurrent.futures has a minimalistic API for threading and multiprocessing.
Only one word needs to change to switch between ThreadPoolExecutor (threading) and ProcessPoolExecutor (multiprocessing).

So downloading 200 images (start_img=1, stop_img=200) takes about 1 minute 10 seconds with the code above.
We'll press that down to about 10 seconds for 200 images
by making all the links and running 20 parallel tasks with ProcessPoolExecutor (multiprocessing).
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import os

def image_down(url):
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    link = soup.find('div', id='comic').find('img').get('src')
    link = link.replace('//', 'http://')
    img_name = os.path.basename(link)
    try:
        img = requests.get(link)
        with open(img_name, 'wb') as f_out:
            f_out.write(img.content)
    except Exception:
        # Just want the images, don't care about errors
        pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)
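
To show the one-word switch mentioned above, here is a minimal sketch that reuses the image_down() function from the code above, only with ThreadPoolExecutor; for I/O-bound work like HTTP downloads, threads work fine too.
import concurrent.futures

if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    # Same pool as above, only ThreadPoolExecutor instead of ProcessPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)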

God dammit JavaScript, why do I not get all the content?

JavaScript is used all over the web because of its unique position: it runs in the browser (client side).
This can make parsing more difficult,
because Requests/bs4/lxml cannot see content that is executed/rendered by JavaScript.

There are ways to overcome this; we'll use Selenium.
Installation

An example with How Secure Is My Password?
This site gives real-time info using JavaScript; we'll enter the password 123hello with Selenium,
then give the source code to BeautifulSoup for parsing.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
'''
#-- FireFox
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(capabilities=caps)
'''

url = 'https://howsecureismypassword.net/'
browser.get(url)
inputElement = browser.find_elements_by_class_name("password-input")[0]
inputElement.send_keys("123hello")
inputElement.send_keys(Keys.RETURN)
time.sleep(5) #seconds

# Give source code to BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'html.parser')

# Get JavaScript info from site
top_text = soup.select_one('.result__text.result__before')
crack_time = soup.select_one('.result__text.result__time')
bottom_text  = soup.select_one('.result__text.result__after')
print(top_text.text)
print(crack_time.text)
print(bottom_text.text)
time.sleep(5) #seconds
browser.close()
Output:
It would take a computer about 1 minute to crack your password

Headless (not loading the browser):

Both Chrome and Firefox have now released headless mode in their newer drivers.
This means the browser does not start (visibly) as in the example above.
We'll look at a simple setup for both Chrome and Firefox.

Firefox:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

#--| Setup
options = Options()
options.set_headless(headless=True)
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(firefox_options=options, capabilities=caps, executable_path=r"path to geckodriver")
#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
Output:
# Python 3: Fibonacci series up to n
Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'path to chromedriver')
#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
Output:
# Python 3: Fibonacci series up to n

Final project:
[Image: SoundCloud charts]

Here we'll loop over the most played tracks on SoundCloud this week.
First we have to hover the mouse over the play button (ActionChains/hover), then click it.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

def play_song(how_many_songs, time_to_play):
    browser = webdriver.Chrome()
    url = 'https://soundcloud.com/charts/top?genre=all-music&country=all-countries'
    browser.get(url)
    time.sleep(3)
    for song_number in range(1, how_many_songs+1):
        play = browser.find_elements_by_xpath('//*[@id="content"]/div/div/div[1]/div[2]/div/div[3]/ul/li[{}]/div/div[2]/div[2]/a'.format(song_number))[0]
        hover = ActionChains(browser).move_to_element(play)
        hover.perform()
        play.click()
        time.sleep(time_to_play)
    browser.quit()

if __name__ == '__main__':
    how_many_songs = 5
    time_to_play = 15 # sec
    play_song(how_many_songs, time_to_play)
#2
Are you the one that used lxml a lot? It would be nice to see a side by side comparison of scraping with BS and lxml.
#3
Quote:Are you the one that used lxml a lot?
I used it more on its own before; I still use it, but now mostly as the parser through BeautifulSoup(url_get.content, 'lxml').
That way BS gets the speed of the lxml parser.
I use BeautifulSoup(url_get.content, 'html.parser') in the tutorial, because then there is no need to install lxml.
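
For instance, a minimal sketch of the same parse with both parsers (reusing the xkcd URL from the tutorial):
import requests
from bs4 import BeautifulSoup

url_get = requests.get('http://xkcd.com/1/')
# lxml parser: faster, but needs lxml installed
soup_lxml = BeautifulSoup(url_get.content, 'lxml')
# html.parser: slower, but in the standard library
soup_std = BeautifulSoup(url_get.content, 'html.parser')
print(soup_lxml.title.text == soup_std.title.text)  # True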
#4
I was thinking more of the XPath method, something like a side-by-side alternative to BeautifulSoup.

from lxml import etree
from bs4 import BeautifulSoup

html = '<html><head><title>foo</title></head><body><div class="other"></div><div class="name"><p>foo</p></div><div class="name"><ul><li>bar</li></ul></div></body></html>'

tree = etree.fromstring(html)
for elem in tree.xpath("//div[@class='name']"):
    print(etree.tostring(elem, pretty_print=True, encoding='unicode'))
     
soup = BeautifulSoup(html, 'lxml')
for elem in soup.find_all('div', {'class':'name'}):
    print(elem.prettify())

#5
(Jan-29-2017, 01:35 AM)metulburr Wrote: I was thinking more of xpath method, something like an alternative to BeautifulSoup side by side
Yeah, I can make a comparison; it's been a while since I used lxml.

So I want to use Python 3.6 and a fresh install of BS and lxml.
Then a virtual environment is the choice.
Here is the install:

Use the BS class_ call instead of the dict call; BS always returns Unicode.
For lxml, use encoding='unicode' to get Unicode when pretty printing.
Note that getting the text (Hulk) is similar for both: soup.find('p').text (see the sketch after the list below).
  • BeautifulSoup

  • lxml
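
A minimal sketch of that comparison; the HTML snippet with Hulk in a <p> tag is my assumption, not the original tab content:
from bs4 import BeautifulSoup
from lxml import etree

html = '<html><body><div class="name"><p>Hulk</p></div></body></html>'

# BeautifulSoup: class_ call instead of the dict call, always returns Unicode
soup = BeautifulSoup(html, 'lxml')
print(soup.find('div', class_='name').prettify())
print(soup.find('p').text)  # Hulk

# lxml: encoding='unicode' gives str instead of bytes from pretty print
tree = etree.fromstring(html)
div = tree.xpath("//div[@class='name']")[0]
print(etree.tostring(div, pretty_print=True, encoding='unicode'))
print(tree.find('.//p').text)  # Hulk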


CSS selectors
Both BS and lxml (also XPath) support CSS selectors.
For lxml you need to install cssselect: pip install cssselect-1.0.1-py2.py3-none-any.whl.

Here is a quick tutorial in this Pen.
See that I change the color of the text;
now I use the same method to scrape the content (see the sketch after the list below).

  • BeautifulSoup
  • lxml
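
A minimal sketch of the CSS selector version, reusing the same assumed HTML snippet as above:
from bs4 import BeautifulSoup
from lxml import etree

html = '<html><body><div class="name"><p>Hulk</p></div></body></html>'

# BeautifulSoup CSS selector
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('div.name > p').text)  # Hulk

# lxml CSS selector (this is what cssselect is needed for)
tree = etree.fromstring(html)
print(tree.cssselect('div.name > p')[0].text)  # Hulk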
#6
So here we'll look at getting source from the web with BeautifulSoup and lxml.
For both BS and lxml (which also has its own method) it is advisable to use Requests.
So I install Requests into my virtual environment:

We'll use python.org as the example.
We are getting the title tag in the head, which is <title>Welcome to Python.org</title>.
As mentioned before in part 1, use the Developer Tools in Chrome and Firefox (earlier Firebug) to navigate/inspect a website.

So we use the methods above, XPath /html/head/title and CSS selector head > title,
to get the title tag (a sketch follows the list below).

  • BeautifulSoup CSS selector

  • lxml XPath
  • lxml CSS selector
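
A minimal sketch of all three, reconstructed from the description above (the original tab content is not shown here); each print gives Welcome to Python.org:
import requests
from bs4 import BeautifulSoup
from lxml import html

url_get = requests.get('https://www.python.org/')

# BeautifulSoup CSS selector
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.select_one('head > title').text)

# lxml XPath
tree = html.fromstring(url_get.content)
print(tree.xpath('/html/head/title/text()')[0])

# lxml CSS selector (needs cssselect)
print(tree.cssselect('head > title')[0].text)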

#7
I think there should be a "hindrances to scraping" section added to the tutorial: things like identifying and switching to an iframe, and checking whether there is a JSON endpoint for the data, so you don't have to scrape at all in the first place.
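
A minimal sketch of both hindrances; the URL, frame name, and endpoint are hypothetical placeholders:
import requests
from selenium import webdriver

# Switching into an iframe before locating elements inside it
browser = webdriver.Chrome()
browser.get('http://example.com/page-with-iframe')  # hypothetical URL
browser.switch_to.frame('frame_name')  # by name, id, index, or WebElement
# ... locate elements inside the iframe here ...
browser.switch_to.default_content()  # back to the main page
browser.quit()

# If the data comes from a JSON endpoint (check the Network tab in
# Developer Tools), skip scraping entirely:
data = requests.get('http://example.com/api/data.json').json()  # hypothetical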
#8
Bump: part 2 is updated.
#9
Based on this:
https://python-forum.io/Thread-Headless-...8#pid60628

Does that mean it's better to use add_argument('--headless') rather than set_headless()?
#10
(Oct-15-2018, 11:59 PM)metulburr Wrote: Does that mean its better to use add_argument('--headless') rather than set_headless()?
Yes, that's the new way, and set_headless() is deprecated.

I post the same example here, which shows it can be run with --headless, or just comment that line out to run non-headless.
This loads the browser, searches for car, and shows images.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
import time
 
#--| Setup
options = Options()
#options.add_argument("--headless")
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(firefox_options=options, capabilities=caps, executable_path=r"geckodriver.exe")
#--| Parse or automation
browser.get('https://duckduckgo.com')
input_field = browser.find_elements_by_css_selector('#search_form_input_homepage')
input_field[0].send_keys('car' + Keys.RETURN)
time.sleep(3)
images_link = browser.find_elements_by_link_text('Images')
images_link[0].click()
time.sleep(5)
browser.quit()
With --headless it parses a value; this does not load the browser.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
 
#--| Setup
options = Options()
options.add_argument("--headless")
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(firefox_options=options, capabilities=caps, executable_path=r"geckodriver.exe")
#--| Parse
browser.get('https://duckduckgo.com')
logo = browser.find_elements_by_css_selector('#logo_homepage_link')
print(logo[0].text)
browser.quit()
Output:
About DuckDuckGo
