Web-scraping part-2
Update 1-4-2018
  • All code tested with Python 3.6.4
  • Added more Selenium stuff and a headless mode setup
  • Added a final project that plays songs on SoundCloud
  • Link to Web-Scraping part 1

In part 2 we do some practice and look at how to scrape pages that use JavaScript.
Scrape and download:

Start by doing some stuff with xkcd.
[Image: AL3Z2m.jpg]

Use the CSS selector select_one('#ctitle') for the text and find() for the image link.
import requests
from bs4 import BeautifulSoup
import webbrowser
import os

url = 'http://xkcd.com/1/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select_one('#ctitle').text
link = soup.find('div', id='comic').find('img').get('src')
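# The image src is protocol-relative (//imgs.xkcd.com/...); add the scheme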
link = link.replace('//', 'http://')

# Image title and link
print('{}\n{}'.format(text, link))

# Download image
img_name = os.path.basename(link)
img = requests.get(link)
with open(img_name, 'wb') as f_out:
    f_out.write(img.content)

# Open image in browser or default image viewer
webbrowser.open_new_tab(img_name)
Output:
Barrel - Part 1 http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg

Loop over pages and get images:

xkcd has a simple page structure: xkcd.com/1/, xkcd.com/2/, etc.
So we can loop over the pages and get the images; set start and stop.
import requests
from bs4 import BeautifulSoup
import os

def image_down(start_img, stop_img):
    for numb in range(start_img, stop_img):
        url = 'http://xkcd.com/{}/'.format(numb)
        url_get = requests.get(url)
        soup = BeautifulSoup(url_get.content, 'html.parser')
        link = soup.find('div', id='comic').find('img').get('src')
        link = link.replace('//', 'http://')
        img_name = os.path.basename(link)
        try:
            img = requests.get(link)
            with open(img_name, 'wb') as f_out:
                f_out.write(img.content)
        except Exception:
            # Just want the images; don't care about errors
            pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 20
    image_down(start_img, stop_img)

Speed it up a lot with concurrent.futures:

concurrent.futures has a minimalistic API for threading and multiprocessing.
Change only one word to switch between ThreadPoolExecutor (threading) and ProcessPoolExecutor (multiprocessing); a thread variant is shown after the code below.

Downloading 200 images (start_img=1, stop_img=200) with the code above takes about 1 minute 10 seconds.
We'll press that down to about 10 seconds for 200 images:
make all the links first and run 20 parallel tasks with ProcessPoolExecutor (multiprocessing).
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import os

def image_down(url):
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    link = soup.find('div', id='comic').find('img').get('src')
    link = link.replace('//', 'http://')
    img_name = os.path.basename(link)
    try:
        img = requests.get(link)
        with open(img_name, 'wb') as f_out:
            f_out.write(img.content)
    except Exception:
        # Just want the images; don't care about errors
        pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)
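
To switch to threads, that one-word change is all it takes; a minimal sketch of the same main block, reusing image_down() from above:
if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    # ThreadPoolExecutor instead of ProcessPoolExecutor; everything else is the same
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)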

JavaScript: why do I not get all content?

JavaScript is used all over the web because of its unique position: it runs in the browser (client side).
This can make parsing more difficult,
because Requests/bs4/lxml cannot get everything that is executed/rendered by JavaScript.
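
A quick way to see the problem (a minimal sketch; it assumes the same site and CSS class as the Selenium example below):
import requests
from bs4 import BeautifulSoup

url = 'https://howsecureismypassword.net/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
# Expect None: the result text only exists after JavaScript has run
print(soup.select_one('.result__text'))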

There are ways to overcome this; we're going to use Selenium.
Installation

An example with How Secure Is My Password?
The site gives real-time info using JavaScript; we'll enter the password 123hello with Selenium,
then hand the source code over to BeautifulSoup for parsing.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
'''
#-- FireFox
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(capabilities=caps)
'''

url = 'https://howsecureismypassword.net/'
browser.get(url)
inputElement = browser.find_elements_by_class_name("password-input")[0]
inputElement.send_keys("123hello")
inputElement.send_keys(Keys.RETURN)
time.sleep(5) #seconds

# Give source code to BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'html.parser')

# Get JavaScript info from site
top_text = soup.select_one('.result__text.result__before')
crack_time = soup.select_one('.result__text.result__time')
bottom_text  = soup.select_one('.result__text.result__after')
print(top_text.text)
print(crack_time.text)
print(bottom_text.text)
time.sleep(5) #seconds
browser.close()
Output:
It would take a computer about 1 minute to crack your password
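
A tip: time.sleep(5) is a blunt wait. Selenium's WebDriverWait can instead wait until the JavaScript result is present (a sketch, assuming the same page and class name):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the result element to be rendered
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'result__text')))
soup = BeautifulSoup(browser.page_source, 'html.parser')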

Headless (not loading the browser):

Both Chrome and Firefox have now released headless mode in their newer drivers.
This means the browser does not start (visibly) as in the example above.
We'll look at a simple setup for both Chrome and Firefox.

Firefox:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

#--| Setup
options = Options()
options.set_headless(headless=True)
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(firefox_options=options, capabilities=caps, executable_path=r"path to geckodriver")
#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
Output:
# Python 3: Fibonacci series up to n
Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'path to chromedriver')
#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
Output:
# Python 3: Fibonacci series up to n

Final project:
[Image: Qr8P7Q.png]
Here we loop over the most played tracks on SoundCloud this week.
We first have to hover the mouse over the play button (ActionChains/hover), then click the play button.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

def play_song(how_many_songs, time_to_play):
    browser = webdriver.Chrome()
    url = 'https://soundcloud.com/charts/top?genre=all-music&country=all-countries'
    browser.get(url)
    time.sleep(3)
    for song_number in range(1, how_many_songs+1):
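        # Find the play button for this song in the chart list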
        play = browser.find_elements_by_xpath('//*[@id="content"]/div/div/div[1]/div[2]/div/div[3]/ul/li[{}]/div/div[2]/div[2]/a'.format(song_number))[0]
        hover = ActionChains(browser).move_to_element(play)
        hover.perform()
        play.click()
        time.sleep(time_to_play)
    browser.quit()

if __name__ == '__main__':
    how_many_songs = 5
    time_to_play = 15  # sec
    play_song(how_many_songs, time_to_play)