Update 1-4-2018
- All tested Python 3.6.4
- Added more Selenium stuff and headless mode setup
- Added Final projects which play songs on SoundCloud
- Link to Web-Scraping part 1
In part 2 we do some practice and look at how to scrape pages that use JavaScript.
Scrape and download:
Start by doing some stuff with xkcd.
[Image: AL3Z2m.jpg]
Using a CSS selector select_one('#ctitle') for the text, and find() for the image link.
import requests
from bs4 import BeautifulSoup
import webbrowser
import os

url = 'http://xkcd.com/1/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select_one('#ctitle').text
link = soup.find('div', id='comic').find('img').get('src')
link = link.replace('//', 'http://')
# Image title and link
print('{}\n{}'.format(text, link))
# Download image
img_name = os.path.basename(link)
img = requests.get(link)
with open(img_name, 'wb') as f_out:
    f_out.write(img.content)
# Open image in browser or default image viewer
webbrowser.open_new_tab(img_name)
Output:
Barrel - Part 1
http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg
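As a quick aside, the same image link can also be grabbed with one CSS selector instead of the chained find() calls (a minimal sketch against the same page):
# Sketch: CSS selector equivalent of find('div', id='comic').find('img')
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://xkcd.com/1/').content, 'lxml')
# Prints the protocol-relative link, e.g. //imgs.xkcd.com/comics/...
print(soup.select_one('#comic img').get('src'))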
Loop over pages and get images:
xkcd has a simple page structure:
xkcd.com/1/ xkcd.com/2/ ... etc.
So we can loop over the pages and get the images; just set start and stop.
import requests
from bs4 import BeautifulSoup
import os

def image_down(start_img, stop_img):
    for numb in range(start_img, stop_img):
        url = 'http://xkcd.com/{}/'.format(numb)
        url_get = requests.get(url)
        soup = BeautifulSoup(url_get.content, 'html.parser')
        link = soup.find('div', id='comic').find('img').get('src')
        link = link.replace('//', 'http://')
        img_name = os.path.basename(link)
        try:
            img = requests.get(link)
            with open(img_name, 'wb') as f_out:
                f_out.write(img.content)
        except:
            # Just want images, don't care about errors
            pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 20
    image_down(start_img, stop_img)
Speed it up a lot with concurrent.futures:
concurrent.futures has a minimalistic API for threading and multiprocessing:
only one word changes to switch between ThreadPoolExecutor (threading) and ProcessPoolExecutor (multiprocessing).
Downloading 200 images (start_img=1, stop_img=200) takes about 1 minute 10 seconds with the code above.
We will press that down to about 10 seconds for 200 images by building all the links and loading 20 parallel tasks with ProcessPoolExecutor (multiprocessing).
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import os

def image_down(url):
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    link = soup.find('div', id='comic').find('img').get('src')
    link = link.replace('//', 'http://')
    img_name = os.path.basename(link)
    try:
        img = requests.get(link)
        with open(img_name, 'wb') as f_out:
            f_out.write(img.content)
    except:
        # Just want images, don't care about errors
        pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)
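To show the one-word switch mentioned above, here is the same main block with ThreadPoolExecutor (threading) instead; for I/O-bound downloads like this, threads work just as well (a sketch that reuses image_down() from the script above):
import concurrent.futures

# Same pattern with threads: only the executor class name changes
if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)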
JavaScript, why do I not get all content?
JavaScript is used all over the web because of its unique position to run in the browser (client side).
This can make parsing more difficult, because Requests/bs4/lxml cannot get content that is executed/rendered by JavaScript.
There are ways to overcome this; we will use Selenium.
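A quick way to see this for yourself (a sketch, using the How Secure Is My Password? page from the example below): requests only gets the initial HTML, so the JavaScript-rendered result is missing:
# Sketch: requests only sees the initial HTML, not JavaScript-rendered content
import requests
from bs4 import BeautifulSoup

url = 'https://howsecureismypassword.net/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# The crack-time text is filled in by JavaScript, so it is not in the raw HTML
print(soup.select_one('.result__text.result__time'))  # most likely None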
Installation
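A minimal check that the setup works (a sketch; assumes Selenium is installed with pip install selenium, and that chromedriver has been downloaded and is on PATH):
# Sketch: verify the Selenium + driver setup (assumes chromedriver on PATH)
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://xkcd.com/')
print(browser.title)  # should print the page title
browser.quit()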
Example with How Secure Is My Password?
This site gives real-time info using JavaScript. We enter the password 123hello with Selenium, then give the source code to BeautifulSoup for parsing.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
'''
#-- FireFox
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(capabilities=caps)
'''
url = 'https://howsecureismypassword.net/'
browser.get(url)
inputElement = browser.find_elements_by_class_name("password-input")[0]
inputElement.send_keys("123hello")
inputElement.send_keys(Keys.RETURN)
time.sleep(5)  # seconds
# Give source code to BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'html.parser')
# Get JavaScript info from site
top_text = soup.select_one('.result__text.result__before')
crack_time = soup.select_one('.result__text.result__time')
bottom_text = soup.select_one('.result__text.result__after')
print(top_text.text)
print(crack_time.text)
print(bottom_text.text)
time.sleep(5)  # seconds
browser.close()
Output:
It would take a computer about
1 minute
to crack your password
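Instead of a fixed time.sleep, Selenium can also wait explicitly until the result element shows up (a sketch using WebDriverWait, assuming the same CSS selector as above):
# Sketch: drop-in replacement for the first time.sleep(5) in the script above
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Reuses the browser object from the script above
wait = WebDriverWait(browser, 10)  # give up after 10 seconds
crack_time = wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.result__text.result__time')))
print(crack_time.text)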
Headless (not loading the browser):
Both Chrome and Firefox have now released headless mode in their newer drivers.
This means that the browser does not start (visibly) as in the example above.
We will look at a simple setup for both Chrome and Firefox.
Firefox:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

#--| Setup
options = Options()
options.set_headless(headless=True)
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(firefox_options=options, capabilities=caps,
                            executable_path=r"path to geckodriver")
#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
Output:
# Python 3: Fibonacci series up to n
Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
browser = webdriver.Chrome(chrome_options=chrome_options,
                           executable_path=r'path to chromedriver')
#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
Output:
# Python 3: Fibonacci series up to n
Final projects:
Here we loop over the most played tracks on SoundCloud this week.
We first have to hover the mouse over the play button (ActionChains/hover), then click it.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

def play_song(how_many_songs, time_to_play):
    browser = webdriver.Chrome()
    url = 'https://soundcloud.com/charts/top?genre=all-music&country=all-countries'
    browser.get(url)
    time.sleep(3)
    for song_number in range(1, how_many_songs+1):
        play = browser.find_elements_by_xpath('//*[@id="content"]/div/div/div[1]/div[2]/div/div[3]/ul/li[{}]/div/div[2]/div[2]/a'.format(song_number))[0]
        # Hover over the play button (ActionChains), then click it
        hover = ActionChains(browser).move_to_element(play)
        hover.perform()
        play.click()
        time.sleep(time_to_play)
    browser.quit()

if __name__ == '__main__':
    how_many_songs = 5
    time_to_play = 15  # sec
    play_song(how_many_songs, time_to_play)