Web-scraping part-2 - Printable Version - Python Forum (https://python-forum.io), Thread: Web-scraping part-2 (/thread-695.html)
Web-scraping part-2 - snippsat - Oct-30-2016 Update 1-4-2018
In part 2 we do some practice and look at how to scrape pages that use JavaScript.

Scrape and download:
Start by doing some stuff with xkcd.
[Image: AL3Z2m.jpg]
Using a CSS selector, select_one('#ctitle'), for the title text and find() for the image link.

import requests
from bs4 import BeautifulSoup
import webbrowser
import os

url = 'http://xkcd.com/1/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select_one('#ctitle').text
link = soup.find('div', id='comic').find('img').get('src')
link = link.replace('//', 'http://')

# Image title and link
print('{}\n{}'.format(text, link))

# Download image
img_name = os.path.basename(link)
img = requests.get(link)
with open(img_name, 'wb') as f_out:
    f_out.write(img.content)

# Open image in browser or default image viewer
webbrowser.open_new_tab(img_name)
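The image link can also be picked up with a CSS selector instead of find(). A minimal sketch, assuming the same #comic markup that the find() call above relies on:

# Sketch: the same title/link lookup done purely with CSS selectors.
# Assumes the page still has a div with id="comic" holding the image.
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://xkcd.com/1/').content, 'lxml')
title = soup.select_one('#ctitle').text
img_link = soup.select_one('#comic img').get('src').replace('//', 'http://')
print(title)
print(img_link)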
Loop over pages and get images:
xkcd has a simple page structure: xkcd.com/1/, xkcd.com/2/, etc.
So we can loop over the pages and get the images; set start and stop.

import requests
from bs4 import BeautifulSoup
import os

def image_down(start_img, stop_img):
    for numb in range(start_img, stop_img):
        url = 'http://xkcd.com/{}/'.format(numb)
        url_get = requests.get(url)
        soup = BeautifulSoup(url_get.content, 'html.parser')
        link = soup.find('div', id='comic').find('img').get('src')
        link = link.replace('//', 'http://')
        img_name = os.path.basename(link)
        try:
            img = requests.get(link)
            with open(img_name, 'wb') as f_out:
                f_out.write(img.content)
        except:
            # Just want images, don't care about errors
            pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 20
    image_down(start_img, stop_img)

Speed it up a lot with concurrent.futures:
concurrent.futures has a minimalistic API for threading and multiprocessing. Only one word needs to change to switch between ThreadPoolExecutor (threading) and ProcessPoolExecutor (multiprocessing). Downloading 200 images (start_img=1, stop_img=200) takes about 1 minute 10 seconds with the code above. The version below presses that down to about 10 seconds for 200 images, by building all the links and running 20 parallel tasks with ProcessPoolExecutor (multiprocessing).

import requests
from bs4 import BeautifulSoup
import concurrent.futures
import os

def image_down(url):
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    link = soup.find('div', id='comic').find('img').get('src')
    link = link.replace('//', 'http://')
    img_name = os.path.basename(link)
    try:
        img = requests.get(link)
        with open(img_name, 'wb') as f_out:
            f_out.write(img.content)
    except:
        # Just want images, don't care about errors
        pass

if __name__ == '__main__':
    start_img = 1
    stop_img = 200
    with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
        for numb in range(start_img, stop_img):
            url = 'http://xkcd.com/{}/'.format(numb)
            executor.submit(image_down, url)

JavaScript - why do I not get all content?
JavaScript is used all over the web because of its unique position to run in the browser (client side). This can make parsing more difficult, because Requests/bs4/lxml cannot get everything that is executed/rendered by JavaScript. There are ways to overcome this; here we are going to use Selenium.
Installation
Example with How Secure Is My Password?
This site gives real-time info using JavaScript. We enter the password 123hello with Selenium, then give the page source to BeautifulSoup for parsing.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
'''
#-- FireFox
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(capabilities=caps)
'''
url = 'https://howsecureismypassword.net/'
browser.get(url)
inputElement = browser.find_elements_by_class_name("password-input")[0]
inputElement.send_keys("123hello")
inputElement.send_keys(Keys.RETURN)
time.sleep(5)  # seconds

# Give source code to BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'html.parser')

# Get JavaScript info from site
top_text = soup.select_one('.result__text.result__before')
crack_time = soup.select_one('.result__text.result__time')
bottom_text = soup.select_one('.result__text.result__after')
print(top_text.text)
print(crack_time.text)
print(bottom_text.text)
time.sleep(5)  # seconds
browser.close()
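Instead of the fixed time.sleep(5) before grabbing the page source, you can wait for the result element itself. A minimal sketch, reusing the browser object and the CSS class from the example above (the site's markup may of course change):

# Sketch: explicit wait instead of time.sleep(5); assumes the browser object
# and the .result__text.result__time class from the example above.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
crack_time_elem = wait.until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, '.result__text.result__time')))
print(crack_time_elem.text)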
Headless (not loading the browser):
Both Chrome and FireFox have now released headless mode in their newer drivers. This means that the browser does not start (visibly) as in the example above. Here is a simple setup for both Chrome and FireFox.
FireFox:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

#--| Setup
options = Options()
options.set_headless(headless=True)
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
browser = webdriver.Firefox(firefox_options=options, capabilities=caps, executable_path=r"path to geckodriver")

#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
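On newer Selenium releases (roughly 3.8 and later) set_headless() and the firefox_options keyword raise deprecation warnings. A minimal sketch of the same setup with the newer spelling, assuming such a release; the geckodriver path still has to be filled in:

# Sketch: headless Firefox with the newer Options API (assumes Selenium 3.8+).
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # replaces options.set_headless(headless=True)
browser = webdriver.Firefox(options=options, executable_path=r"path to geckodriver")
browser.get('https://www.python.org/')
print(browser.title)
browser.quit()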
Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'path to chromedriver')

#--| Parse
browser.get('https://www.python.org/')
time.sleep(2)
t = browser.find_element_by_xpath('//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)
browser.quit()
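A side note on the element lookup: find_element_by_xpath() is a convenience wrapper around the generic find_element() call, and newer Selenium releases steer towards the generic form. A minimal sketch, reusing the browser object from the snippet above:

# Sketch: generic locator form; equivalent to find_element_by_xpath(...) and
# assumes the browser object from the headless Chrome example above.
from selenium.webdriver.common.by import By

t = browser.find_element(By.XPATH, '//*[@id="dive-into-python"]/ul[2]/li[1]/div[1]/pre/code/span[1]')
print(t.text)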
Final project:
Here we loop over the most played tracks on SoundCloud this week. We first have to activate a mouse-over on the play button (ActionChains/hover), then click the play button.

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

def play_song(how_many_songs, time_to_play):
    browser = webdriver.Chrome()
    url = 'https://soundcloud.com/charts/top?genre=all-music&country=all-countries'
    browser.get(url)
    time.sleep(3)
    for song_number in range(1, how_many_songs+1):
        play = browser.find_elements_by_xpath('//*[@id="content"]/div/div/div[1]/div[2]/div/div[3]/ul/li[{}]/div/div[2]/div[2]/a'.format(song_number))[0]
        hover = ActionChains(browser).move_to_element(play)
        hover.perform()
        play.click()
        time.sleep(time_to_play)
    browser.quit()

if __name__ == '__main__':
    how_many_songs = 5
    time_to_play = 15  # sec
    play_song(how_many_songs, time_to_play)

RE: Web-scraping part-2 - metulburr - Oct-31-2016

I think the following should be included in the tutorial pertaining to Selenium.

proper waiting instead of using time.sleep
Sometimes you need the browser to just wait while the page is loading, otherwise it will fail because the content is not yet loaded. Instead of arbitrarily waiting X number of seconds (time.sleep), you can use WebDriverWait to wait... let's say until the element you are looking for exists. Then you are not waiting longer than needed, or waiting too short a time and failing as well.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
...
WebDriverWait(browser, 3).until(EC.presence_of_element_located((By.ID, 'global-new-tweet-button')))

This will wait for the presence of the element with the ID "global-new-tweet-button". It will time out after 3 seconds of not finding it. You can of course extend this timeout as needed. The presence of an element located by ID is not the only thing we can search for. Below are the built-in methods to search for elements based on circumstances and content.

references
These are the convenience methods in Selenium that are commonly used to search for elements.
These are the locating methods in Selenium that are commonly used to search for elements.
You can find the definition of each expected support condition here.
more info:
https://selenium-python.readthedocs.io/waits.html
https://selenium-python.readthedocs.io/locating-elements.html

performing key combos
Sometimes we want to perform key combinations to do things in the browser.

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

ActionChains(browser).key_down(Keys.COMMAND).send_keys("s").key_up(Keys.COMMAND).perform()

In this specific example Firefox will execute the save shortcut to bring up the save-as menu (Keys.COMMAND is the Mac Command key; use Keys.CONTROL for Ctrl+S).

switching or opening tabs
Switching tabs is often needed because selecting things may bring up data in a whole different tab, so we need to switch to and from these tabs.
# Opens a new tab
driver.execute_script("window.open()")

# Switch to the newly opened tab
driver.switch_to.window(driver.window_handles[1])

# Navigate to new URL in new tab
driver.get("https://google.com")
# Run other commands in the new tab here

You're then able to close the original tab as follows:

# Switch to original tab
driver.switch_to.window(driver.window_handles[0])
# Close original tab
driver.close()

# Switch back to newly opened tab, which is now in position 0
driver.switch_to.window(driver.window_handles[0])

Or close the newly opened tab:

# Close current tab
driver.close()
# Switch back to original tab
driver.switch_to.window(driver.window_handles[0])

scrolling to the bottom of the page regardless of length
This is for cases where pages do not load their entire content until you scroll, such as Facebook. This will scroll to the bottom of the page, wait for the rest to load (via time.sleep, be aware), and keep repeating until it is at the bottom. To make this more portable it uses time.sleep, but you can wait for a specific element on your website if it needs to be faster.

import time

def scroll_to_bottom(driver):
    #driver = self.browser
    SCROLL_PAUSE_TIME = 0.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

# call scroll_to_bottom(browser) when you want it to scroll to the bottom of the page

Handle exceptions with built-ins:
Use a try and except to get you where you want to go.

>>> import selenium.common.exceptions as EX
>>> help(EX)

Does the site use JavaScript in the first place?
An easy way to test whether JavaScript is blocking you in the first place is to turn off JavaScript in your browser and reload the website. If what you are parsing is missing, that is a quick way to determine it is generated by JavaScript... requiring Selenium. Another way is to check the JavaScript source code on the website regarding the element you are parsing. If there is a JavaScript call in the header, then you will need Selenium to parse it.

Search for unique elements
Often you are parsing sites that do not want a bot to parse them. You need to find a unique element for the content you are parsing. If it does not have one, then search higher in the HTML for one to use as a point of reference, then work your way down to the exact element. More often than not the ID is unique enough. By far the quickest way is to search for the XPath of the element, but note that this can change over time. Websites change over time and can break your code, and you will need to update the code as the website changes.

RE: Web-scraping part-2 - metulburr - Oct-31-2016

Are you the one that used lxml a lot? It would be nice to see a side-by-side comparison of scraping with BS and lxml.

RE: Web-scraping part-2 - snippsat - Oct-31-2016

Quote: Are you the one that used lxml a lot?
I used it more on its own before; I still use it, but now mostly as a parser through BeautifulSoup(url_get.content, 'lxml'). That way BS gets the speed of the lxml parser. I use BeautifulSoup(url_get.content, 'html.parser') in the tutorial, because then there is no need to install lxml.
RE: Web-scraping part-2 - metulburr - Jan-29-2017

I was thinking more of the xpath method, something like an alternative to BeautifulSoup side by side.

from lxml import etree
from bs4 import BeautifulSoup

html = '<html><head><title>foo</title></head><body><div class="other"></div><div class="name"><p>foo</p></div><div class="name"><ul><li>bar</li></ul></div></body></html>'

tree = etree.fromstring(html)
for elem in tree.xpath("//div[@class='name']"):
    print(etree.tostring(elem, pretty_print=True))

soup = BeautifulSoup(html, 'lxml')
for elem in soup.find_all('div', {'class': 'name'}):
    print(elem.prettify())

RE: Web-scraping part-2 - snippsat - Jan-29-2017

(Jan-29-2017, 01:35 AM)metulburr Wrote: I was thinking more of xpath method, something like an alternative to BeautifulSoup side by side
Yeah, I can make a comparison; it has been a while since I used lxml. I want to use Python 3.6 and a fresh install of BS and lxml, so a virtual environment is the choice. Here is the install:
Using the BS class_ call instead of the dict call; BS always returns Unicode. For lxml, use encoding='unicode' to get Unicode when using pretty print. See that getting the text (Hulk) is similar for both, e.g. soup.find('p').text.
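A minimal sketch of the kind of side-by-side lookup described above; the HTML snippet (a div with class "name" holding <p>Hulk</p>) is made up here for illustration:

# Sketch of the side-by-side lookup described above; the HTML snippet is made up
# for illustration.
from lxml import etree
from bs4 import BeautifulSoup

html = '<div class="name"><p>Hulk</p></div>'

# BeautifulSoup: class_ keyword instead of a dict call, always returns Unicode
soup = BeautifulSoup(html, 'lxml')
print(soup.find('div', class_='name').prettify())
print(soup.find('p').text)  # Hulk

# lxml: encoding='unicode' so tostring() gives str (Unicode) instead of bytes
tree = etree.fromstring(html)
div = tree.xpath("//div[@class='name']")[0]
print(etree.tostring(div, pretty_print=True, encoding='unicode'))
print(div.find('p').text)  # Hulk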
CSS selectors
Both BS and lxml (which also has XPath) support CSS selectors. For lxml you need to install the CSS selector package: pip install cssselect-1.0.1-py2.py3-none-any.whl. Here is a quick tutorial in this Pen. See that I change the color of the text; the same method is now used to scrape the content.
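A minimal sketch of the same lookup done with CSS selectors on both sides, reusing the made-up Hulk snippet from above (the lxml side needs the cssselect package installed):

# Sketch: CSS selectors with BS (select_one) and lxml (cssselect).
# Same made-up snippet as in the sketch above; cssselect must be installed for lxml.
from bs4 import BeautifulSoup
import lxml.html

html = '<html><body><div class="name"><p>Hulk</p></div></body></html>'

soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('div.name > p').text)  # Hulk

root = lxml.html.fromstring(html)
print(root.cssselect('div.name > p')[0].text)  # Hulk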
RE: Web-scraping part-2 - snippsat - Jan-30-2017

So here we are going to look at getting source from the web with BeautifulSoup and lxml. For both BS and lxml (which also has its own method) it is advisable to use Requests, so I install Requests into my virtual environment:
We use python.org as an example. We are getting the head tag, which is <title>Welcome to Python.org</title>. As mentioned before in part-1, use the Developer Tools in Chrome and FireFox (earlier FireBug) to navigate/inspect the web site. So we use the methods above, the XPath /html/head/title and the CSS selector head > title, to get the head title tag.
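A minimal sketch of what that lookup can look like, with Requests feeding both parsers and the XPath and CSS selector given above:

# Sketch: fetch python.org with Requests, then grab <title> three ways
# (BS CSS selector, lxml XPath, lxml CSS selector).
import requests
from bs4 import BeautifulSoup
import lxml.html

response = requests.get('https://www.python.org/')

# BeautifulSoup with a CSS selector
soup = BeautifulSoup(response.content, 'lxml')
print(soup.select_one('head > title').text)  # Welcome to Python.org

# lxml with XPath and with a CSS selector
root = lxml.html.fromstring(response.content)
print(root.xpath('/html/head/title/text()')[0])  # Welcome to Python.org
print(root.cssselect('head > title')[0].text)  # Welcome to Python.org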
RE: Web-scraping part-2 - metulburr - Oct-21-2017

I think there should be a "block section" added to the tutorial: hindrances to scraping, like identifying and switching to an iframe, as well as identifying whether there is a JSON feed for the data, so you do not have to scrape at all in the first place.

RE: Web-scraping part-2 - snippsat - Apr-01-2018

Bump, part-2 is updated.

RE: Web-scraping part-2 - metulburr - Oct-15-2018

Based on this: https://python-forum.io/Thread-Headless-browser?pid=60628#pid60628
Does that mean it's better to use add_argument('--headless') rather than set_headless()?
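On the iframe and JSON points raised in the Oct-21-2017 post above, a minimal sketch; the frame locator and the JSON URL are hypothetical placeholders, found in practice with the browser Developer Tools (Elements tab for iframes, Network/XHR tab for JSON calls):

# Sketch for the "hindrances" suggestion above. The frame id and the JSON URL
# are hypothetical placeholders.
import requests
from selenium import webdriver

# 1) Content inside an iframe is not reachable until you switch into it
browser = webdriver.Chrome()
browser.get('https://example.com/page-with-iframe')  # placeholder URL
browser.switch_to.frame('frame-id-or-name')          # placeholder locator
print(browser.page_source)                           # now shows the frame's HTML
browser.switch_to.default_content()                  # back to the main page
browser.quit()

# 2) If the data comes from a JSON call, skip the HTML parsing entirely
data = requests.get('https://example.com/api/data.json').json()  # placeholder URL
print(data)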