Python Forum
Combining selenium and beautifulsoup for web scraping
#1
I have used BeautifulSoup to scrape data from web pages, and Selenium to get into web pages that need a login.
Now my question is: can I enter a web page using Selenium and then scrape data from it using BeautifulSoup?
I have observed that BeautifulSoup never works on pages that need a login, so I have to log in to the website with Selenium first. If BeautifulSoup will not work on this kind of page, how do I scrape the images or text associated with a particular HTML tag on that page? I know it is very easy to scrape text or images from normal websites using BeautifulSoup and urlretrieve.
#2
(Jan-29-2018, 07:05 AM)sumandas89 Wrote: can I enter into webpages using selenium and then scrape data from there using beautifulsoup?
Yes. If you need to get past JavaScript, you can use Selenium to get the full page content and pass it to BS.
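import time
from bs4 import BeautifulSoup
from selenium import webdriver
WEBSITE = 'https://example.com'  # placeholder: put the URL you want to scrape here
driver = webdriver.Firefox()
driver.get(WEBSITE)
# Delay of some kind to let the page load: time.sleep(3), or a Selenium wait for an element to be visible
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'html.parser')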
However, Selenium has its own methods to navigate the HTML, and you will need them to get past multiple JavaScript pages or mouse clicks. So it really depends on whether you still need BS after already using Selenium.
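To make the "wait for an element to be visible" option concrete, here is a minimal sketch that swaps the fixed sleep for an explicit Selenium wait, then hands the rendered HTML to BeautifulSoup to collect image URLs. The function names, the CSS selector argument and the timeout are my own placeholders, not anything specific to the site being discussed.

```python
from urllib.parse import urljoin


def absolute_urls(base_url, links):
    """Resolve possibly relative links (e.g. img src values) against the page URL."""
    return [urljoin(base_url, link) for link in links if link]


def rendered_image_urls(url, css_selector, timeout=10):
    """Load url in Firefox, wait until css_selector is visible, return image URLs."""
    # Third-party imports are kept local so absolute_urls() works without them.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Explicit wait: raises TimeoutException if the element never becomes visible.
        WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()  # always close the browser, even on errors

    # Some <img> tags have no src attribute; absolute_urls() skips those.
    srcs = [img.get("src") for img in soup.find_all("img")]
    return absolute_urls(url, srcs)
```

The returned list can then be passed to urlretrieve one URL at a time. Resolving against the page URL matters because many pages use relative src values.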

(Jan-29-2018, 07:05 AM)sumandas89 Wrote: I observed that beautifulsoup never work on those pages needs login, so I need to login first to that website using selenium
You can log in to a website with the requests module by saving cookies, etc. Selenium is not required to log in to a website unless the login relies on JavaScript.
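As a rough sketch of the requests approach, assuming a plain HTML login form: a requests.Session keeps the cookies from the login response and sends them with later requests. The URLs and form field names below are hypothetical, so check the real form's action URL and input names in your browser's developer tools.

```python
def fetch_after_login(session, login_url, page_url, credentials):
    """POST the login form, then GET a protected page with the same session."""
    response = session.post(login_url, data=credentials)
    response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
    # The session stores cookies set by the login response and sends
    # them automatically with the next request.
    page = session.get(page_url)
    page.raise_for_status()
    return page.text


def main():
    # Third-party imports are local so the helper above has no hard dependency.
    import requests
    from bs4 import BeautifulSoup

    # Hypothetical URLs and form field names, for illustration only.
    credentials = {"username": "me", "password": "secret"}
    with requests.Session() as session:
        html = fetch_after_login(
            session,
            "https://example.com/login",
            "https://example.com/members",
            credentials,
        )
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title)
```

Note that a failed login often still returns status 200, so raise_for_status() alone does not prove you are logged in. Many forms also include a hidden token field that has to be read from the login page and sent back with the credentials.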
#3
(Jan-29-2018, 07:21 AM)metulburr Wrote: Yes if you need to get past javascript, you can use selenium to get the full page content and pass it to BS.
I have seen that this solution sometimes doesn't work. Some content is visible on the rendered page but is missing from the page source. I have seen this behaviour particularly with Facebook and have found no solution for it.
#4
Facebook specifically is hard to scrape because they actively try to stop bots from doing anything on their site, whereas other sites cannot afford such measures. Any site that has an API is going to try to stop bots, since they want people to go through the API, where access can be limited. But Facebook is probably the worst because of their "unlimited funds". I've seen sites try to break bots by randomizing their ID names each session, or by nesting iframes so the content does not show in the initial source, along with the usual JavaScript and code obfuscation meant to defeat basic bots. Usually, though, the problem is not any of that, but your code.

If the site you are scraping is Facebook, then yes, you need Selenium.

(Jan-30-2018, 01:39 PM)sumandas89 Wrote: It happens that some contents are available and in the pages but not available in the page source
Sometimes they put things in iframes, so you have to switch to that frame to be able to scrape it. But I am unsure, as you have not said exactly which page you are looking at.
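For the iframe case, a minimal sketch of what switching looks like in Selenium, assuming the missing content sits in a top-level iframe (nested iframes would need the same steps repeated inside each frame). The helper name and the example URL are made up for illustration.

```python
def collect_sources(driver, frames):
    """Return the top-level page source followed by the source of each given iframe."""
    sources = [driver.page_source]
    for frame in frames:
        driver.switch_to.frame(frame)       # enter the iframe's document
        sources.append(driver.page_source)  # now returns the iframe's HTML
        driver.switch_to.default_content()  # go back to the top-level page
    return sources


def main():
    # Third-party imports are local so collect_sources() has no hard dependency.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("https://example.com")  # placeholder URL
        frames = driver.find_elements(By.TAG_NAME, "iframe")
        sources = collect_sources(driver, frames)
    finally:
        driver.quit()

    # Parse every document separately; each iframe is its own HTML page.
    for html in sources:
        soup = BeautifulSoup(html, "html.parser")
        print(len(soup.find_all("a")), "links found")
```

If the content still does not show up after this, it is probably being added by JavaScript after load rather than sitting in a frame, and a wait is what you need.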

Show your code and explain which page on Facebook (if possible) you are having trouble scraping, and I can try to see why. If you search these forums, there have been previous discussions about scraping Facebook, with examples of some basic tasks already given. For example:
https://python-forum.io/Thread-facebook-...t=facebook