[Python / BS4] How to Scrape

I'm new to programming and am having trouble scraping with BS4.

I'm webmaster for a popular website (can't share it here, but it uses Disqus comments platform).

I want to scrape the vote count and the message in top comments within a set range (i.e., scrape comments with 20-200 upvotes).

I noticed that:
  • Vote count should be easy to scrape, since the upvote count appears in the anchor's class, e.g. "count-116"
  • The problem is that this anchor class isn't linked to the message text in any way I can see


I've been playing around with some code on an example site, but so far no success:

from bs4 import BeautifulSoup
import urllib.request
import re

scrape = urllib.request.urlopen('https://disqus.com/home/discussion/channel-discussdisqus/disqus_leaderboard_what_are_the_best_sports_websites/').read()
#soup = BeautifulSoup(scrape,'lxml')
soup = BeautifulSoup(scrape, 'html.parser')

for elem in soup.find_all('a', src=re.compile('count-116')):
    print (elem['src'])
^ This was my attempt to scrape the 'a' element that contains 'count-116'; I was going to run it in a loop with an increment:

count-20
count-21
count-22

...but sadly it doesn't work.
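To show the shape of what I'm after, here's roughly what I meant, rewritten to match the class attribute instead of src (just a sketch: I'm guessing the anchors really do carry a bare "count-N" class, and variations like this still find nothing on the real page):

import re

# 'soup' as parsed in the snippet above
for elem in soup.find_all('a', class_=re.compile(r'count-\d+')):
    for cls in elem.get('class', []):
        m = re.fullmatch(r'count-(\d+)', cls)
        if m and 20 <= int(m.group(1)) <= 200:   # the 20-200 upvote range
            print(cls)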

Can anyone help me understand the proper way?
see in the tutorials section:
Web scraping Part 1
Web scraping Part 2
(Oct-19-2017, 07:57 AM)Larz60+ Wrote: see in the tutorials section: Web scraping Part 1, Web scraping Part 2

Thanks! Great resources!!

Any advice on how to scrape the message if the likes fall in the specified numerical range?

[Image: screenshot of the comment's HTML in the browser inspector]

I'm confident I can scrape the number of likes after playing around with the code for a while, but how would I scrape something that has no unique identifier? I need to connect the likes to the message. It's the logic or process of doing it that's really confusing to me.
post_message = soup.find('div', class_='post-message') # target the div
paras = post_message.find_all('p') # get all 'p' tags from that div
If there are many div elements, do this in a for loop:
post_messages = soup.find_all('div', class_='post-message') # post_messages holds many divs; iterate over them
for post_message in post_messages:
    paras = post_message.find_all('p')
About the likes: you have to do the same as above, but start by scraping all the divs with class 'post-body'. For each of those, scrape the divs with class 'post-message', and for each of those, scrape the p tags, as in the sketch below.
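A minimal sketch of that nesting (the class names are guesses from your screenshot, so adjust them to the real markup):

for body in soup.find_all('div', class_='post-body'):             # one div per comment
    for post_message in body.find_all('div', class_='post-message'):
        paras = post_message.find_all('p')                        # the message text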

After getting the p tags, scrape the a tag with the votes for each post-body div. Perhaps this is generated with JavaScript, so you may have to get the page content using Selenium.

You will need to install PhantomJS to do it like in the example below, but you can use Chrome or Firefox.
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://disqus.com/home/discussion/channel-discussdisqus/disqus_leaderboard_what_are_the_best_sports_websites/'

driver = webdriver.PhantomJS()  # or webdriver.Firefox() / webdriver.Chrome()
driver.get(url)
html = driver.page_source

soup = BeautifulSoup(html, 'lxml')
You have to use Selenium because if you turn off JavaScript in your browser, you load nothing. Also, the entire page is in an iframe; it took me a while to figure out why the scraping wasn't working. Here is an example of getting by their anti-bot measures. I like to use Chrome/Firefox at first, so that I can easily troubleshoot while I look at the code the browser is actually getting, and then switch over to PhantomJS to make it headless.

But once you get past their JavaScript and the iframe meant to trip you up, you can scrape like normal.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

URL = 'https://disqus.com/home/discussion/channel-discussdisqus/disqus_leaderboard_what_are_the_best_sports_websites/'

driver = webdriver.Chrome('/home/metulburr/chromedriver')
driver.set_window_position(0,0)

driver.get(URL)
time.sleep(3)
driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'html.parser')
section = soup.find('section', {'id':'conversation'})
posts = section.find_all('li',{'class':'post'})
for post in posts:
    print(post.find('div', {'class':'post-message '}).p)  # the trailing space in 'post-message ' is deliberate; it mirrors the class string the page actually uses
If for some reason you want to return out of the iframe back to the original page:
driver.switch_to.default_content()
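And to tie this back to your 20-200 range, you can extend the loop above like this (a sketch; I'm assuming the vote anchors really carry a bare 'count-N' class like you described):

import re

for post in posts:
    msg = post.find('div', {'class': 'post-message '})
    vote_a = post.find('a', class_=re.compile(r'count-\d+'))
    if msg is None or vote_a is None:
        continue
    count = None
    for cls in vote_a.get('class', []):       # pull the number out of whichever class matched
        m = re.fullmatch(r'count-(\d+)', cls)
        if m:
            count = int(m.group(1))
    if count is not None and 20 <= count <= 200:
        print(count, msg.get_text(strip=True))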
Thanks for all the help and advice, guys! I have a knack for creating projects that are way above my skill level. Seeing how you solve problems really motivates me to learn.

I would have never thought to get selenium to work with BS4 like that. That's pretty interesting! Also the switch to iframe command is something I never knew about.

Found some great info here too: https://www.guru99.com/handling-iframes-selenium.html

I'm really confused by how that page loads in my normal Chrome browser. The iframe doesn't even show up in the source code for the main HTML document; how does that make sense? Instead it has its own source code, which I can't even access in Firefox. There's a special option in Chrome to see it.

[Image: screenshot of Chrome's option for viewing the iframe's own source]

I'm going to play around with the code you guys provided, thanks again!
If you right click -> Inspect -> Console -> the drop-down that starts with "top" and select it, you will see the ID of the iframe in Chrome.

But my browser didn't show the iframe option either via right-clicking. The way I found out about the iframe was that I printed the page source that Selenium was using and looked at it. Once I saw an iframe tag, I knew what the problem was. Printing the source never fails me. I couldn't say much about Firefox, as I mostly use Chrome and PhantomJS.

If Selenium can't find a tag and you know it's correct, then your next culprit is usually an iframe. Most of the time sites will have fewer than 3 iframes on a single page, and you don't even have to bother with the ID, as in the snippet below.
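For example, switching by index instead of looking up the ID (a quick sketch):

driver.switch_to.frame(0)           # switch into the first iframe on the page
# ...scrape inside the frame...
driver.switch_to.default_content()  # back out to the top-level document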
Sometimes there are several ways to solve a task like this.
Looking closer at it, all the posts are stored as JSON.
Then it's easier to just parse the JSON with Requests.
'Load more comments' will be cursor=0 (first 50 posts), cursor=1 (next 50 posts) in the URL address.
Example: take out the likes from the first two posts.
import requests

url = 'https://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=4946429135&forum=channel-discussdisqus&order=popular&cursor=0%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F'
r = requests.get(url)
post = r.json()
print(post["response"][0]['likes'])
print(post["response"][1]['likes']) 
Output:
116
33
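From there, filtering the original 20-200 likes range is just a loop over the JSON. A sketch (the 'raw_message' field name is my assumption about the payload; confirm it by inspecting the JSON in the network tab):

import requests

url = ('https://disqus.com/api/3.0/threads/listPostsThreaded'
       '?limit=50&thread=4946429135&forum=channel-discussdisqus'
       '&order=popular&cursor=0%3A0%3A0'
       '&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F')
posts = requests.get(url).json()['response']
for p in posts:
    if 20 <= p['likes'] <= 200:
        # 'raw_message' is assumed to hold the plain text of the post
        print(p['likes'], p.get('raw_message', ''))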
Nice!!!

How did you know there was a JSON for it?
(Oct-21-2017, 12:00 AM)metulburr Wrote: How did you know there was a JSON for it?
I did not know. I inspected the site, and the clue lies in the network traffic.

The Disqus comment plug-in system is really large (2 billion monthly unique views), so they must have some structure, and JSON is often the choice for a web API. They also use Python (Django) for the back end and JavaScript for the front end.