Python Forum
web scraping problem
#1
I'm trying to build a web scraper using BS4. I want to filter news articles and get only the href links of the articles that have more than a given number of votes. For example, my code looks something like this:


import requests
from bs4 import BeautifulSoup


res = requests.get("https://news.ycombinator.com/")
soup = BeautifulSoup(res.text,"html.parser")
titles=soup.select(".titleline")
votes=soup.select(".score")
selected_titles=[]
amount_of_votes=int(input("How many votes should the articles have?\n"))



for index,vote in enumerate(votes):
    if int(vote.text.split()[0]) > amount_of_votes:
        href=titles[index].get('href')
        print(href)
        selected_titles.append({"title":titles[index].getText(),"link":href})
I am printing the href inside the condition to check whether it is working. My problem is that all of the href values come out as None. I am doing a bootcamp, and the instructor's code was:

        href=titles[index].get('href',None)
I have tried both versions, the one I posted first and the instructor's, but every href value is still None. I want to get only the URLs of articles that have more than a given number of votes. Please help.
#2
I think the problem is this line:

href=titles[index].get('href')
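Each element of titles is the span with class titleline, which has no href attribute of its own, so .get('href') returns None. The link is on the a tag nested inside that span. Try: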

hres = titles[index].find('a', href=True)
print(hres['href'])
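To see why the first version prints None, here is a small sketch on a hand-written snippet. The HTML is a made-up sample that I am assuming resembles one Hacker News title cell; it is not copied from the live page, and the names SAMPLE_HTML, span_href and link_href are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, assumed to be shaped like one Hacker News title cell.
SAMPLE_HTML = """
<span class="titleline">
  <a href="https://example.com/story">Example story</a>
  <span class="sitebit comhead"> (example.com)</span>
</span>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_span = soup.select(".titleline")[0]

# The span only has a class attribute, so asking it for href gives None.
span_href = title_span.get("href")

# The anchor nested inside the span is the element that carries href.
link = title_span.find("a", href=True)
link_href = link["href"] if link is not None else None

print(span_href)
print(link_href)
```

The `link is not None` check is there because find() returns None when no matching tag exists, and indexing None would raise a TypeError.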
Altogether now:

import requests
from bs4 import BeautifulSoup

amount_of_votes = int(input("How many votes should the articles have?\n"))

def myApp(aov):
    res = requests.get("https://news.ycombinator.com/")
    soup = BeautifulSoup(res.text, "html.parser")
    # Each title sits in <span class="titleline">; the link is the <a> inside it.
    titles = soup.find_all("span", class_="titleline")
    votes = soup.find_all("span", class_="score")
    selected_titles = []
    for index, vote in enumerate(votes):
        # vote.text looks like "123 points"; keep only the number.
        resv = vote.text.split()[0]
        print(resv)
        if int(resv) > aov:
            print(index)
            # Find the anchor inside the span instead of calling .get('href') on the span.
            hres = titles[index].find('a', href=True)
            print(hres['href'])
            selected_titles.append({"title": titles[index].getText(), "link": hres['href']})

    return selected_titles

result = myApp(amount_of_votes)
for r in result:
    print(r)
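One thing to watch with the index approach: if any row on the page has a title but no score (I believe job postings are like that), the titles and votes lists have different lengths and the indexes drift apart. Below is a rough sketch of pairing each title with its own score instead. It assumes each story row is a tr with class athing and an id, and that the matching score span has the id "score_" plus that row id. That is my reading of the page markup and may change. The function name select_stories and the sample HTML are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: three rows, the middle one has no score.
SAMPLE_HTML = """
<table>
<tr class="athing" id="1"><td><span class="titleline"><a href="https://example.com/a">Story A</a></span></td></tr>
<tr><td class="subtext"><span class="score" id="score_1">150 points</span></td></tr>
<tr class="athing" id="2"><td><span class="titleline"><a href="https://example.com/job">Job ad</a></span></td></tr>
<tr><td class="subtext">1 hour ago</td></tr>
<tr class="athing" id="3"><td><span class="titleline"><a href="https://example.com/c">Story C</a></span></td></tr>
<tr><td class="subtext"><span class="score" id="score_3">42 points</span></td></tr>
</table>
"""


def select_stories(html, min_votes):
    """Return title, link and votes for stories with more than min_votes points."""
    soup = BeautifulSoup(html, "html.parser")
    selected = []
    for row in soup.select("tr.athing"):
        # First anchor with an href inside the title span.
        link = row.select_one(".titleline a[href]")
        # Assumption: the score span's id is "score_" + the row's id.
        score = soup.find("span", id="score_" + row.get("id", ""))
        if link is None or score is None:
            continue  # no link or no score for this row, so skip it
        votes = int(score.get_text().split()[0])
        if votes > min_votes:
            selected.append({"title": link.get_text(), "link": link["href"], "votes": votes})
    return selected


for story in select_stories(SAMPLE_HTML, 40):
    print(story)
```

To run it against the live page you would pass requests.get("https://news.ycombinator.com/").text as the html argument. Taking the title from the anchor rather than the whole span also leaves out the "(site)" text that getText() on the span picks up.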