web scraping problem

jacksfrustration · May-30-2024, 12:07 PM

I'm trying to build a web scraper using BS4. I want to filter news articles and get only the href links of the articles that have more than a given number of votes. For example, my code looks something like this:

import requests
from bs4 import BeautifulSoup


res = requests.get("https://news.ycombinator.com/")
soup = BeautifulSoup(res.text,"html.parser")
titles=soup.select(".titleline")
votes=soup.select(".score")
selected_titles=[]
amount_of_votes=int(input("How many votes should the articles have?\n"))



for index,vote in enumerate(votes):
    if int(vote.text.split()[0]) > amount_of_votes:
        href=titles[index].get('href')
        print(href)
        selected_titles.append({"title":titles[index].getText(),"link":href})

I am printing the href inside the condition to check whether it is working. My problem is that every href value comes back as None. I am doing a bootcamp, and the instructor's code was:

        href=titles[index].get('href',None)

I have tried both the first way I posted and the second way, but all I get for the href values is None. I only want the URLs of articles that have more than a given number of votes. Please help.

Pedroski55 · May-30-2024, 04:22 PM

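The problem, I think, is this line:

href=titles[index].get('href')

Each element in titles is the span with class titleline, and the href sits on the a tag inside it. Try looking up that a tag first: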

hres = titles[index].find('a', href=True)
print(hres['href'])
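
To see why the original lookup gives None, here is a minimal sketch on a hand-written snippet. The HTML string and the example.com URL are made up for illustration and only imitate the shape of a titleline span. Calling .get('href') on the span returns None because the span has no such attribute, and since None is already the default, passing it explicitly as in .get('href', None) changes nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating one title cell; not copied from the real page.
html = '<span class="titleline"><a href="https://example.com/story">Example story</a></span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.select_one(".titleline")

# The span itself has no href attribute, so .get() falls back to None.
span_href = span.get("href")
print(span_href)

# The attribute lives on the <a> tag nested inside the span.
link = span.find("a", href=True)
link_href = link["href"]
print(link_href)
```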

Altogether now:

import requests
from bs4 import BeautifulSoup

amount_of_votes = int(input("How many votes should the articles have?\n"))


def myApp(aov):
    """Return title/link dicts for stories with more than aov votes."""
    res = requests.get("https://news.ycombinator.com/")
    soup = BeautifulSoup(res.text, "html.parser")
    titles = soup.find_all("span", class_="titleline")
    votes = soup.find_all("span", class_="score")
    selected_titles = []
    # Assumes titles and votes line up one-to-one by position.
    for index, vote in enumerate(votes):
        # vote.text looks like "123 points"; keep only the number
        resv = vote.text.split()[0]
        if int(resv) > aov:
            # The href is on the <a> inside the span, not on the span itself
            hres = titles[index].find("a", href=True)
            print(hres["href"])
            selected_titles.append({"title": titles[index].getText(), "link": hres["href"]})

    return selected_titles


result = myApp(amount_of_votes)
for r in result:
    print(r)
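
One caveat with matching titles and votes by index: if any entry on the page has no score (as far as I know, job posts are listed that way), the two lists get out of step and a score can end up paired with the wrong title. A possible way around it is to walk the page row by row. The sketch below is only that, a sketch: it assumes each story is a tr with class athing and that its score, if any, is in the next tr, and it runs on a made-up SAMPLE_HTML rather than the live page. The function name links_above is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the assumed row layout: a title row followed by
# a metadata row. The second story has no score on purpose.
SAMPLE_HTML = """
<table>
  <tr class="athing" id="1"><td><span class="titleline"><a href="https://example.com/a">Story A</a></span></td></tr>
  <tr><td class="subtext"><span class="score" id="score_1">150 points</span></td></tr>
  <tr class="athing" id="2"><td><span class="titleline"><a href="https://example.com/job">Hiring</a></span></td></tr>
  <tr><td class="subtext"><span class="age">1 hour ago</span></td></tr>
  <tr class="athing" id="3"><td><span class="titleline"><a href="https://example.com/c">Story C</a></span></td></tr>
  <tr><td class="subtext"><span class="score" id="score_3">40 points</span></td></tr>
</table>
"""


def links_above(html, min_votes):
    """Return title/link dicts for stories with more than min_votes votes."""
    soup = BeautifulSoup(html, "html.parser")
    selected = []
    for row in soup.select("tr.athing"):
        # The story link is the <a> directly inside the titleline span.
        link = row.select_one(".titleline > a")
        # The score, if present, is assumed to be in the following row.
        meta = row.find_next_sibling("tr")
        score = meta.select_one(".score") if meta else None
        if link is None or score is None:
            continue  # skip entries without a link or without a score
        votes = int(score.get_text().split()[0])
        if votes > min_votes:
            selected.append({"title": link.get_text(), "link": link["href"]})
    return selected


for item in links_above(SAMPLE_HTML, 100):
    print(item)
```

To try it against the real page, you could pass res.text from the requests call into links_above, but check the selectors against the current markup first, since the layout assumed here may not match.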
