Web Scraping Project

dyzl3xik · Apr-28-2019, 02:48 AM

Hello,

I am pretty new to Python, and for our final semester project we are to develop a web scraper program. We are to go into the iTunes charts and pull the top 100 for the user-selected category. Once the information is pulled, the rankings are to be stored in a txt file. The user then inputs which ranking they want more information on, and the program pulls that data from the txt file and displays it. The program is to continue until the user wants to exit.

I am able to pull the chart information and write the txt file. However, when I enter a ranking that I want more details on, I am having some issues. I can only pull the data for the first 67 entries in the txt file; any number above that seems to crash the kernel. I am stuck and can't seem to figure out what I am doing wrong. I would appreciate any insight as to what I did wrong in the code I have included below.

Thanks.

import sys
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

#this is a collection of the nine urls needed to access the charts. Since only one category can be accessed at once,
#I assumed the info from the category selected would be the only one written into the text file.

song_url ='https://www.apple.com/itunes/charts/songs'
albums_url ='https://www.apple.com/itunes/charts/albums/'
fapps_url ='https://www.apple.com/itunes/charts/free-apps/'
papps_url = 'https://www.apple.com/itunes/charts/paid-apps/'
tapps_url = 'https://www.apple.com/itunes/charts/top-grossing-apps/'
books_url = 'https://www.apple.com/itunes/charts/paid-books/'
movies_url = 'https://www.apple.com/itunes/charts/movies/'
shows_url = 'https://www.apple.com/itunes/charts/tv-shows/'
videos_url = 'https://www.apple.com/itunes/charts/music-videos/'

#User instructions for program

def user_inst():
    print('This is a program that scrapes data from https://www.apple.com/itunes/charts for a class project.\n')
    print('What category would you like to know more about?\n')
    print(' (1) - Songs')
    print(' (2) - Albums')
    print(' (3) - Free Apps')
    print(' (4) - Paid Apps')
    print(' (5) - Top Grossing Apps')
    print(' (6) - Books')
    print(' (7) - Movies')
    print(' (8) - TV Shows')
    print(' (9) - Music Videos' )
    print(' (10) - Exit Program')

#asks user for category selection and error checks input

def user_select():
    while True:
        try:
            selection = int(input('What category would you like to know more about?\n'))
        except ValueError:
            print('Please enter a valid number')
            continue  # selection was never assigned, so ask again
        if 1 <= selection <= 10:
            return selection
        else:
            print('Please select a valid number')

def url_select(selection):
    if selection == 1:
        return song_url
    elif selection == 2:
        return albums_url
    elif selection == 3:
        return fapps_url
    elif selection == 4:
        return papps_url
    elif selection == 5:
        return tapps_url
    elif selection == 6:
        return books_url
    elif selection == 7:
        return movies_url
    elif selection == 8:
        return shows_url
    else:
        return videos_url
        

#Attempts to make a connection to the iTunes site

def make_connection(url):
    try:
        uClient = uReq(url)
        html = uClient.read()
        uClient.close()
        return html
    except:
        print('Could not connect to the site.\n')
        sys.exit(1)

#uses BeautifulSoup library to clean up HTML and make it more manageable.

def make_soup(html):
    soup_bowl = soup(html, 'html.parser')
    match = soup_bowl.find_all('div', class_='section-content')
    content = match[1].ul
    return content

def file_create(pick):
    while pick != 10:
        url = url_select(pick)
        page_html = make_connection(url)
        page_soup = make_soup(page_html)
        filename = 'rankings.txt'
        cat_list=[]
        with open(filename, 'w', encoding = 'utf-8') as f:
            for li in page_soup.findAll('li'):
                for strong in li.findAll('strong'):
                    rank = strong.text
                    rank = rank[:-1]
                for h3 in li.findAll('h3'):
                    title = h3.text
                for h4 in li.findAll('h4'):
                    artist = h4.text
                cat_list.append(rank)
                cat_list.append(title)   # append keeps the title as one item; extend would add it character by character
                cat_list.append(artist)
                f.write(rank + ','+ title + ',' + artist + '\n')
            f.close()
            print('Rankings saved in file named rankings.txt.\n')
            return
    if pick == 10:
        print('Goodbye!')
        quit()


#asks user for rank they would like more information on and error checks the input

def user_rank():
    while True:
        try:
            selection = int(input('What ranking would you like to know more about?\n'))
        except ValueError:
            print('Please enter a valid number')
            continue  # selection was never assigned, so ask again
        if 1 <= selection <= 100:
            return selection
        else:
            print('Please select a valid number')

def main():
    user_inst()
    choice = user_select()
    file_create(choice)
    u_rank = user_rank()
    with open('ratings.txt','r') as fhand:
        for line in fhand:
            if str(u_rank) in line:
            #splitlines = line.split(',')
            #print(splitlines)
            #if 
                splitlines = line.split(',')
                #u_rank == int(splitlines[0]):
                u_title = splitlines[1]
                u_artist = splitlines[2]
                print('The information for rank ', u_rank, 'is ', u_title,'-', u_artist)
            #print(splitlines)
                break
    return

main()

**Larz60+** · (This post was last modified: Apr-28-2019, 04:40 AM by Larz60+.)

requests is easier to use than urllib.request.
Install it with pip install requests.

# instead of 
from urllib.request import urlopen as uReq
# use
import requests

# instead of
def make_connection(url):
    try:
        uClient = uReq(url)
        html = uClient.read()
        uClient.close()
        return html
    except:
        print('Could not connect to the site.\n')
        sys.exit(1)
# use
def make_connection(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    else:
        print('Could not connect to the site.\n')
        sys.exit(1)

***ichabod801*** · Apr-28-2019, 11:46 AM

I don't see anything obviously wrong in the code you posted. Are you getting an error? If so, please post the full text of the error. Have you checked the text file to make sure the lines for the later rankings are getting written correctly?

I wonder about line 138: if str(u_rank) in line:. The rank is just a number, which could show up in other fields besides the rank field. From the commented out line 143, it seems rank is the first field. I would probably split the line by commas first, and then check the actual rank. Alternatively, you could do line.startswith('{},'.format(u_rank)).
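A minimal sketch of the split-then-compare idea, assuming each line of the file has the form rank,title,artist as written by file_create. The name find_rank is hypothetical, and the first sample row is made up to show a title that happens to contain the number being searched for. It also assumes the artist field contains no commas.

```python
def find_rank(lines, wanted):
    """Return (title, artist) for the line whose rank field equals wanted, else None."""
    for line in lines:
        # Take everything before the first comma as the rank field.
        rank, _, rest = line.rstrip('\n').partition(',')
        if rank.strip() == str(wanted):
            # Split on the last comma so commas inside the title stay in the title.
            title, _, artist = rest.rpartition(',')
            return title, artist
    return None


# Made-up sample: the first row has "67" in its title but is rank 6.
sample = ['6,Title 67,Artist A\n', '67,Oye Mujer,Raymix\n']

hit = find_rank(sample, 67)
miss = find_rank(sample, 99)
print(hit)
print(miss)
```

With a plain substring test such as str(u_rank) in line, the first sample row would match a search for 67 even though its rank is 6. Comparing only the first field avoids that.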

dyzl3xik · (This post was last modified: Apr-28-2019, 01:10 PM by dyzl3xik.)

I do not receive an error message. The output below represents a successful run. The only difference is when I enter a number above 67, I do not get the last line and the kernel just stops working (I am running the code in Jupyter Notebook). I initially thought there may be an issue in the way that the txt file is being created and overwritten in later trials. I have tried including a print statement to see each line in the txt file and I only get values up to line 67.

Output: This is a program that scrapes data from https://www.apple.com/itunes/charts for a class project.

What category would you like to know more about?

 (1) - Songs
 (2) - Albums
 (3) - Free Apps
 (4) - Paid Apps
 (5) - Top Grossing Apps
 (6) - Books
 (7) - Movies
 (8) - TV Shows
 (9) - Music Videos
 (10) - Exit Program
What category would you like to know more about?
7
Rankings saved in file named rankings.txt.

What ranking would you like to know more about?
67
The information for rank  67 is  Oye Mujer - Raymix

***ichabod801*** · Apr-28-2019, 01:21 PM

Have you actually opened the text file and looked at it? It sounds like you are only getting 67 results. If you only got 67 and tried to get info on #86, the program would just stop with no other output. That would be normal behavior.
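One quick way to check is to count the lines in the file the program actually reads. This is only a sketch: count_entries is a hypothetical helper, and the demonstration writes 67 made-up rows to a temporary file rather than touching your real data.

```python
import os
import tempfile


def count_entries(path):
    """Count the non-empty lines in a text file."""
    with open(path, 'r', encoding='utf-8') as fhand:
        return sum(1 for line in fhand if line.strip())


# Demonstration on a throwaway file holding 67 made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        for n in range(1, 68):
            f.write('{},Title {},Artist {}\n'.format(n, n, n))
    demo_count = count_entries(demo_path)

print('entries found:', demo_count)
```

If the count for the file you open for reading is lower than the rank you ask for, the loop in main() finishes without printing anything.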

dyzl3xik · Apr-28-2019, 06:01 PM

I checked the rankings txt file and there were 100 entries. I went back and checked my code again and I found the issue. In the function I created to write the txt file, I call the file 'rankings.txt'. However, when I went to read the file, I was opening a file named 'ratings.txt', which only had 67 entries. I changed the code and I was able to see the data for entry number 100 with no issues. I had a feeling it had something to do with the txt file; I just couldn't pinpoint what it was until now. Thanks for all the help.

filename = 'rankings.txt'

    with open('ratings.txt','r') as fhand:
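One way to avoid this kind of mismatch is to define the file name once and use that constant for both writing and reading. The sketch below is an assumption about how that could look, not the code from the project: RANKINGS_FILE, write_rankings and read_rankings are hypothetical names, and the rows are made up.

```python
import os
import tempfile

# Single place where the file name is spelled out.
RANKINGS_FILE = 'rankings.txt'


def write_rankings(rows, path=RANKINGS_FILE):
    """Write (rank, title, artist) tuples as comma-separated lines."""
    with open(path, 'w', encoding='utf-8') as f:
        for rank, title, artist in rows:
            f.write('{},{},{}\n'.format(rank, title, artist))


def read_rankings(path=RANKINGS_FILE):
    """Return the lines of the rankings file without trailing newlines."""
    with open(path, 'r', encoding='utf-8') as fhand:
        return [line.rstrip('\n') for line in fhand]


# Demonstration in a temporary directory with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, RANKINGS_FILE)
    write_rankings([(1, 'Title 1', 'Artist 1'), (2, 'Title 2', 'Artist 2')], demo_path)
    round_trip = read_rankings(demo_path)

print(round_trip)
```

Because both functions default to the same constant, a typo in the name would break writing and reading together instead of silently opening an older file.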
