Python Forum

Full Version: Cannot get contents from ul.li.span.string
Hi,
I'd like to scrape string contents from a website, but it isn't working and I don't know how to fix it. Any ideas would be much appreciated.
Here are the details:
Website: https://voteview.com/rollcall/RH1030237
Goal: scrape every Congressman's "name", "state", and "vote" from the last section of the page: Votes (Sort by Party, State, Vote, Ideology, Vote Probability)
Here is my code:
import requests
import bs4
from bs4 import BeautifulSoup
def getHTMLText(url):
    try:
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
def fillPList(plist, html):
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.find('ul'):
        if isinstance(li, bs4.element.Tag):
            spans = li('span')
            plist.append([spans[0].string, spans[1].string, spans[2].string])
def printPList(plist, num):
    print("{:^10}\t{:^10}\t{:^10}".format("name", "state", "vote"))
    for i in range(num):
        p = plist[i]
        print("{:^10}\t{:^10}\t{:^10}".format(p[0], p[1], p[2]))
def main():
    pinfo = []
    url = 'https://voteview.com/rollcall/RH1030237'
    html = getHTMLText(url)
    fillPList(pinfo, html)
    printPList(pinfo, 435)
    with open(r'D:\KKKKKK\103_hr1876.csv', 'a', encoding='utf-8') as f:
        f.write("{},{},{}\n".format("name", "state", "vote"))
main()
Here is error I got:
Error:
Traceback (most recent call last):
  File "D:/AA_Software/Pycharm/PycharmProjects/untitled/voteview.py", line 30, in <module>
    main()
  File "D:/AA_Software/Pycharm/PycharmProjects/untitled/voteview.py", line 26, in main
    fillPList(pinfo, html)
  File "D:/AA_Software/Pycharm/PycharmProjects/untitled/voteview.py", line 16, in fillPList
    plist.append([spans[0].string, spans[1].string, spans[2].string])
IndexError: list index out of range
I have done some research on this error; apparently the list is empty, so indexing it raises an error. But the content is there when I view the website's source code.
This is my first post here; sorry if it is confusing, and please tell me how I can improve it. Many thanks!
This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information.
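A quick way to confirm this: the vote list lives in a `<ul class="voteTable columns4">` element that the site's JavaScript builds, so that class name never appears in the HTML that `requests` receives. A minimal sketch of the check (the `has_vote_table` helper and the sample HTML strings are made up for illustration; in practice you would pass `r.text` from `requests.get` to it):

```python
def has_vote_table(html: str) -> bool:
    """Return True if the JS-rendered vote table is present in the HTML."""
    return 'voteTable' in html

# What a plain requests fetch returns: the container exists, but the
# vote table has not been built yet because no JavaScript has run.
raw_html = '<html><body><div id="vote-container"></div></body></html>'

# What a real browser sees after the page's JavaScript has executed.
rendered_html = '<html><body><ul class="voteTable columns4"><li>...</li></ul></body></html>'

print(has_vote_table(raw_html))       # False
print(has_vote_table(rendered_html))  # True
```

If the check returns False on `r.text`, no amount of BeautifulSoup selectors will find the list, which is exactly why the `spans` lookup raised IndexError.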
(Nov-15-2019, 07:28 AM)Larz60+ Wrote: This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information.
Many thanks, Larz! I understand the problem with my code now. I will learn Selenium; thank you again!
I'd like to suggest the tutorials by snippsat:
Part 1
Part 2
I took a stab at it; I redirected the display to a text file, which I've attached.
I recommend saving the dictionary that's built (cdict) to a JSON file; then you can use it as input to other scripts.
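Saving and reloading cdict with the stdlib `json` module could look like this minimal sketch (the filename and the sample entry are made up for illustration):

```python
import json

# cdict as built by getCongressmenInfo(): one node per congressman.
# This sample entry is made up for illustration.
cdict = {
    'ABERCROMBIE': {
        'Party': 'Democratic Party',
        'link': 'https://voteview.com/person/15245/neil-abercrombie',
        'Name': 'ABERCROMBIE',
        'State': '(HI)',
        'Vote': 'N',
    },
}

# Save the dictionary so other scripts can reuse it.
with open('congressmen.json', 'w', encoding='utf-8') as fp:
    json.dump(cdict, fp, indent=2)

# Load it back in another script.
with open('congressmen.json', 'r', encoding='utf-8') as fp:
    loaded = json.load(fp)

print(loaded['ABERCROMBIE']['Vote'])  # N
```

One caveat: in the code below, Name, State, and Vote are added as set literals (`{...}`), and sets are not JSON-serializable, so you would want to store plain strings (as in this sketch) before dumping.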

code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from pathlib import Path
from bs4 import BeautifulSoup
import os


def add_node(parent, nodename):
    node = parent[nodename] = {}
    return node

def add_cell(nodename, cellname, value):
    cell =  nodename[cellname] = value
    return cell

def display_dict(dictname, level=0):
    indent = " " * (4 * level)
    for key, value in dictname.items():
        if isinstance(value, dict):
            print(f'\n{indent}{key}')
            level += 1
            display_dict(value, level)
        else:
            print(f'{indent}{key}: {value}')
        if level > 0:
            level -= 1

def getCongressmenInfo(url):
    baseurl = 'https://voteview.com'
    cdict = {}

    savefile = Path('.') / 'soupsave.dat'
    if not savefile.exists():
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        browser = webdriver.Firefox(capabilities=caps)
        browser.get(url)
        time.sleep(2)
        page = browser.page_source
        with savefile.open('w') as fp:
            fp.write(page)
        browser.close()
    else:
        with savefile.open('r') as fp:
            page = fp.read()

    soup = BeautifulSoup(page, 'lxml')
    ul = soup.find('ul', {'class': 'voteTable columns4'})

    lis = ul.find_all('li')
    congNo = 2  # CSS nth-child index; the first li is the party designator, so voters start at 2

    party = None
    try:
        for n, li in enumerate(lis):
            # if no link in li, it must be party designator
            if not li.a:
                party = li.text.strip()
                continue

            # create new node for congressman
            cnode = add_node(cdict, li.a.text.strip())
            add_cell(cnode, 'Party', party)

            link = f"{baseurl}{li.a.get('href')}"
            add_cell(cnode, 'link', link)

            add_cell(cnode, 'Name', {li.a.text.strip()})

            state_element = soup.select(f"li.voter:nth-child({congNo}) > span:nth-child(1) > span:nth-child(2)")[0]
            add_cell(cnode, 'State', {state_element.text.strip()})

            vote_element = soup.select(f"li.voter:nth-child({congNo}) > span:nth-child(2)")[0]
            add_cell(cnode, 'Vote', {vote_element.text.strip()})

            congNo += 1
    except IndexError:
        # cheat to use index error for end of list
        display_dict(cdict)

    # suggest saving cdict as a json file here

def main():
    # make sure we're in the script directory
    os.chdir(os.path.abspath(os.path.dirname(__file__)))

    url = 'https://voteview.com/rollcall/RH1030237'
    getCongressmenInfo(url)


if __name__ == '__main__':
    main()
Partial results (full text file attached, again I recommend saving as JSON)
Output:
ABERCROMBIE
    Party: Democratic Party
    link: https://voteview.com//person/15245/neil-abercrombie
    Name: {'ABERCROMBIE'}
    State: {'(HI)'}
    Vote: {'N'}

ACKERMAN
    Party: Democratic Party
    link: https://voteview.com//person/15000/gary-leonard-ackerman
    Name: {'ACKERMAN'}
    State: {'(NY)'}
    Vote: {'Y'}

ANDREWS, Michael
    Party: Democratic Party
    link: https://voteview.com//person/15001/michael-allen-andrews
    Name: {'ANDREWS, Michael'}
    State: {'(TX)'}
    Vote: {'Y'}

ANDREWS, Robert
    Party: Democratic Party
    link: https://voteview.com//person/29132/robert-ernest-andrews
    Name: {'ANDREWS, Robert'}
    State: {'(NJ)'}
    Vote: {'N'}

ANDREWS, Thomas
    Party: Democratic Party
    link: https://voteview.com//person/29121/thomas-hiram-andrews
    Name: {'ANDREWS, Thomas'}
    State: {'(ME)'}
    Vote: {'N'}

APPLEGATE
    Party: Democratic Party
    link: https://voteview.com//person/14402/douglas-earl-applegate
    Name: {'APPLEGATE'}
    State: {'(OH)'}
    Vote: {'N'}
(Nov-15-2019, 07:28 AM)Larz60+ Wrote: This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information.

Thank you for your advice. I was looking for a solution to this issue, too.
(Nov-15-2019, 07:28 AM)Larz60+ Wrote: This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information.

Totally agree here. I tried to think of some other solution that wouldn't require knowledge of Selenium, but nothing came to mind.
Quote:Totally agree here.
I also have an example in the post above.
(Nov-18-2019, 11:00 AM)alekson Wrote: I took a stab at it; I redirected the display to a text file, which I've attached.
I recommend saving the dictionary that's built (cdict) to a JSON file; then you can use it as input to other scripts.
Wow! Many thanks, Larz!! You are very kind; I will study your code carefully! Thanks again!(๑•̀ㅂ•́)و✧