Python Forum
Cannot get contents from ul.li.span.string
#1
Hi,
I'd like to scrape string contents from a website, but it isn't working and I don't know how to solve it. Any ideas would be much appreciated.
Here are the details:
Website: https://voteview.com/rollcall/RH1030237
Scrape: each Congressman's "name", "state", and "vote" from the last part of the page: Votes (Sort by Party, State, Vote, Ideology, Vote Probability)
Here is my code:
import requests
import bs4
from bs4 import BeautifulSoup
def getHTMLText(url):
    try:
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
def fillPList(plist, html):
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.find('ul'):
        if isinstance(li, bs4.element.Tag):
            spans = li('span')
            plist.append([spans[0].string, spans[1].string, spans[2].string])
def printPList(plist, num):
    print("{:^10}\t{:^10}\t{:^10}".format("name", "state", "vote"))
    for i in range(num):
        p = plist[i]
        print("{:^10}\t{:^10}\t{:^10}".format(p[0], p[1], p[2]))
def main():
    pinfo = []
    url = 'https://voteview.com/rollcall/RH1030237'
    html = getHTMLText(url)
    fillPList(pinfo, html)
    printPList(pinfo, 435)
    with open(r'D:\KKKKKK\103_hr1876.csv', 'a', encoding='utf-8') as f:
        f.write("{},{},{}\n".format("name", "state", "vote"))
main()
Here is error I got:
Error:
Traceback (most recent call last):
  File "D:/AA_Software/Pycharm/PycharmProjects/untitled/voteview.py", line 30, in <module>
    main()
  File "D:/AA_Software/Pycharm/PycharmProjects/untitled/voteview.py", line 26, in main
    fillPList(pinfo, html)
  File "D:/AA_Software/Pycharm/PycharmProjects/untitled/voteview.py", line 16, in fillPList
    plist.append([spans[0].string, spans[1].string, spans[2].string])
IndexError: list index out of range
I have done some research on this error: it is said to happen when the list is empty, so indexing it raises an exception. But the contents do appear in the website's source code.
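For what it's worth, the IndexError can be reproduced offline: iterating over soup.find('ul') yields every child of the first ul on the page, and if those li items hold no span tags, spans[0] blows up. A minimal sketch (the HTML below is made up for illustration; on the real site the first ul in the static source is likely a navigation list, since the vote table is rendered later by JavaScript):

```python
import bs4
from bs4 import BeautifulSoup

# Made-up HTML standing in for the static (pre-JavaScript) page source:
# the first <ul> is a navigation list whose <li> items contain no <span>s.
html = "<ul><li><a href='/'>Home</a></li><li><a href='/about'>About</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")

for li in soup.find('ul'):
    if isinstance(li, bs4.element.Tag):
        spans = li('span')   # empty list here, so spans[0] would raise IndexError
        print(len(spans))    # prints 0 for every li
```

So the error message is accurate: the spans list really is empty, even though the data is visible in the rendered page.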
This is my first post here, sorry if this post confuses you and please tell me to improve, many thanks!
#2
This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information
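A minimal sketch of that approach, kept parse-only so the Selenium fetch is separate (the 'voteTable' class name is taken from the rendered DOM as used in post #5 below; treat it as an assumption and confirm against the live page):

```python
from bs4 import BeautifulSoup

def extract_votes(html):
    """Pull (name, state, vote) triples out of the *rendered* page source."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.select("ul.voteTable li"):
        spans = li.find_all("span")
        if len(spans) >= 3:           # skip party-header rows with no spans
            rows.append((spans[0].get_text(strip=True),
                         spans[1].get_text(strip=True),
                         spans[2].get_text(strip=True)))
    return rows

# Getting the rendered source needs a real browser, e.g.:
#   from selenium import webdriver
#   import time
#   browser = webdriver.Firefox()
#   browser.get("https://voteview.com/rollcall/RH1030237")
#   time.sleep(2)                     # crude wait for the JS to finish
#   rows = extract_votes(browser.page_source)
#   browser.quit()
```

Splitting fetch from parse also lets you test the parser on saved HTML without launching a browser each run.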
#3
(Nov-15-2019, 07:28 AM)Larz60+ Wrote: This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information
Many thanks Larz! I know the problem with my code now. Will learn selenium, thank you again!
#4
I'd like to suggest the tutorial by snippsat:
Part 1
Part 2
#5
I took a stab at it; I redirected the display to a text file, which I've attached.
I recommend creating a JSON file to save the dictionary that's built (cdict); then you can use that as input to other scripts.
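A sketch of that save/load step (the filename is just an example; note that the Name/State/Vote values in the code below are built as one-element sets, and json.dump cannot serialize sets, so they would need converting to plain strings first):

```python
import json

# Toy stand-in for the scraped dictionary; real values must be plain
# strings (not sets) for json.dump to accept them
cdict = {
    "ABERCROMBIE": {"Party": "Democratic Party", "State": "(HI)", "Vote": "N"},
}

# Save the scraped dictionary so other scripts can reuse it without re-scraping
with open("congress_votes.json", "w", encoding="utf-8") as fp:
    json.dump(cdict, fp, indent=4)

# ...and later, in another script, load it back
with open("congress_votes.json", "r", encoding="utf-8") as fp:
    restored = json.load(fp)
```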

code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from pathlib import Path
import bs4
from bs4 import BeautifulSoup
import os


def add_node(parent, nodename):
    node = parent[nodename] = {}
    return node

def add_cell(nodename, cellname, value):
    cell =  nodename[cellname] = value
    return cell

def display_dict(dictname, level=0):
    indent = " " * (4 * level)
    for key, value in dictname.items():
        if isinstance(value, dict):
            print(f'\n{indent}{key}')
            level += 1
            display_dict(value, level)
        else:
            print(f'{indent}{key}: {value}')
        if level > 0:
            level -= 1

def getCongressmenInfo(url):
    baseurl = 'https://voteview.com'
    cdict = {}

    savefile = Path('.') / 'soupsave.dat'
    if not savefile.exists():
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        browser = webdriver.Firefox(capabilities=caps)
        browser.get(url)
        time.sleep(2)
        page = browser.page_source
        with savefile.open('w') as fp:
            fp.write(page)
        browser.close()
    else:
        with savefile.open('r') as fp:
            page = fp.read()

    soup = BeautifulSoup(page, 'lxml')
    ul = soup.find('ul', {'class': 'voteTable columns4'})

    lis = ul.find_all('li')
    congNo = 2

    party = None
    try:
        for n, li in enumerate(lis):
            # if no link in li, it must be party designator
            if not li.a:
                party = li.text.strip()
                continue

            # create new node for congressman
            cnode = add_node(cdict, li.a.text.strip())
            add_cell(cnode, 'Party', party)

            link = f"{baseurl}{li.a.get('href')}"
            # Need to encode for json -- don't forget to decode when loading json file
            add_cell(cnode, 'link', os.fspath(link))

            # note: the braces here (and for State/Vote below) build one-element
            # sets, which is why the output shows values like {'ABERCROMBIE'}
            add_cell(cnode, 'Name', {li.a.text.strip()})

            state_element = soup.select(f"li.voter:nth-child({congNo}) > span:nth-child(1) > span:nth-child(2)")[0]
            add_cell(cnode, 'State', {state_element.text.strip()})

            vote_element = soup.select(f"li.voter:nth-child({congNo}) > span:nth-child(2)")[0]
            add_cell(cnode, 'Vote', {vote_element.text.strip()})

            congNo += 1
    except IndexError:
        # cheat to use index error for end of list
        display_dict(cdict)

    # suggest saving cdict as a json file here

def main():
    # make sure we're in the script directory
    os.chdir(os.path.abspath(os.path.dirname(__file__)))

    url = 'https://voteview.com/rollcall/RH1030237'
    getCongressmenInfo(url)


if __name__ == '__main__':
    main()
Partial results (full text file attached, again I recommend saving as JSON)
Output:
ABERCROMBIE
    Party: Democratic Party
    link: https://voteview.com//person/15245/neil-abercrombie
    Name: {'ABERCROMBIE'}
    State: {'(HI)'}
    Vote: {'N'}

ACKERMAN
    Party: Democratic Party
    link: https://voteview.com//person/15000/gary-leonard-ackerman
    Name: {'ACKERMAN'}
    State: {'(NY)'}
    Vote: {'Y'}

ANDREWS, Michael
    Party: Democratic Party
    link: https://voteview.com//person/15001/michael-allen-andrews
    Name: {'ANDREWS, Michael'}
    State: {'(TX)'}
    Vote: {'Y'}

ANDREWS, Robert
    Party: Democratic Party
    link: https://voteview.com//person/29132/robert-ernest-andrews
    Name: {'ANDREWS, Robert'}
    State: {'(NJ)'}
    Vote: {'N'}

ANDREWS, Thomas
    Party: Democratic Party
    link: https://voteview.com//person/29121/thomas-hiram-andrews
    Name: {'ANDREWS, Thomas'}
    State: {'(ME)'}
    Vote: {'N'}

APPLEGATE
    Party: Democratic Party
    link: https://voteview.com//person/14402/douglas-earl-applegate
    Name: {'APPLEGATE'}
    State: {'(OH)'}
    Vote: {'N'}

Attached Files

.txt   CongressionalVotes.txt (Size: 40.41 KB / Downloads: 201)
#6
(Nov-15-2019, 07:28 AM)Larz60+ Wrote: This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information

Thank you for your advice. I was looking for a solution to this issue, too.
#7
(Nov-15-2019, 07:28 AM)Larz60+ Wrote: This page uses JavaScript, and the part of the page that contains all of the congressmen is not visible until the JavaScript has been executed.
You will need to use Selenium to scrape the information

Totally agree here. I tried to think of some other solution that wouldn't require knowledge of Selenium, but nothing came to mind. Confused
#8
Quote: Totally agree here.
I also have an example in the post above.
#9
(Nov-18-2019, 11:00 AM)alekson Wrote: I took a stab at it; I redirected the display to a text file, which I've attached.
I recommend creating a JSON file to save the dictionary that's built (cdict); then you can use that as input to other scripts.
Wow! Many thanks, Larz!! You are very kind; I will study your code carefully. Thanks again! (๑•̀ㅂ•́)و✧