Python Forum

Full Version: demo not running for me - getting keyError
Hi All,

I am trying to adapt a program to scrape data from a cisco phone web page. I found some code at https://srome.github.io/Parsing-HTML-Tab...nd-pandas/
that I felt would get me pretty close to what I needed.
However, when I run the code below, I get the errors shown after it. I'm thinking that perhaps I'm running a different version of Python or of one of the imports, and the syntax has changed. I'm new to Python - is this a common problem, or can someone else spot another error? I'm using the same URL used in the demo at the link above, so I would expect the same results.

I want to get the demo running before I start modifying it much, but haven't been able to reach that point.

thanks for your help.

#from html.parser import HTMLParser
#from html.entities import name2codepoint

import pandas as pd
from bs4 import BeautifulSoup
import requests
   
url = "https://www.fantasypros.com/nfl/reports/leaders/qb.php?year=2015" 
# response = requests.get(url2)
# response.text[:100] # Access the HTML with the text property
# print(response.text[:100])

class HTMLTableParser:
       
    def parse_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        tables = soup.find_all("table")

        for table in tables:
            if table.find_parent("table") is None:
                test = str(table)

        return [(table['id'],self.parse_html_table(table))\
                for table in tables]  
#                for table in soup.find_All('table')]  
    
    def parse_html_table(self, table):
            n_columns = 0
            n_rows = 0
            column_names = []

            # Find the number of rows and columns;
            # we also find the column titles if we can
            for row in table.find_all('tr'):
                
                # Determine the number of rows in the table
                td_tags = row.find_all('td')
                if len(td_tags) > 0:
                    n_rows+=1
                    if n_columns == 0:
                        # Set the number of columns for our table
                        n_columns = len(td_tags)
                        
                # Handle column names if we find them
                th_tags = row.find_all('th') 
                if len(th_tags) > 0 and len(column_names) == 0:
                    for th in th_tags:
                        column_names.append(th.get_text())
    
            # Safeguard on Column Titles
            if len(column_names) > 0 and len(column_names) != n_columns:
                raise Exception("Column titles do not match the number of columns")
    
            columns = column_names if len(column_names) > 0 else range(0,n_columns)
            df = pd.DataFrame(columns = columns,
                              index= range(0,n_rows))
            row_marker = 0
            for row in table.find_all('tr'):
                column_marker = 0
                columns = row.find_all('td')
                for column in columns:
                    df.iat[row_marker,column_marker] = column.get_text()
                    column_marker += 1
                if len(columns) > 0:
                    row_marker += 1
                    
            # Convert to float if possible
            for col in df:
                try:
                    df[col] = df[col].astype(float)
                except ValueError:
                    pass
            
            return df
 

hp = HTMLTableParser()
#table = hp.parse_url(url)[0][1] # Grabbing the table from the tuple
table = hp.parse_url(url)[0][1]
#table = hp.parse_html_table(htmstring)
table.head()
This results in the following errors:
Error:
Traceback (most recent call last):
  File "C:\Users\t01136.POS\eclipsePython-workspace\HTMParse\HTMParse.py", line 80, in <module>
    table = hp.parse_url(url)[0][1]
  File "C:\Users\t01136.POS\eclipsePython-workspace\HTMParse\HTMParse.py", line 25, in parse_url
    for table in tables]
  File "C:\Users\t01136.POS\eclipsePython-workspace\HTMParse\HTMParse.py", line 25, in <listcomp>
    for table in tables]
  File "C:\Users\t01136.POS\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\element.py", line 1321, in __getitem__
    return self.attrs[key]
KeyError: 'id'
PS: I'm running in Eclipse, with Python 3.6.7.
You will get more out of a step-by-step tutorial.
See these on this website:
Web scraping part 1
Web scraping part 2
I am not reading the code, just the error message: the problem is on row 25 (the list comprehension), and the problem itself is that there is no key named 'id'.
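A minimal sketch of guarding that list comprehension so tables without an 'id' attribute are skipped instead of raising KeyError (the sample HTML here just stands in for the real page; Tag.get() returns None rather than raising):

```python
from bs4 import BeautifulSoup

html = """
<table id="data"><tr><td>1</td></tr></table>
<table><tr><td>2</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only tables that actually carry an id attribute;
# t.get('id') / t.has_attr('id') never raise KeyError
tables = [(t.get("id"), t) for t in soup.find_all("table") if t.has_attr("id")]
print([tid for tid, _ in tables])  # only 'data' survives
```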
Evidently the author has not run their code in a long time. That tutorial was written almost 4 years ago, and the HTML has surely changed by now, which could break the code. You can inspect the element to dissect the HTML and update the code. It looks like the website may have added a new table without an ID. It could take some work, depending. I would suggest reading up on our tutorials, as we would modify the code to match the current HTML.
Web scraping part 1
Web scraping part 2

EDIT:
If that is close to what you want to do, then what exactly do you want? It might be better to just rewrite the code. You can get the table you are looking for like this; assign it, then do whatever you're looking for with it:
from bs4 import BeautifulSoup
import requests

URL = "https://www.fantasypros.com/nfl/reports/leaders/qb.php?year=2015" 

r = requests.get(URL)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find('table', {'id':'data'})
trs = table.tbody.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        print(td.text)
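If you'd rather keep the cells than print them, the same loop collapses into a list comprehension - the sample HTML below stands in for the real page:

```python
from bs4 import BeautifulSoup

html = """
<table id="data">
  <tbody>
    <tr><td>Smith</td><td>300</td></tr>
    <tr><td>Jones</td><td>250</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', {'id': 'data'})

# Collect each <tr> as a list of cell strings instead of printing
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.tbody.find_all('tr')]
print(rows)  # [['Smith', '300'], ['Jones', '250']]
```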
Also, if you're already using pandas, you can read the table directly with pandas. This gets the same content as the previous code snippet.
import pandas as pd

URL = "https://www.fantasypros.com/nfl/reports/leaders/qb.php?year=2015" 

df = pd.read_html(URL)
df = df[0]
print(df)
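Note that pd.read_html returns a *list* of DataFrames (hence the df[0] above). It can also narrow the match with the attrs parameter, which mirrors find('table', {'id': 'data'}). A small self-contained sketch with stand-in HTML:

```python
from io import StringIO

import pandas as pd

html = """
<table id="data">
  <tr><th>Player</th><th>Points</th></tr>
  <tr><td>Smith</td><td>300</td></tr>
</table>
<table><tr><td>no id here</td></tr></table>
"""
# read_html returns a list of DataFrames; attrs keeps only the
# table whose id is 'data', skipping the id-less table entirely
dfs = pd.read_html(StringIO(html), attrs={"id": "data"})
df = dfs[0]
print(df)
```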
Thanks, guys! Your information helped me over the hump. I was able to look through the code provided by Metalburr and backtrack from there. I ultimately realized that the pages I am hitting have no 'id' tag and, as Perfingo pointed out, don't even have a 'tbody' tag.

I eventually just had to parse all of the 'td' tags and pull my data that way. It's not at all elegant - as a matter of fact it looks like the insides of a V8 with 300k on it whose oil has never been changed - but I was looking for a fast fix, and it works.

It allowed me to get a working script that pulls a list of URL targets from a file (for Cisco phone pages), strips key data from their streaming statistics, dumps it to a CSV, and repeats every 5 minutes.
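For the curious, a rough sketch of that quick-and-dirty approach - the file names, CSV layout, and 5-minute interval are all assumptions on my part, not the actual script:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

def td_values(html):
    # The phone pages have no table id (or even tbody),
    # so just pull the text of every <td> on the page
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]

def poll_once(url_file="urls.txt", out_file="stats.csv"):
    # urls.txt: one phone-page URL per line (hypothetical layout)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(out_file, "a", newline="") as out:
        writer = csv.writer(out)
        for url in urls:
            r = requests.get(url, timeout=10)
            writer.writerow([url] + td_values(r.text))

# while True:
#     poll_once()
#     time.sleep(300)  # repeat every 5 minutes
```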

Thanks for all your help - I want to go back and review the links Larz60+ provided when I have a little more time.