Python Forum
Need help with Beautiful Soup - table
#1
I am very much a newbie and I'm just trying to learn. Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

table = soup.findAll("table",{"class":"team-schedule"})

for row in table:
    tds = row.findAll('td')
    for td in tds:
        print(td.text)
The results come back, but line by line rather than as rows and columns.

09/03/18  Virginia Tech          L 3-24   3:12  75,237
09/08/18  Samford                W 36-26  3:51  72,239
09/15/18  @ 17 Syracuse          L 7-30   3:37  37,457
09/22/18  Northern Ill.          W 37-19  3:34  65,633
09/29/18  @ Louisville           W 28-24  3:27  52,798
10/06/18  @ Miami (Fla.)         L 27-28  4:01  65,490
10/20/18  Wake Forest            W 38-17  3:34  67,274
10/27/18  2 Clemson              L 10-59  3:47  68,403
11/03/18  @ North Carolina St.   L 28-47  3:33  57,600
11/10/18  @ 3 Notre Dame         L 13-42  3:22  77,622
11/17/18  Boston College         W 22-21  3:31  57,274
11/24/18  10 Florida             L 14-41  3:27  71,953
@ : Away, + : Neutral Site

My goal is to return the columns with date, opponent, and attendance (at least). The last row is immaterial and needs to be removed. It would also be good to learn how to create an additional column: if the opponent contains @ the column should say A, if it contains + it should say N, and otherwise H.

The date and opponent names have classes in the table but attendance does not.

Appreciate any guidance. It's just a learning exercise.
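(For reference, the @/+ rule described above could be sketched like this; the helper name and the sample strings are made up for illustration.)

```python
# Minimal sketch of the home/away/neutral rule: '@' marks an away game,
# '+' a neutral site, anything else is a home game.
def site(opponent):
    if opponent.startswith('@'):
        return 'A'   # away game
    if opponent.startswith('+'):
        return 'N'   # neutral site
    return 'H'       # home game

for name in ['@ 17 Syracuse', '+ Alabama', 'Virginia Tech']:
    print(name, '->', site(name))
```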
Reply
#2
This works here:

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup

url = 'http://www.cfbstats.com/2018/team/234/index.html'

ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
soup.prettify()

table = soup.find("table", attrs={"class":"team-schedule"})

datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)

_len = len(datasets)
for x in range(_len -1):
    t = datasets[x]
    print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))
Output:
Date                          Opponent                      Attendance
09/03/18                      Virginia Tech                 75,237
09/08/18                      Samford                       72,239
09/15/18                      @ 17 Syracuse                 37,457
09/22/18                      Northern Ill.                 65,633
09/29/18                      @ Louisville                  52,798
10/06/18                      @ Miami (Fla.)                65,490
10/20/18                      Wake Forest                   67,274
10/27/18                      2 Clemson                     68,403
11/03/18                      @ North Carolina St.          57,600
11/10/18                      @ 3 Notre Dame                77,622
11/17/18                      Boston College                57,274
11/24/18                      10 Florida                    71,953
Reply
#3
I did it a bit differently, same results:
import requests
from bs4 import BeautifulSoup
import csv
import os


url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
 
table = soup.findAll("table",{"class": "team-schedule"})[0]
trs = table.find_all('tr')
header = []
for n, tr in enumerate(trs):
    if n == 0:
        # Get Header
        ths = tr.find_all('th')
        for th in ths:
            header.append(th.text.strip())
        for item in header:
            print('{:22}'.format(item), end='')
        print()

        continue
    else:
        game_item = []
        tds = tr.find_all('td')
        for td in tds:
            game_item.append(td.text.strip())
    for item in game_item:
        print('{:22}'.format(item), end='')
    print()
Output:
Date                  Opponent              Result    Game Time  Attendance
09/03/18              Virginia Tech         L 3-24    3:12       75,237
09/08/18              Samford               W 36-26   3:51       72,239
09/15/18              @ 17 Syracuse         L 7-30    3:37       37,457
09/22/18              Northern Ill.         W 37-19   3:34       65,633
09/29/18              @ Louisville          W 28-24   3:27       52,798
10/06/18              @ Miami (Fla.)        L 27-28   4:01       65,490
10/20/18              Wake Forest           W 38-17   3:34       67,274
10/27/18              2 Clemson             L 10-59   3:47       68,403
11/03/18              @ North Carolina St.  L 28-47   3:33       57,600
11/10/18              @ 3 Notre Dame        L 13-42   3:22       77,622
11/17/18              Boston College        W 22-21   3:31       57,274
11/24/18              10 Florida            L 14-41   3:27       71,953
@ : Away, + : Neutral Site
Reply
#4
Axel - Very interesting. Thank you!

Do you mind stepping through some questions/assumptions?

This creates a dataset from the table: it takes every row in the table, gets the row's text, splits that string at each newline, and appends the resulting list to datasets.

datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)
I'm having a really hard time following this one - and how did the headers get there?

_len = len(datasets)
for x in range(_len -1):
    t = datasets[x]
    print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))
I have learned some code for csv writer. Below is a sample.

with open('test_cfbstats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Opponent'])
    writer.writerows(data)
How would you suggest modifying this for use in your code? I'm not sure whether the writerow call would still be necessary, and would writerows(data) change to writerows(datasets)?
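(If it helps: writerow writes a single row, while writerows takes a whole list of rows in one call. A sketch, with made-up sample rows shaped like the datasets list from the earlier snippet, where index 1 is the date, 2 the opponent, and 5 the attendance.)

```python
import csv

# Made-up rows shaped like the `datasets` list built earlier:
# index 1 = date, index 2 = opponent, index 5 = attendance.
datasets = [
    ['', 'Date', 'Opponent', 'Result', 'Game Time', 'Attendance', ''],
    ['', '09/03/18', 'Virginia Tech', 'L 3-24', '3:12', '75,237', ''],
    ['', '@ : Away, + : Neutral Site', '', '', '', '', ''],
]

# Keep only the wanted columns and drop the legend row at the end.
rows = [[t[1], t[2], t[5]] for t in datasets[:-1]]

with open('test_cfbstats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)   # one call writes every row
```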


File "cfbstats_larz.py", line 9, in <module>
soup = BeautifulSoup(page, 'html.parser')
NameError: name 'page' is not defined

Is it something I did?

(Dec-15-2018, 11:10 PM)Larz60+ Wrote: I did it a bit differently, same results:
import requests
from bs4 import BeautifulSoup
import csv
import os


url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(page, 'html.parser')
 
table = soup.findAll("table",{"class": "team-schedule"})[0]
trs = table.find_all('tr')
header = []
for n, tr in enumerate(trs):
    if n == 0:
        # Get Header
        ths = tr.find_all('th')
        for th in ths:
            header.append(th.text.strip())
        for item in header:
            print('{:22}'.format(item), end='')
        print()

        continue
    else:
        game_item = []
        tds = tr.find_all('td')
        for td in tds:
            game_item.append(td.text.strip())
    for item in game_item:
        print('{:22}'.format(item), end='')
    print()

My apologies for combining the replies, I don't know what happened.
Reply
#5
That's me. I had renamed it page for my testing and thought I had renamed everything, but missed line 9, which should read:
soup = BeautifulSoup(r.text, 'html.parser')
I also edited my original post.
Reply
#6
(Dec-16-2018, 05:02 PM)jlkmb Wrote: and how did the headers get there?

The column headings are within <tr> </tr>
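(Concretely: the first <tr> of the schedule table holds <th> header cells, which is why the header row lands in datasets along with the games. A small demo with a cut-down stand-in for the real table.)

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the schedule table: the first <tr> holds the
# <th> header cells, later <tr>s hold <td> data cells.
html = """
<table class="team-schedule">
  <tr><th>Date</th><th>Opponent</th><th>Attendance</th></tr>
  <tr><td>09/03/18</td><td>Virginia Tech</td><td>75,237</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
first_tr = soup.find('table', attrs={'class': 'team-schedule'}).find('tr')
headers = [th.get_text(strip=True) for th in first_tr.find_all('th')]
print(headers)
```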

To save it as CSV (change the delimiter to what you need):

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup
import csv
 
url = 'http://www.cfbstats.com/2018/team/234/index.html'
 
ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
soup.prettify()
 
table = soup.find("table", attrs={"class":"team-schedule"})
 
datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)

mypath = '/tmp/test_cfbstats.csv'
with open(mypath, 'w') as stream:
    writer = csv.writer(stream, delimiter='\t')
    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]
        writer.writerow(myrow)
Reply
#7
Thanks Axel - Can you explain what the following code does?

    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]
Thanks Larz. That makes sense. A few questions just to make sure I understand.

I hadn't seen enumerate yet. Interesting. Where is n defined? How about item? Can you explain the print statement?

  
trs = table.find_all('tr')
header = []
for n, tr in enumerate(trs):
    if n == 0:
        # Get Header
        ths = tr.find_all('th')
        for th in ths:
            header.append(th.text.strip())
        for item in header:
            print('{:22}'.format(item), end='')
        print()
How would I get the last line to not appear?

Larz - forgot to ask - do you implement csv the same way as Axel?
Reply
#8
n is a name that I made up; it could be any name you wish. Same with item, tr, th, and trs.
In some languages you have to declare variables before using them; in Python you can declare and use in the same operation.
enumerate returns the iteration number of the loop along with each item.
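(In other words, enumerate hands back (index, item) pairs, which is what lets the loop treat row 0 as the header row. A tiny demo with made-up rows:)

```python
trs = ['header row', 'game row 1', 'game row 2']

# enumerate pairs each item with its position, starting at 0.
for n, tr in enumerate(trs):
    if n == 0:
        print('headers:', tr)
    else:
        print('data:', tr)
```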
Quote:How would I get the last line to not appear?
Not sure what you're asking here. If it's about the print() at the end, that just sends a newline; without it, additional print statements would end up on the same line, one after the other (the end='' suppresses the newline).
Reply
#9
He means the last line in the table

Quote:@ : Away, + : Neutral Site

That's why I used

_len = len(datasets)
for x in range(_len - 1):
You do not need the csv writer to produce the CSV file.

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup

url = 'http://www.cfbstats.com/2018/team/234/index.html'
ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
soup.prettify()
  
table = soup.find("table", attrs={"class":"team-schedule"})
  
datasets = []
mytable = table.find_all("tr")
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)
 
mypath = '/tmp/test_cfbstats.csv'
with open(mypath, 'w') as stream:
    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]
        t = "\t".join(myrow)
        print(t.expandtabs(22))
        stream.write(t + "\n")
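(An aside on the range(_len - 1) idiom: it visits every index except the last, which is what drops the legend row. A list slice expresses the same thing and may read more clearly; the sample list below is made up.)

```python
datasets = ['header', 'game 1', 'game 2', '@ : Away, + : Neutral Site']

# range(len(...) - 1) stops one short of the end, skipping the legend row.
kept = [datasets[x] for x in range(len(datasets) - 1)]
print(kept)

# A slice expresses the same thing directly.
print(datasets[:-1] == kept)
```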
Reply
#10
Thanks guys. I appreciate your help!
Reply

