 Need help with Beautiful Soup - table
#1
I am very much a newbie and I'm just trying to learn. Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

table = soup.findAll("table", {"class": "team-schedule"})

for row in table:
    tds = row.findAll('td')
    for td in tds:
        print(td.text)
The results come back, but each value prints on its own line.

09/03/18
Virginia Tech
L 3-24
3:12
75,237
09/08/18
Samford
W 36-26
3:51
72,239
09/15/18
@ 17 Syracuse
L 7-30
3:37
37,457
09/22/18
Northern Ill.
W 37-19
3:34
65,633
09/29/18
@ Louisville
W 28-24
3:27
52,798
10/06/18
@ Miami (Fla.)
L 27-28
4:01
65,490
10/20/18
Wake Forest
W 38-17
3:34
67,274
10/27/18
2 Clemson
L 10-59
3:47
68,403
11/03/18
@ North Carolina St.
L 28-47
3:33
57,600
11/10/18
@ 3 Notre Dame
L 13-42
3:22
77,622
11/17/18
Boston College
W 22-21
3:31
57,274
11/24/18
10 Florida
L 14-41
3:27
71,953
@ : Away, + : Neutral Site

My goal is to return the columns with date, opponent, and attendance (at least). The last row is immaterial and needs to be removed. It would also be good to learn how to create an additional column where if you see a @ in opponent the column says A, + is N, and neither is H.
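A minimal sketch of that home/away/neutral column, as pure string logic (it assumes the opponent cell keeps its leading @ or + marker, as in the output above):

```python
def venue(opponent):
    """Map the opponent string's leading marker to a venue code."""
    if opponent.startswith('@'):
        return 'A'   # away game
    if opponent.startswith('+'):
        return 'N'   # neutral site
    return 'H'       # home game

print(venue('@ Louisville'))   # A
print(venue('Samford'))        # H
```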

The date and opponent names have classes in the table but attendance does not.

Appreciate any guidance. It's just a learning exercise.
#2
This works here:

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup

url = 'http://www.cfbstats.com/2018/team/234/index.html'

ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
soup.prettify()

table = soup.find("table", attrs={"class":"team-schedule"})

datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)

_len = len(datasets)
for x in range(_len -1):
    t = datasets[x]
    print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))
Output:
Date                          Opponent                      Attendance
09/03/18                      Virginia Tech                 75,237
09/08/18                      Samford                       72,239
09/15/18                      @ 17 Syracuse                 37,457
09/22/18                      Northern Ill.                 65,633
09/29/18                      @ Louisville                  52,798
10/06/18                      @ Miami (Fla.)                65,490
10/20/18                      Wake Forest                   67,274
10/27/18                      2 Clemson                     68,403
11/03/18                      @ North Carolina St.          57,600
11/10/18                      @ 3 Notre Dame                77,622
11/17/18                      Boston College                57,274
11/24/18                      10 Florida                    71,953
#3
I did it a bit differently, same results:
import requests
from bs4 import BeautifulSoup
import csv
import os


url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
 
table = soup.findAll("table",{"class": "team-schedule"})[0]
trs = table.find_all('tr')
header = []
for n, tr in enumerate(trs):
    if n == 0:
        # Get Header
        ths = tr.find_all('th')
        for th in ths:
            header.append(th.text.strip())
        for item in header:
            print('{:22}'.format(item), end='')
        print()

        continue
    else:
        game_item = []
        tds = tr.find_all('td')
        for td in tds:
            game_item.append(td.text.strip())
    for item in game_item:
        print('{:22}'.format(item), end='')
    print()
Output:
Date                  Opponent              Result                Game Time             Attendance
09/03/18              Virginia Tech         L 3-24                3:12                  75,237
09/08/18              Samford               W 36-26               3:51                  72,239
09/15/18              @ 17 Syracuse         L 7-30                3:37                  37,457
09/22/18              Northern Ill.         W 37-19               3:34                  65,633
09/29/18              @ Louisville          W 28-24               3:27                  52,798
10/06/18              @ Miami (Fla.)        L 27-28               4:01                  65,490
10/20/18              Wake Forest           W 38-17               3:34                  67,274
10/27/18              2 Clemson             L 10-59               3:47                  68,403
11/03/18              @ North Carolina St.  L 28-47               3:33                  57,600
11/10/18              @ 3 Notre Dame        L 13-42               3:22                  77,622
11/17/18              Boston College        W 22-21               3:31                  57,274
11/24/18              10 Florida            L 14-41               3:27                  71,953
@ : Away, + : Neutral Site
#4
Axel - Very interesting. Thank you!

Do you mind stepping through some questions/assumptions?

This takes every row in the table, gets the row's text, and splits it on newline characters into a list of cell values. Each list is then appended to datasets.

datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)
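As a pure-string sketch of what that split produces for one row (the sample text mimics what get_text() returns; the page source has a newline between the <td> tags, which is where the '\n' separators come from):

```python
# One table row's text, as get_text() returns it: the newlines between the
# <td> tags in the page source put each cell's text on its own line.
row_text = '\n09/08/18\nSamford\nW 36-26\n3:51\n72,239\n'
t = row_text.split('\n')
print(t)  # ['', '09/08/18', 'Samford', 'W 36-26', '3:51', '72,239', '']
# Index 0 is an empty string (the text starts with a newline),
# which is why the prints below use t[1], t[2], t[5].
print(t[1], t[2], t[5])  # 09/08/18 Samford 72,239
```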
I'm having a really hard time following this one. And how did the headers get there?

_len = len(datasets)
for x in range(_len -1):
    t = datasets[x]
    print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))
I have learned some code for csv writer. Below is a sample.

with open('test_cfbstats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Opponent'])
    writer.writerows(data)
How would you suggest modifying this for use in your code? I'm not sure whether the writerow call would still be necessary, and would writerows change to take datasets?
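One hedged way to wire csv.writer into the datasets approach above (the rows here are stand-ins shaped like the real ones, not fetched from the page; the [:-1] slice drops the footer row):

```python
import csv

# Sample rows shaped like the lists that datasets holds; the last one
# mimics the "@ : Away, + : Neutral Site" footer row.
datasets = [
    ['', 'Date', 'Opponent', 'Result', 'Game Time', 'Attendance', ''],
    ['', '09/08/18', 'Samford', 'W 36-26', '3:51', '72,239', ''],
    ['', '@ : Away, + : Neutral Site', ''],
]

with open('test_cfbstats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # writerows takes all rows at once; pick the same three columns
    # per row and slice off the footer with [:-1].
    writer.writerows([row[1], row[2], row[5]] for row in datasets[:-1])
```

Because the header row is already the first element of datasets, no separate writerow call is needed here.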


When I run your code, I get:

File "cfbstats_larz.py", line 9, in <module>
    soup = BeautifulSoup(page, 'html.parser')
NameError: name 'page' is not defined

Is it something I did?

(Dec-15-2018, 11:10 PM)Larz60+ Wrote: I did it a bit differently, same results:
import requests
from bs4 import BeautifulSoup
import csv
import os


url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(page, 'html.parser')
 

My apologies for combining the replies, I don't know what happened.
#5
That's me. I had renamed it page for my testing and thought I had renamed everything, but missed line 9, which should read:
soup = BeautifulSoup(r.text, 'html.parser')
I also edited my original post.
#6
(Dec-16-2018, 05:02 PM)jlkmb Wrote: and how did the headers get there?

The column headings are inside the first <tr> </tr> row, so find_all("tr") picks them up along with the game rows.

To save it as csv (change the delimiter to what you need):

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup
import  csv
 
url = 'http://www.cfbstats.com/2018/team/234/index.html'
 
ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
soup.prettify()
 
table = soup.find("table", attrs={"class":"team-schedule"})
 
datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)

mypath = '/tmp/test_cfbstats.csv'
with open(mypath, 'w') as stream:
    writer = csv.writer(stream, delimiter='\t')
    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]
        writer.writerow(myrow)
#7
Thanks Axel - Can you explain what the following code does?

_len = len(datasets)
for x in range(_len - 1):
    t = datasets[x]
    myrow = [t[1], t[2], t[5]]

Thanks Larz. That makes sense. A few questions just to make sure I understand.

I hadn't seen enumerate yet. Interesting. Where is n defined? How about item? Can you explain the print statement?

  
trs = table.find_all('tr')
header = []
for n, tr in enumerate(trs):
    if n == 0:
        # Get Header
        ths = tr.find_all('th')
        for th in ths:
            header.append(th.text.strip())
        for item in header:
            print('{:22}'.format(item), end='')
        print()
How would I get the last line to not appear?

Larz - forgot to ask - do you implement csv the same way as Axel?
#8
n is a name that I made up; it could be any name you wish. The same goes for item, tr, th, and trs.
In some languages you have to declare variables before using them; in Python you declare and use them in the same operation.
enumerate returns each item of the loop along with its iteration number.
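A two-line illustration of that:

```python
rows = ['header', 'game 1', 'game 2']
for n, row in enumerate(rows):
    print(n, row)
# 0 header
# 1 game 1
# 2 game 2
```

That is why `if n == 0:` in the code above handles only the header row.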
Quote:How would I get the last line to not appear?
Not sure what you're asking here. If it's about the print() at the end, that just emits a newline; otherwise, additional print statements would end up on the same line, one after the other (the end='' suppresses the newline).
#9
He means the last line in the table:

Quote:@ : Away, + : Neutral Site

that's why I used

_len = len(datasets)
for x in range(_len - 1):
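The same effect can also be written with a slice, which skips the footer without the index arithmetic (a sketch over stand-in data, not the live page):

```python
# datasets stand-ins: three game rows plus the footer row to drop
datasets = ['game 1', 'game 2', 'game 3', '@ : Away, + : Neutral Site']

for t in datasets[:-1]:   # [:-1] stops one element short of the end
    print(t)
# game 1
# game 2
# game 3
```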
You do not need CSV writer for the csv file.

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup

url = 'http://www.cfbstats.com/2018/team/234/index.html'
ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
soup.prettify()
  
table = soup.find("table", attrs={"class":"team-schedule"})
  
datasets = []
mytable = table.find_all("tr")
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)
 
mypath = '/tmp/test_cfbstats.csv'
with open(mypath, 'w') as stream:
    _len = len(datasets)
    for x in range(_len - 1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]
        t = "\t".join(myrow)
        print(t.expandtabs(22))
        stream.write(t + "\n")
    # no stream.close() needed: the with block closes the file on exit
#10
Thanks guys. I appreciate your help!
