Python Forum
How to capture Single Column from Web Html Table?
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
How to capture Single Column from Web Html Table?
#1
Hello,

I am trying to get complete values of three distinct columns from an online HTML table that has 13 columns in total. I have parsed the Html content using beautiful soup and can display the values as text files using the code below. However, the output generated just lists out everything in the table and I can't seem to figure out how to simply extract the columns that I need. The code (which I have written until now) is as below:

import re
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = 'http://stats.espncricinfo.com/ci/engine/player/348144.html?class=3;template=results;type=batting;view=innings'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container_main = page_soup.findAll("div", {"id":"ciHomeContent"})
container_main = container_main[0]
container_secondary = container_main.findAll("div", {"id":"ciMainContainer"})
container_secondary = container_secondary[0]
container_tertiary = container_secondary.findAll("div", {"id":"ciHomeContentlhs"})
container_tertiary = container_tertiary[0]
container_sublevel = container_tertiary.findAll("div", {"class":"pnl650M"})
container_sublevel = container_sublevel[0]
container_mainTable = container_sublevel.findAll("table", {"class":"engineTable"})
container_mainTable = container_mainTable[3]

for table in container_mainTable:
    Test = container_mainTable.tbody
    Test = Test.text
    print (Test)
This code gives me the following output:



15*
13
11
2
0
136.36
3
not out
2

v England
Manchester
7 Sep 2016
T20I # 566

----------------------------------------------New Record Begins (Inserted by User)-------------------------------------------

55*
49
37
6
2
148.64
3
not out
2

v West Indies
Dubai (DSC)
23 Sep 2016
T20I # 568

-----------------------------------------------New Record Begins (Until the final row in the table)---------------------------------------
From the above output, I just want values (which are denoted by between a '[td][/td]' tag) in the first column (15*, 55*.......). Values in 3rd Column (11, 37.....) and values in the 4th (2, 6.....) & 5th (0, 2......) columns. Following that, I would most probably export them to a CSV file along with generating Gnuplot graphs and other charts.

The data that I am scraping here is hosted in here http://stats.espncricinfo.com/ci/engine/...ew=innings and below are the first few HTML tags that I am trying to export this data from

<tbody>
        <tr class="data1">
                <td>15*</td>
                <td>13</td>
                <td>11</td>
                <td>2</td>
                <td>0</td>
                <td>136.36</td>
                <td>3</td>
                <td nowrap="nowrap">not out</td>
                <td>2</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/1.html" class="data-link">England</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/57160.html" class="data-link">Manchester</a></td>
                <td nowrap="nowrap"><b>7 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/913663.html" title="view the scorecard for this row">T20I # 566</a></td>
        </tr>
        <tr class="data1">
                <td>55*</td>
                <td>49</td>
                <td>37</td>
                <td>6</td>
                <td>2</td>
                <td>148.64</td>
                <td>3</td>
                <td nowrap="nowrap">not out</td>
                <td>2</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td>
                <td nowrap="nowrap"><b>23 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/1050217.html" title="view the scorecard for this row">T20I # 568</a></td>
        </tr>
        <tr class="data1">
                <td class="padAst">19</td>
                <td>28</td>
                <td>18</td>
                <td>2</td>
                <td>0</td>
                <td>105.55</td>
                <td>3</td>
                <td>caught</td>
                <td>1</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td>
                <td nowrap="nowrap"><b>24 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/1050219.html" title="view the scorecard for this row">T20I # 569</a></td>
        </tr>
        <tr class="data1">
                <td>27*</td>
                <td>42</td>
                <td>24</td>
                <td>1</td>
                <td>0</td>
                <td>112.50</td>
                <td>3</td>
                <td nowrap="nowrap">not out</td>
                <td>2</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/59396.html" class="data-link">Abu Dhabi</a></td>
                <td nowrap="nowrap"><b>27 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/1050221.html" title="view the scorecard for this row">T20I # 570</a></td>
        </tr>
Any help on the matter would be highly appreciated.

Thanks
Waqas
Reply
#2
Is having table as pandas dataframe is ok?

>>> df = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/348144.htmlclass=3;template=results;type=batting;view=innings')
>>> df[3]
   Runs Mins  BF  ...           Ground   Start Date  Unnamed: 13
0   15*   13  11  ...       Manchester   7 Sep 2016   T20I # 566
1   55*   49  37  ...      Dubai (DSC)  23 Sep 2016   T20I # 568
2    19   28  18  ...      Dubai (DSC)  24 Sep 2016   T20I # 569
3   27*   42  24  ...        Abu Dhabi  27 Sep 2016   T20I # 570
4    29    -  30  ...       Bridgetown  26 Mar 2017   T20I # 602
5    27    -  28  ...    Port of Spain  30 Mar 2017   T20I # 603
6    43    -  38  ...    Port of Spain   1 Apr 2017   T20I # 604
7    38    -  36  ...    Port of Spain   2 Apr 2017   T20I # 605
8    86    -  52  ...           Lahore  12 Sep 2017   T20I # 619
9    45    -  38  ...           Lahore  13 Sep 2017   T20I # 620
10   48    -  31  ...           Lahore  15 Sep 2017   T20I # 621
11    1    -   8  ...        Abu Dhabi  26 Oct 2017   T20I # 625
12    1    -   2  ...        Abu Dhabi  27 Oct 2017   T20I # 627
13  34*    -  31  ...           Lahore  29 Oct 2017   T20I # 629
14   41   68  41  ...       Wellington  22 Jan 2018   T20I # 639
15  50*   45  29  ...         Auckland  25 Jan 2018   T20I # 640
16   18    -  17  ...  Mount Maunganui  28 Jan 2018   T20I # 641
17   17    -  13  ...          Karachi   1 Apr 2018   T20I # 663
18  97*    -  58  ...          Karachi   2 Apr 2018   T20I # 664
19   51    -  40  ...          Karachi   3 Apr 2018   T20I # 665
20  68*    -  55  ...        Abu Dhabi  24 Oct 2018   T20I # 701
21   45    -  44  ...      Dubai (DSC)  26 Oct 2018   T20I # 702
22   50    -  40  ...      Dubai (DSC)  28 Oct 2018   T20I # 704
23    7    -   9  ...        Abu Dhabi  31 Oct 2018   T20I # 705
24   40    -  41  ...      Dubai (DSC)   2 Nov 2018   T20I # 706
25   79   66  58  ...      Dubai (DSC)   4 Nov 2018   T20I # 708
26   38   55  27  ...        Cape Town   1 Feb 2019   T20I # 732
27   90   78  58  ...     Johannesburg   3 Feb 2019   T20I # 734
28   23   10  11  ...        Centurion   6 Feb 2019   T20I # 736
29   65    -  42  ...          Cardiff   5 May 2019   T20I # 772

[30 rows x 14 columns]
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#3
(Jul-11-2019, 06:50 AM)perfringo Wrote: Is having table as pandas dataframe is ok?

>>> df = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/348144.htmlclass=3;template=results;type=batting;view=innings')
>>> df[3]
   Runs Mins  BF  ...           Ground   Start Date  Unnamed: 13
0   15*   13  11  ...       Manchester   7 Sep 2016   T20I # 566
1   55*   49  37  ...      Dubai (DSC)  23 Sep 2016   T20I # 568
2    19   28  18  ...      Dubai (DSC)  24 Sep 2016   T20I # 569
3   27*   42  24  ...        Abu Dhabi  27 Sep 2016   T20I # 570
4    29    -  30  ...       Bridgetown  26 Mar 2017   T20I # 602
5    27    -  28  ...    Port of Spain  30 Mar 2017   T20I # 603
6    43    -  38  ...    Port of Spain   1 Apr 2017   T20I # 604
7    38    -  36  ...    Port of Spain   2 Apr 2017   T20I # 605
8    86    -  52  ...           Lahore  12 Sep 2017   T20I # 619
9    45    -  38  ...           Lahore  13 Sep 2017   T20I # 620
10   48    -  31  ...           Lahore  15 Sep 2017   T20I # 621
11    1    -   8  ...        Abu Dhabi  26 Oct 2017   T20I # 625
12    1    -   2  ...        Abu Dhabi  27 Oct 2017   T20I # 627
13  34*    -  31  ...           Lahore  29 Oct 2017   T20I # 629
14   41   68  41  ...       Wellington  22 Jan 2018   T20I # 639
15  50*   45  29  ...         Auckland  25 Jan 2018   T20I # 640
16   18    -  17  ...  Mount Maunganui  28 Jan 2018   T20I # 641
17   17    -  13  ...          Karachi   1 Apr 2018   T20I # 663
18  97*    -  58  ...          Karachi   2 Apr 2018   T20I # 664
19   51    -  40  ...          Karachi   3 Apr 2018   T20I # 665
20  68*    -  55  ...        Abu Dhabi  24 Oct 2018   T20I # 701
21   45    -  44  ...      Dubai (DSC)  26 Oct 2018   T20I # 702
22   50    -  40  ...      Dubai (DSC)  28 Oct 2018   T20I # 704
23    7    -   9  ...        Abu Dhabi  31 Oct 2018   T20I # 705
24   40    -  41  ...      Dubai (DSC)   2 Nov 2018   T20I # 706
25   79   66  58  ...      Dubai (DSC)   4 Nov 2018   T20I # 708
26   38   55  27  ...        Cape Town   1 Feb 2019   T20I # 732
27   90   78  58  ...     Johannesburg   3 Feb 2019   T20I # 734
28   23   10  11  ...        Centurion   6 Feb 2019   T20I # 736
29   65    -  42  ...          Cardiff   5 May 2019   T20I # 772

[30 rows x 14 columns]

Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?

Thanks
Reply
#4
(Jul-11-2019, 05:12 AM)ahmedwaqas92 Wrote: I would most probably export them to a CSV file

Link appears to be dead, so generic example how to write specific columns to file:

>>> import pandas as pd
>>> d = {'ham': [1, 2, 3], 'spam': ['a', 'b', 'c'], 'bacon': ['1a', '2b', '3c']}
>>> df = pd.DataFrame(d)
>>> df
   ham spam bacon
0    1    a    1a
1    2    b    2b
2    3    c    3c
>>> df.to_csv('out.csv', columns=['ham', 'bacon'], index=False)
The content of out.csv is:

Output:
ham,bacon 1,1a 2,2b 3,3c
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#5
(Jul-12-2019, 04:34 AM)ahmedwaqas92 Wrote: Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?
It's just a lot more work and it's not so easy either as have to make columns as it's not clear defined in html.
Need to clean up data to,if gone do plot or other stuff.
As shown by @perfringo Pandas make this a lot easier.

Here a NoteBook where i show some different stuff that may be needed,like clean up take out columns.
Reply
#6
(Jul-12-2019, 06:26 AM)perfringo Wrote:
(Jul-11-2019, 05:12 AM)ahmedwaqas92 Wrote: I would most probably export them to a CSV file

Link appears to be dead, so generic example how to write specific columns to file:

>>> import pandas as pd
>>> d = {'ham': [1, 2, 3], 'spam': ['a', 'b', 'c'], 'bacon': ['1a', '2b', '3c']}
>>> df = pd.DataFrame(d)
>>> df
   ham spam bacon
0    1    a    1a
1    2    b    2b
2    3    c    3c
>>> df.to_csv('out.csv', columns=['ham', 'bacon'], index=False)
The content of out.csv is:

Output:
ham,bacon 1,1a 2,2b 3,3c

(Jul-12-2019, 10:48 AM)snippsat Wrote:
(Jul-12-2019, 04:34 AM)ahmedwaqas92 Wrote: Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?
It's just a lot more work and it's not so easy either as have to make columns as it's not clear defined in html.
Need to clean up data to,if gone do plot or other stuff.
As shown by @perfringo Pandas make this a lot easier.

Here a NoteBook where i show some different stuff that may be needed,like clean up take out columns.


Apologies for the delayed response on the matter, I reviewed the detailed instructions on both these examples and then read a few docs pertaining to the use of data frames in Python. Based on all what I gathered in the last week or so I have managed to make a small Python script which captures the data as to how I want. Then it removes the excess columns, removes special characters and does some calculation as well. The results are finally presented in a pie chart - See code below

import pandas as pd
import matplotlib.pyplot as plt
dataMain = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/422108.html?class=3;spanmin1=07+Sep+2016;spanval1=span;template=results;type=batting;view=innings')
dataTabulated = dataMain[3]
columns = [0,2,3,4]
tempFrame = pd.DataFrame(dataTabulated)
dataFinal = tempFrame[tempFrame.columns[columns]]
dataFinal = dataFinal[~dataFinal.Runs.str.contains("DNB")]
dataFinal = dataFinal.replace('\*','',regex=True).astype(float)


totalRuns = dataFinal['Runs'].sum()
ballsFaced = dataFinal['BF'].sum()
fours = dataFinal['4s'].sum()
sixes = dataFinal['6s'].sum()

totalFours = fours * 4
totalSixes = sixes * 6

#This is Calculating the Total runs Scored in Boundaries and Rotation(1s,2s,3s)
boundaryRuns = totalFours + totalSixes
rotationRuns = totalRuns - boundaryRuns

#This is Calculating the Percentage runs in Rotation(1s,2s,3s) & in Boundaries (1s,2s,3s)
rotationRuns_p = rotationRuns / totalRuns * 100
boundaryRuns_p = boundaryRuns / totalRuns * 100

#Calculating Approximate & Percentage Dot balls
forBall = ballsFaced-(fours+sixes)
forRun = totalRuns-(totalFours+totalSixes)
approxDot = forBall-forRun
approxDot_p = approxDot/ballsFaced * 100
score_P = 100 - approxDot_p

print(round(approxDot_p,2))

#Plotting for Boundaries / Rotation Ratio
labels = 'Rotation', 'Boundaries'
sizes = [rotationRuns_p, boundaryRuns_p]
colors = ['yellowgreen', 'yellow']
explode = [0.1, 0] #Explode 1st Slice

#Plotting the Pie Chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()

#Plotting for Dot Ball Percentage
labels = 'Approx Dot %', 'Scoring %'
sizes = [approxDot_p, score_P]
colors = ['red', 'blue']
explode = [0.1, 0] #Explode 1st Slice

#Plotting the Pie Chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()

I guess now it seems that my problem has been resolved so you can mark this thread as solved. Your help on the matter @perfringo & @snippsat is much appreciated :)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Help: Beautiful Soup - Parsing HTML table ironfelix717 2 383 Oct-01-2020, 02:19 PM
Last Post: snippsat
  Python3 + BeautifulSoup4 + lxml (HTML -> CSV) - How to loop to next HTML/new CSV Row BrandonKastning 0 501 Mar-22-2020, 06:10 AM
Last Post: BrandonKastning
  Imprt HTML table to array meleghengersor 2 467 Jan-23-2020, 10:23 AM
Last Post: perfringo
  BeautifulSoup: Error while extracting a value from an HTML table kawasso 3 925 Aug-25-2019, 01:13 AM
Last Post: kawasso
  convert html table to json bhojendra 5 6,072 Jul-30-2019, 07:53 PM
Last Post: DeaD_EyE
  sqlalchemy DataTables::"No data available in table" when using self-joined table Asma 0 1,111 Nov-22-2018, 02:46 PM
Last Post: Asma
  How to capture value from javascript using flask Prince_Bhatia 0 946 Nov-02-2018, 10:20 AM
Last Post: Prince_Bhatia

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020