Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 How to capture Single Column from Web Html Table?
#1
Hello,

I am trying to get complete values of three distinct columns from an online HTML table that has 13 columns in total. I have parsed the Html content using beautiful soup and can display the values as text files using the code below. However, the output generated just lists out everything in the table and I can't seem to figure out how to simply extract the columns that I need. The code (which I have written until now) is as below:

import re
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = 'http://stats.espncricinfo.com/ci/engine/player/348144.html?class=3;template=results;type=batting;view=innings'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container_main = page_soup.findAll("div", {"id":"ciHomeContent"})
container_main = container_main[0]
container_secondary = container_main.findAll("div", {"id":"ciMainContainer"})
container_secondary = container_secondary[0]
container_tertiary = container_secondary.findAll("div", {"id":"ciHomeContentlhs"})
container_tertiary = container_tertiary[0]
container_sublevel = container_tertiary.findAll("div", {"class":"pnl650M"})
container_sublevel = container_sublevel[0]
container_mainTable = container_sublevel.findAll("table", {"class":"engineTable"})
container_mainTable = container_mainTable[3]

for table in container_mainTable:
    Test = container_mainTable.tbody
    Test = Test.text
    print (Test)
This code gives me the following output:



15*
13
11
2
0
136.36
3
not out
2

v England
Manchester
7 Sep 2016
T20I # 566

----------------------------------------------New Record Begins (Inserted by User)-------------------------------------------

55*
49
37
6
2
148.64
3
not out
2

v West Indies
Dubai (DSC)
23 Sep 2016
T20I # 568

-----------------------------------------------New Record Begins (Until the final row in the table)---------------------------------------
From the above output, I just want values (which are denoted by between a '[td][/td]' tag) in the first column (15*, 55*.......). Values in 3rd Column (11, 37.....) and values in the 4th (2, 6.....) & 5th (0, 2......) columns. Following that, I would most probably export them to a CSV file along with generating Gnuplot graphs and other charts.

The data that I am scraping here is hosted in here http://stats.espncricinfo.com/ci/engine/...ew=innings and below are the first few HTML tags that I am trying to export this data from


<tbody>
        <tr class="data1">
                <td>15*</td>
                <td>13</td>
                <td>11</td>
                <td>2</td>
                <td>0</td>
                <td>136.36</td>
                <td>3</td>
                <td nowrap="nowrap">not out</td>
                <td>2</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/1.html" class="data-link">England</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/57160.html" class="data-link">Manchester</a></td>
                <td nowrap="nowrap"><b>7 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/913663.html" title="view the scorecard for this row">T20I # 566</a></td>
        </tr>
        <tr class="data1">
                <td>55*</td>
                <td>49</td>
                <td>37</td>
                <td>6</td>
                <td>2</td>
                <td>148.64</td>
                <td>3</td>
                <td nowrap="nowrap">not out</td>
                <td>2</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td>
                <td nowrap="nowrap"><b>23 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/1050217.html" title="view the scorecard for this row">T20I # 568</a></td>
        </tr>
        <tr class="data1">
                <td class="padAst">19</td>
                <td>28</td>
                <td>18</td>
                <td>2</td>
                <td>0</td>
                <td>105.55</td>
                <td>3</td>
                <td>caught</td>
                <td>1</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td>
                <td nowrap="nowrap"><b>24 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/1050219.html" title="view the scorecard for this row">T20I # 569</a></td>
        </tr>
        <tr class="data1">
                <td>27*</td>
                <td>42</td>
                <td>24</td>
                <td>1</td>
                <td>0</td>
                <td>112.50</td>
                <td>3</td>
                <td nowrap="nowrap">not out</td>
                <td>2</td>
                <td></td>
                <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
                <td class="left" nowrap="nowrap"><a href="/ci/content/ground/59396.html" class="data-link">Abu Dhabi</a></td>
                <td nowrap="nowrap"><b>27 Sep 2016</b></td>
                <td style="white-space: nowrap;"><a href="/ci/engine/match/1050221.html" title="view the scorecard for this row">T20I # 570</a></td>
        </tr>
Any help on the matter would be highly appreciated.

Thanks
Waqas
Quote
#2
Is having table as pandas dataframe is ok?

>>> df = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/348144.htmlclass=3;template=results;type=batting;view=innings')
>>> df[3]
   Runs Mins  BF  ...           Ground   Start Date  Unnamed: 13
0   15*   13  11  ...       Manchester   7 Sep 2016   T20I # 566
1   55*   49  37  ...      Dubai (DSC)  23 Sep 2016   T20I # 568
2    19   28  18  ...      Dubai (DSC)  24 Sep 2016   T20I # 569
3   27*   42  24  ...        Abu Dhabi  27 Sep 2016   T20I # 570
4    29    -  30  ...       Bridgetown  26 Mar 2017   T20I # 602
5    27    -  28  ...    Port of Spain  30 Mar 2017   T20I # 603
6    43    -  38  ...    Port of Spain   1 Apr 2017   T20I # 604
7    38    -  36  ...    Port of Spain   2 Apr 2017   T20I # 605
8    86    -  52  ...           Lahore  12 Sep 2017   T20I # 619
9    45    -  38  ...           Lahore  13 Sep 2017   T20I # 620
10   48    -  31  ...           Lahore  15 Sep 2017   T20I # 621
11    1    -   8  ...        Abu Dhabi  26 Oct 2017   T20I # 625
12    1    -   2  ...        Abu Dhabi  27 Oct 2017   T20I # 627
13  34*    -  31  ...           Lahore  29 Oct 2017   T20I # 629
14   41   68  41  ...       Wellington  22 Jan 2018   T20I # 639
15  50*   45  29  ...         Auckland  25 Jan 2018   T20I # 640
16   18    -  17  ...  Mount Maunganui  28 Jan 2018   T20I # 641
17   17    -  13  ...          Karachi   1 Apr 2018   T20I # 663
18  97*    -  58  ...          Karachi   2 Apr 2018   T20I # 664
19   51    -  40  ...          Karachi   3 Apr 2018   T20I # 665
20  68*    -  55  ...        Abu Dhabi  24 Oct 2018   T20I # 701
21   45    -  44  ...      Dubai (DSC)  26 Oct 2018   T20I # 702
22   50    -  40  ...      Dubai (DSC)  28 Oct 2018   T20I # 704
23    7    -   9  ...        Abu Dhabi  31 Oct 2018   T20I # 705
24   40    -  41  ...      Dubai (DSC)   2 Nov 2018   T20I # 706
25   79   66  58  ...      Dubai (DSC)   4 Nov 2018   T20I # 708
26   38   55  27  ...        Cape Town   1 Feb 2019   T20I # 732
27   90   78  58  ...     Johannesburg   3 Feb 2019   T20I # 734
28   23   10  11  ...        Centurion   6 Feb 2019   T20I # 736
29   65    -  42  ...          Cardiff   5 May 2019   T20I # 772

[30 rows x 14 columns]
snippsat likes this post
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Life of Brian: Conjugate the verb, "to go" !
Quote
#3
(Jul-11-2019, 06:50 AM)perfringo Wrote: Is having table as pandas dataframe is ok?

>>> df = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/348144.htmlclass=3;template=results;type=batting;view=innings')
>>> df[3]
   Runs Mins  BF  ...           Ground   Start Date  Unnamed: 13
0   15*   13  11  ...       Manchester   7 Sep 2016   T20I # 566
1   55*   49  37  ...      Dubai (DSC)  23 Sep 2016   T20I # 568
2    19   28  18  ...      Dubai (DSC)  24 Sep 2016   T20I # 569
3   27*   42  24  ...        Abu Dhabi  27 Sep 2016   T20I # 570
4    29    -  30  ...       Bridgetown  26 Mar 2017   T20I # 602
5    27    -  28  ...    Port of Spain  30 Mar 2017   T20I # 603
6    43    -  38  ...    Port of Spain   1 Apr 2017   T20I # 604
7    38    -  36  ...    Port of Spain   2 Apr 2017   T20I # 605
8    86    -  52  ...           Lahore  12 Sep 2017   T20I # 619
9    45    -  38  ...           Lahore  13 Sep 2017   T20I # 620
10   48    -  31  ...           Lahore  15 Sep 2017   T20I # 621
11    1    -   8  ...        Abu Dhabi  26 Oct 2017   T20I # 625
12    1    -   2  ...        Abu Dhabi  27 Oct 2017   T20I # 627
13  34*    -  31  ...           Lahore  29 Oct 2017   T20I # 629
14   41   68  41  ...       Wellington  22 Jan 2018   T20I # 639
15  50*   45  29  ...         Auckland  25 Jan 2018   T20I # 640
16   18    -  17  ...  Mount Maunganui  28 Jan 2018   T20I # 641
17   17    -  13  ...          Karachi   1 Apr 2018   T20I # 663
18  97*    -  58  ...          Karachi   2 Apr 2018   T20I # 664
19   51    -  40  ...          Karachi   3 Apr 2018   T20I # 665
20  68*    -  55  ...        Abu Dhabi  24 Oct 2018   T20I # 701
21   45    -  44  ...      Dubai (DSC)  26 Oct 2018   T20I # 702
22   50    -  40  ...      Dubai (DSC)  28 Oct 2018   T20I # 704
23    7    -   9  ...        Abu Dhabi  31 Oct 2018   T20I # 705
24   40    -  41  ...      Dubai (DSC)   2 Nov 2018   T20I # 706
25   79   66  58  ...      Dubai (DSC)   4 Nov 2018   T20I # 708
26   38   55  27  ...        Cape Town   1 Feb 2019   T20I # 732
27   90   78  58  ...     Johannesburg   3 Feb 2019   T20I # 734
28   23   10  11  ...        Centurion   6 Feb 2019   T20I # 736
29   65    -  42  ...          Cardiff   5 May 2019   T20I # 772

[30 rows x 14 columns]

Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?

Thanks
Quote
#4
(Jul-11-2019, 05:12 AM)ahmedwaqas92 Wrote: I would most probably export them to a CSV file

Link appears to be dead, so generic example how to write specific columns to file:

>>> import pandas as pd
>>> d = {'ham': [1, 2, 3], 'spam': ['a', 'b', 'c'], 'bacon': ['1a', '2b', '3c']}
>>> df = pd.DataFrame(d)
>>> df
   ham spam bacon
0    1    a    1a
1    2    b    2b
2    3    c    3c
>>> df.to_csv('out.csv', columns=['ham', 'bacon'], index=False)
The content of out.csv is:

Output:
ham,bacon 1,1a 2,2b 3,3c
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Life of Brian: Conjugate the verb, "to go" !
Quote
#5
(Jul-12-2019, 04:34 AM)ahmedwaqas92 Wrote: Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?
It's just a lot more work and it's not so easy either as have to make columns as it's not clear defined in html.
Need to clean up data to,if gone do plot or other stuff.
As shown by @perfringo Pandas make this a lot easier.

Here a NoteBook where i show some different stuff that may be needed,like clean up take out columns.
Quote
#6
(Jul-12-2019, 06:26 AM)perfringo Wrote:
(Jul-11-2019, 05:12 AM)ahmedwaqas92 Wrote: I would most probably export them to a CSV file

Link appears to be dead, so generic example how to write specific columns to file:

>>> import pandas as pd
>>> d = {'ham': [1, 2, 3], 'spam': ['a', 'b', 'c'], 'bacon': ['1a', '2b', '3c']}
>>> df = pd.DataFrame(d)
>>> df
   ham spam bacon
0    1    a    1a
1    2    b    2b
2    3    c    3c
>>> df.to_csv('out.csv', columns=['ham', 'bacon'], index=False)
The content of out.csv is:

Output:
ham,bacon 1,1a 2,2b 3,3c

(Jul-12-2019, 10:48 AM)snippsat Wrote:
(Jul-12-2019, 04:34 AM)ahmedwaqas92 Wrote: Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?
It's just a lot more work and it's not so easy either as have to make columns as it's not clear defined in html.
Need to clean up data to,if gone do plot or other stuff.
As shown by @perfringo Pandas make this a lot easier.

Here a NoteBook where i show some different stuff that may be needed,like clean up take out columns.


Apologies for the delayed response on the matter, I reviewed the detailed instructions on both these examples and then read a few docs pertaining to the use of data frames in Python. Based on all what I gathered in the last week or so I have managed to make a small Python script which captures the data as to how I want. Then it removes the excess columns, removes special characters and does some calculation as well. The results are finally presented in a pie chart - See code below

import pandas as pd
import matplotlib.pyplot as plt
dataMain = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/422108.html?class=3;spanmin1=07+Sep+2016;spanval1=span;template=results;type=batting;view=innings')
dataTabulated = dataMain[3]
columns = [0,2,3,4]
tempFrame = pd.DataFrame(dataTabulated)
dataFinal = tempFrame[tempFrame.columns[columns]]
dataFinal = dataFinal[~dataFinal.Runs.str.contains("DNB")]
dataFinal = dataFinal.replace('\*','',regex=True).astype(float)


totalRuns = dataFinal['Runs'].sum()
ballsFaced = dataFinal['BF'].sum()
fours = dataFinal['4s'].sum()
sixes = dataFinal['6s'].sum()

totalFours = fours * 4
totalSixes = sixes * 6

#This is Calculating the Total runs Scored in Boundaries and Rotation(1s,2s,3s)
boundaryRuns = totalFours + totalSixes
rotationRuns = totalRuns - boundaryRuns

#This is Calculating the Percentage runs in Rotation(1s,2s,3s) & in Boundaries (1s,2s,3s)
rotationRuns_p = rotationRuns / totalRuns * 100
boundaryRuns_p = boundaryRuns / totalRuns * 100

#Calculating Approximate & Percentage Dot balls
forBall = ballsFaced-(fours+sixes)
forRun = totalRuns-(totalFours+totalSixes)
approxDot = forBall-forRun
approxDot_p = approxDot/ballsFaced * 100
score_P = 100 - approxDot_p

print(round(approxDot_p,2))

#Plotting for Boundaries / Rotation Ratio
labels = 'Rotation', 'Boundaries'
sizes = [rotationRuns_p, boundaryRuns_p]
colors = ['yellowgreen', 'yellow']
explode = [0.1, 0] #Explode 1st Slice

#Plotting the Pie Chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()

#Plotting for Dot Ball Percentage
labels = 'Approx Dot %', 'Scoring %'
sizes = [approxDot_p, score_P]
colors = ['red', 'blue']
explode = [0.1, 0] #Explode 1st Slice

#Plotting the Pie Chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()

I guess now it seems that my problem has been resolved so you can mark this thread as solved. Your help on the matter @perfringo & @snippsat is much appreciated :)
Quote

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  BeautifulSoup: Error while extracting a value from an HTML table kawasso 0 29 7 hours ago
Last Post: kawasso
  HTML Styling Not Working yoitspython 1 113 Aug-13-2019, 06:26 AM
Last Post: fishhook
  spliting html code with br tag yokaso 11 274 Aug-07-2019, 03:18 PM
Last Post: snippsat
  How do I get rid of the HTML tags in my output? glittergirl 1 370 Aug-05-2019, 08:30 PM
Last Post: snippsat
  convert html table to json bhojendra 5 174 Jul-30-2019, 07:53 PM
Last Post: DeaD_EyE
  Getting a specific text inside an html with soup mathieugrimbert 9 390 Jul-10-2019, 12:40 PM
Last Post: mathieugrimbert
  getting options from a html form pgoosen 5 352 Jul-03-2019, 06:07 PM
Last Post: nilamo
  [Flask] html error 405 SheeppOSU 0 184 Jun-08-2019, 04:42 PM
Last Post: SheeppOSU
  [split] Using beautiful soup to get html attribute value moski 6 387 Jun-03-2019, 04:24 PM
Last Post: moski
  html error 404 SheeppOSU 1 217 Jun-03-2019, 02:19 PM
Last Post: heiner55

Forum Jump:


Users browsing this thread: 1 Guest(s)