Jul-11-2019, 05:12 AM
Hello,
I am trying to get the complete values of three distinct columns from an online HTML table that has 13 columns in total. I have parsed the HTML content using Beautiful Soup and can display the values as text using the code below. However, the output simply lists everything in the table, and I can't figure out how to extract only the columns that I need. The code I have written so far is below:
import re
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = 'http://stats.espncricinfo.com/ci/engine/player/348144.html?class=3;template=results;type=batting;view=innings'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container_main = page_soup.findAll("div", {"id":"ciHomeContent"})
container_main = container_main[0]
container_secondary = container_main.findAll("div", {"id":"ciMainContainer"})
container_secondary = container_secondary[0]
container_tertiary = container_secondary.findAll("div", {"id":"ciHomeContentlhs"})
container_tertiary = container_tertiary[0]
container_sublevel = container_tertiary.findAll("div", {"class":"pnl650M"})
container_sublevel = container_sublevel[0]
container_mainTable = container_sublevel.findAll("table", {"class":"engineTable"})
container_mainTable = container_mainTable[3]
for table in container_mainTable:
    Test = container_mainTable.tbody
    Test = Test.text
    print (Test)

This code gives me the following output:
15* 13 11 2 0 136.36 3 not out 2 v England Manchester 7 Sep 2016 T20I # 566
----- New Record Begins (Inserted by User) -----
55* 49 37 6 2 148.64 3 not out 2 v West Indies Dubai (DSC) 23 Sep 2016 T20I # 568
----- New Record Begins (Until the final row in the table) -----

From the above output, I just want the values (each enclosed in a <td></td> tag) in the first column (15*, 55*, ...), the 3rd column (11, 37, ...), the 4th column (2, 6, ...) and the 5th column (0, 2, ...). After that, I will most probably export them to a CSV file and generate Gnuplot graphs and other charts.
The data that I am scraping is hosted at http://stats.espncricinfo.com/ci/engine/...ew=innings and below are the first few HTML tags that I am trying to extract this data from:
<tbody>
<tr class="data1">
  <td>15*</td>
  <td>13</td>
  <td>11</td>
  <td>2</td>
  <td>0</td>
  <td>136.36</td>
  <td>3</td>
  <td nowrap="nowrap">not out</td>
  <td>2</td>
  <td></td>
  <td class="left" nowrap="nowrap">v <a href="/ci/content/team/1.html" class="data-link">England</a></td>
  <td class="left" nowrap="nowrap"><a href="/ci/content/ground/57160.html" class="data-link">Manchester</a></td>
  <td nowrap="nowrap"><b>7 Sep 2016</b></td>
  <td style="white-space: nowrap;"><a href="/ci/engine/match/913663.html" title="view the scorecard for this row">T20I # 566</a></td>
</tr>
<tr class="data1">
  <td>55*</td>
  <td>49</td>
  <td>37</td>
  <td>6</td>
  <td>2</td>
  <td>148.64</td>
  <td>3</td>
  <td nowrap="nowrap">not out</td>
  <td>2</td>
  <td></td>
  <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
  <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td>
  <td nowrap="nowrap"><b>23 Sep 2016</b></td>
  <td style="white-space: nowrap;"><a href="/ci/engine/match/1050217.html" title="view the scorecard for this row">T20I # 568</a></td>
</tr>
<tr class="data1">
  <td class="padAst">19</td>
  <td>28</td>
  <td>18</td>
  <td>2</td>
  <td>0</td>
  <td>105.55</td>
  <td>3</td>
  <td>caught</td>
  <td>1</td>
  <td></td>
  <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
  <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td>
  <td nowrap="nowrap"><b>24 Sep 2016</b></td>
  <td style="white-space: nowrap;"><a href="/ci/engine/match/1050219.html" title="view the scorecard for this row">T20I # 569</a></td>
</tr>
<tr class="data1">
  <td>27*</td>
  <td>42</td>
  <td>24</td>
  <td>1</td>
  <td>0</td>
  <td>112.50</td>
  <td>3</td>
  <td nowrap="nowrap">not out</td>
  <td>2</td>
  <td></td>
  <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td>
  <td class="left" nowrap="nowrap"><a href="/ci/content/ground/59396.html" class="data-link">Abu Dhabi</a></td>
  <td nowrap="nowrap"><b>27 Sep 2016</b></td>
  <td style="white-space: nowrap;"><a href="/ci/engine/match/1050221.html" title="view the scorecard for this row">T20I # 570</a></td>
</tr>

Any help on the matter would be highly appreciated.
Thanks
Waqas
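
One possible way to approach this is to walk the table row by row, collect the text of each <td> cell, and keep only the cell positions you want. The sketch below is not tested against the live page: it runs on a shortened copy of the rows quoted in the post, and it uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra installs. With Beautiful Soup the same idea would be to loop over the tr elements and index the list returned by find_all("td"). The names RowCollector, pick_columns, WANTED and SAMPLE, and the CSV header labels, are made up for illustration, and the column positions (1st, 3rd, 4th, 5th) are taken from the description in the post.

```python
import csv
import io
from html.parser import HTMLParser

# Zero-based positions of the wanted cells in each row. Assumption: these
# match the 1st, 3rd, 4th and 5th columns described in the post.
WANTED = (0, 2, 3, 4)


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> or <b> also ends up here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def pick_columns(html, wanted=WANTED):
    """Return only the wanted cells of each row, skipping rows that are too short."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [[row[i] for i in wanted] for row in parser.rows if len(row) > max(wanted)]


# Shortened copy of the first two rows quoted in the post.
SAMPLE = """
<tbody>
<tr class="data1"> <td>15*</td> <td>13</td> <td>11</td> <td>2</td> <td>0</td> <td>136.36</td> </tr>
<tr class="data1"> <td>55*</td> <td>49</td> <td>37</td> <td>6</td> <td>2</td> <td>148.64</td> </tr>
</tbody>
"""

selected = pick_columns(SAMPLE)

# Write the selected columns as CSV. An in-memory buffer is used here;
# a real file opened with newline="" would work the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["col1", "col3", "col4", "col5"])  # placeholder header labels
writer.writerows(selected)
print(buffer.getvalue())
```

To apply this to the real page, the decoded page HTML (or just the text of the table you already located) would be passed to pick_columns in place of SAMPLE. Whether the positions in WANTED line up with the live table is something to check against the actual rows.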