How to capture Single Column from Web Html Table? - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html) +--- Thread: How to capture Single Column from Web Html Table? (/thread-19699.html) |
How to capture Single Column from Web Html Table? - ahmedwaqas92 - Jul-11-2019 Hello, I am trying to get complete values of three distinct columns from an online HTML table that has 13 columns in total. I have parsed the Html content using beautiful soup and can display the values as text files using the code below. However, the output generated just lists out everything in the table and I can't seem to figure out how to simply extract the columns that I need. The code (which I have written until now) is as below: import re from bs4 import BeautifulSoup as soup from urllib.request import urlopen as uReq my_url = 'http://stats.espncricinfo.com/ci/engine/player/348144.html?class=3;template=results;type=batting;view=innings' uClient = uReq(my_url) page_html = uClient.read() uClient.close() page_soup = soup(page_html, "html.parser") container_main = page_soup.findAll("div", {"id":"ciHomeContent"}) container_main = container_main[0] container_secondary = container_main.findAll("div", {"id":"ciMainContainer"}) container_secondary = container_secondary[0] container_tertiary = container_secondary.findAll("div", {"id":"ciHomeContentlhs"}) container_tertiary = container_tertiary[0] container_sublevel = container_tertiary.findAll("div", {"class":"pnl650M"}) container_sublevel = container_sublevel[0] container_mainTable = container_sublevel.findAll("table", {"class":"engineTable"}) container_mainTable = container_mainTable[3] for table in container_mainTable: Test = container_mainTable.tbody Test = Test.text print (Test)This code gives me the following output: 15* 13 11 2 0 136.36 3 not out 2 v England Manchester 7 Sep 2016 T20I # 566 ----------------------------------------------New Record Begins (Inserted by User)------------------------------------------- 55* 49 37 6 2 148.64 3 not out 2 v West Indies Dubai (DSC) 23 Sep 2016 T20I # 568 -----------------------------------------------New Record Begins (Until the final row in the table)---------------------------------------From the above output, I just want values (which are denoted by between a '[td][/td]' tag) in the first column (15*, 55*.......). Values in 3rd Column (11, 37.....) and values in the 4th (2, 6.....) & 5th (0, 2......) columns. Following that, I would most probably export them to a CSV file along with generating Gnuplot graphs and other charts. The data that I am scraping here is hosted in here http://stats.espncricinfo.com/ci/engine/player/348144.html?class=3;template=results;type=batting;view=innings and below are the first few HTML tags that I am trying to export this data from <tbody> <tr class="data1"> <td>15*</td> <td>13</td> <td>11</td> <td>2</td> <td>0</td> <td>136.36</td> <td>3</td> <td nowrap="nowrap">not out</td> <td>2</td> <td></td> <td class="left" nowrap="nowrap">v <a href="/ci/content/team/1.html" class="data-link">England</a></td> <td class="left" nowrap="nowrap"><a href="/ci/content/ground/57160.html" class="data-link">Manchester</a></td> <td nowrap="nowrap"><b>7 Sep 2016</b></td> <td style="white-space: nowrap;"><a href="/ci/engine/match/913663.html" title="view the scorecard for this row">T20I # 566</a></td> </tr> <tr class="data1"> <td>55*</td> <td>49</td> <td>37</td> <td>6</td> <td>2</td> <td>148.64</td> <td>3</td> <td nowrap="nowrap">not out</td> <td>2</td> <td></td> <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td> <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td> <td nowrap="nowrap"><b>23 Sep 2016</b></td> <td style="white-space: nowrap;"><a href="/ci/engine/match/1050217.html" title="view the scorecard for this row">T20I # 568</a></td> </tr> <tr class="data1"> <td class="padAst">19</td> <td>28</td> <td>18</td> <td>2</td> <td>0</td> <td>105.55</td> <td>3</td> <td>caught</td> <td>1</td> <td></td> <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td> <td class="left" nowrap="nowrap"><a href="/ci/content/ground/392627.html" class="data-link">Dubai (DSC)</a></td> <td nowrap="nowrap"><b>24 Sep 2016</b></td> <td style="white-space: nowrap;"><a href="/ci/engine/match/1050219.html" title="view the scorecard for this row">T20I # 569</a></td> </tr> <tr class="data1"> <td>27*</td> <td>42</td> <td>24</td> <td>1</td> <td>0</td> <td>112.50</td> <td>3</td> <td nowrap="nowrap">not out</td> <td>2</td> <td></td> <td class="left" nowrap="nowrap">v <a href="/ci/content/team/4.html" class="data-link">West Indies</a></td> <td class="left" nowrap="nowrap"><a href="/ci/content/ground/59396.html" class="data-link">Abu Dhabi</a></td> <td nowrap="nowrap"><b>27 Sep 2016</b></td> <td style="white-space: nowrap;"><a href="/ci/engine/match/1050221.html" title="view the scorecard for this row">T20I # 570</a></td> </tr>Any help on the matter would be highly appreciated. Thanks Waqas RE: How to capture Single Column from Web Html Table? - perfringo - Jul-11-2019 Is having table as pandas dataframe is ok? >>> df = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/348144.htmlclass=3;template=results;type=batting;view=innings') >>> df[3] Runs Mins BF ... Ground Start Date Unnamed: 13 0 15* 13 11 ... Manchester 7 Sep 2016 T20I # 566 1 55* 49 37 ... Dubai (DSC) 23 Sep 2016 T20I # 568 2 19 28 18 ... Dubai (DSC) 24 Sep 2016 T20I # 569 3 27* 42 24 ... Abu Dhabi 27 Sep 2016 T20I # 570 4 29 - 30 ... Bridgetown 26 Mar 2017 T20I # 602 5 27 - 28 ... Port of Spain 30 Mar 2017 T20I # 603 6 43 - 38 ... Port of Spain 1 Apr 2017 T20I # 604 7 38 - 36 ... Port of Spain 2 Apr 2017 T20I # 605 8 86 - 52 ... Lahore 12 Sep 2017 T20I # 619 9 45 - 38 ... Lahore 13 Sep 2017 T20I # 620 10 48 - 31 ... Lahore 15 Sep 2017 T20I # 621 11 1 - 8 ... Abu Dhabi 26 Oct 2017 T20I # 625 12 1 - 2 ... Abu Dhabi 27 Oct 2017 T20I # 627 13 34* - 31 ... Lahore 29 Oct 2017 T20I # 629 14 41 68 41 ... Wellington 22 Jan 2018 T20I # 639 15 50* 45 29 ... Auckland 25 Jan 2018 T20I # 640 16 18 - 17 ... Mount Maunganui 28 Jan 2018 T20I # 641 17 17 - 13 ... Karachi 1 Apr 2018 T20I # 663 18 97* - 58 ... Karachi 2 Apr 2018 T20I # 664 19 51 - 40 ... Karachi 3 Apr 2018 T20I # 665 20 68* - 55 ... Abu Dhabi 24 Oct 2018 T20I # 701 21 45 - 44 ... Dubai (DSC) 26 Oct 2018 T20I # 702 22 50 - 40 ... Dubai (DSC) 28 Oct 2018 T20I # 704 23 7 - 9 ... Abu Dhabi 31 Oct 2018 T20I # 705 24 40 - 41 ... Dubai (DSC) 2 Nov 2018 T20I # 706 25 79 66 58 ... Dubai (DSC) 4 Nov 2018 T20I # 708 26 38 55 27 ... Cape Town 1 Feb 2019 T20I # 732 27 90 78 58 ... Johannesburg 3 Feb 2019 T20I # 734 28 23 10 11 ... Centurion 6 Feb 2019 T20I # 736 29 65 - 42 ... Cardiff 5 May 2019 T20I # 772 [30 rows x 14 columns] RE: How to capture Single Column from Web Html Table? - ahmedwaqas92 - Jul-12-2019 (Jul-11-2019, 06:50 AM)perfringo Wrote: Is having table as pandas dataframe is ok? Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need? Thanks RE: How to capture Single Column from Web Html Table? - perfringo - Jul-12-2019 (Jul-11-2019, 05:12 AM)ahmedwaqas92 Wrote: I would most probably export them to a CSV file Link appears to be dead, so generic example how to write specific columns to file: >>> import pandas as pd >>> d = {'ham': [1, 2, 3], 'spam': ['a', 'b', 'c'], 'bacon': ['1a', '2b', '3c']} >>> df = pd.DataFrame(d) >>> df ham spam bacon 0 1 a 1a 1 2 b 2b 2 3 c 3c >>> df.to_csv('out.csv', columns=['ham', 'bacon'], index=False)The content of out.csv is:
RE: How to capture Single Column from Web Html Table? - snippsat - Jul-12-2019 (Jul-12-2019, 04:34 AM)ahmedwaqas92 Wrote: Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?It's just a lot more work and it's not so easy either as have to make columns as it's not clear defined in html. Need to clean up data to,if gone do plot or other stuff. As shown by @perfringo Pandas make this a lot easier. Here a NoteBook where i show some different stuff that may be needed,like clean up take out columns. RE: How to capture Single Column from Web Html Table? - ahmedwaqas92 - Jul-29-2019 (Jul-12-2019, 06:26 AM)perfringo Wrote:(Jul-11-2019, 05:12 AM)ahmedwaqas92 Wrote: I would most probably export them to a CSV file (Jul-12-2019, 10:48 AM)snippsat Wrote:(Jul-12-2019, 04:34 AM)ahmedwaqas92 Wrote: Do I have to restructure my code from scratch? Is there no way I can use my existing code to get the columns that I might need?It's just a lot more work and it's not so easy either as have to make columns as it's not clear defined in html. Apologies for the delayed response on the matter, I reviewed the detailed instructions on both these examples and then read a few docs pertaining to the use of data frames in Python. Based on all what I gathered in the last week or so I have managed to make a small Python script which captures the data as to how I want. Then it removes the excess columns, removes special characters and does some calculation as well. The results are finally presented in a pie chart - See code below import pandas as pd import matplotlib.pyplot as plt dataMain = pd.read_html('http://stats.espncricinfo.com/ci/engine/player/422108.html?class=3;spanmin1=07+Sep+2016;spanval1=span;template=results;type=batting;view=innings') dataTabulated = dataMain[3] columns = [0,2,3,4] tempFrame = pd.DataFrame(dataTabulated) dataFinal = tempFrame[tempFrame.columns[columns]] dataFinal = dataFinal[~dataFinal.Runs.str.contains("DNB")] dataFinal = dataFinal.replace('\*','',regex=True).astype(float) totalRuns = dataFinal['Runs'].sum() ballsFaced = dataFinal['BF'].sum() fours = dataFinal['4s'].sum() sixes = dataFinal['6s'].sum() totalFours = fours * 4 totalSixes = sixes * 6 #This is Calculating the Total runs Scored in Boundaries and Rotation(1s,2s,3s) boundaryRuns = totalFours + totalSixes rotationRuns = totalRuns - boundaryRuns #This is Calculating the Percentage runs in Rotation(1s,2s,3s) & in Boundaries (1s,2s,3s) rotationRuns_p = rotationRuns / totalRuns * 100 boundaryRuns_p = boundaryRuns / totalRuns * 100 #Calculating Approximate & Percentage Dot balls forBall = ballsFaced-(fours+sixes) forRun = totalRuns-(totalFours+totalSixes) approxDot = forBall-forRun approxDot_p = approxDot/ballsFaced * 100 score_P = 100 - approxDot_p print(round(approxDot_p,2)) #Plotting for Boundaries / Rotation Ratio labels = 'Rotation', 'Boundaries' sizes = [rotationRuns_p, boundaryRuns_p] colors = ['yellowgreen', 'yellow'] explode = [0.1, 0] #Explode 1st Slice #Plotting the Pie Chart plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140) plt.axis('equal') plt.show() #Plotting for Dot Ball Percentage labels = 'Approx Dot %', 'Scoring %' sizes = [approxDot_p, score_P] colors = ['red', 'blue'] explode = [0.1, 0] #Explode 1st Slice #Plotting the Pie Chart plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140) plt.axis('equal') plt.show() I guess now it seems that my problem has been resolved so you can mark this thread as solved. Your help on the matter @perfringo & @snippsat is much appreciated :) |