Python Forum
BeautifulSoup - extract table but not using ID
#1
Hi,

I am scraping data from a web page but none of the items have an ID. I'm struggling to find an example so here goes...

The table looks like this:

Output:
<table style = " border-collapse: collapse;" border="0" cellPadding="0" summary="Transactions statistics summary table" class="750WidthClass" >
  <tr bgcolor="330066" >
    <td id="LraTransaction Name&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Transaction Name&nbsp;</span></td>
    <td id="Status" class="table_header" vAlign="top" width="80"><span class="Verdana2">SLA Status</span></td>
    <td id="LraMinimum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Minimum&nbsp;</span></td>
    <td id="LraAverage&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Average&nbsp;</span></td>
    <td id="LraMaximum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Maximum&nbsp;</span></td>
    <td id="LraStd. Deviation&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Std. Deviation&nbsp;</span></td>
    <td id="Lra80 Percent&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">80 Percent&nbsp;</span></td>
    <td id="LraPass&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Pass&nbsp;</span></td>
    <td id="LraFail&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Fail&nbsp;</span></td>
    <td id="LraStop&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Stop&nbsp;</span></td>
  </tr>
</table>
All the examples I can find online rely on an ID, but this table doesn't have one.

There are several tables on the page but to uniquely identify the one above, I'd want something like:

find/findall in table. Unique identifier: summary="Transactions statistics summary table"

for each row in the table, extract values. Unique identifier: class="Verdana2"

I want to retrieve the values:
Transaction Name&nbsp;
Minimum&nbsp;
Average&nbsp;
Maximum&nbsp;
Std. Deviation&nbsp;
80 Percent&nbsp;
Pass&nbsp;
Fail&nbsp;
Stop&nbsp;

Hope that makes sense?

Cheers,
J
#2
try:
soup.find_all('span', class_="Verdana2")
#3
Quote:There are several tables on the page but to uniquely identify the one above,
An ID is the only attribute that is meant to identify an element uniquely. Sometimes you get lucky and the class name is used only once on the page for the tag you are searching for, and sometimes you just have to pick, say, the 4th table out of your results.

soup.find('table', {'class':'750WidthClass'})
But if other tables share that class, you will need to get them all and then pick the nth table from the resulting list.
tables = soup.find_all('table', {'class':'750WidthClass'})
print(tables[3])
assuming it is the 4th table with that class
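
To make the difference between find and find_all concrete, here is a minimal sketch. The two-table HTML and its summary values are made up for illustration and are not taken from the original page; html.parser is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two tables share a class, only the summary differs.
html = """
<table class="750WidthClass" summary="First table"><tr><td>A</td></tr></table>
<table class="750WidthClass" summary="Second table"><tr><td>B</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches).
first = soup.find("table", {"class": "750WidthClass"})

# find_all() returns every match in a list-like ResultSet, in document order.
tables = soup.find_all("table", {"class": "750WidthClass"})

print(first["summary"])      # First table
print(len(tables))           # 2
print(tables[1]["summary"])  # Second table
```

Indexing by position breaks as soon as the page gains or loses a table, so it is usually worth checking first whether some other attribute singles the table out.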
#4
from bs4 import BeautifulSoup

html = '''\
<table style=" border-collapse: collapse;" border="0" cellPadding="0" summary="Transactions statistics summary table" class="750WidthClass">
  <tr bgcolor="330066">
    <td id="LraTransaction Name&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Transaction Name&nbsp;</span></td>
    <td id="Status" class="table_header" vAlign="top" width="80"><span class="Verdana2">SLA Status</span></td>
    <td id="LraMinimum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Minimum&nbsp;</span></td>
    <td id="LraAverage&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Average&nbsp;</span></td>
    <td id="LraMaximum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Maximum&nbsp;</span></td>
    <td id="LraStd. Deviation&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Std. Deviation&nbsp;</span></td>
    <td id="Lra80 Percent&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">80 Percent&nbsp;</span></td>
    <td id="LraPass&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Pass&nbsp;</span></td>
    <td id="LraFail&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Fail&nbsp;</span></td>
    <td id="LraStop&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Stop&nbsp;</span></td>
  </tr>
</table>'''

soup = BeautifulSoup(html, 'lxml')
table = soup.find(class_="750WidthClass")
verdana = table.select('.Verdana2')
for item in verdana:
    print(item.text)
Output:
Transaction Name 
SLA Status
Minimum 
Average 
Maximum 
Std. Deviation 
80 Percent 
Pass 
Fail 
Stop 
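
Since the original question asked about matching on the summary attribute, the same lookup can probably also be written as a single CSS selector. This is only a sketch on a shortened copy of the table; the headers name and the non-breaking-space cleanup are my additions. The selector goes through the summary attribute rather than the class because a class name starting with a digit may need escaping in CSS.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (three of the ten columns).
html = '''<table summary="Transactions statistics summary table" class="750WidthClass">
  <tr>
    <td><span class="Verdana2">Transaction Name&nbsp;</span></td>
    <td><span class="Verdana2">SLA Status</span></td>
    <td><span class="Verdana2">Minimum&nbsp;</span></td>
  </tr>
</table>'''

soup = BeautifulSoup(html, "html.parser")

# Attribute selector for the table, descendant selector for the spans.
spans = soup.select(
    'table[summary="Transactions statistics summary table"] span.Verdana2'
)

# &nbsp; is parsed into the character \xa0; turn it into a plain space and trim.
headers = [s.get_text().replace("\xa0", " ").strip() for s in spans]
print(headers)  # ['Transaction Name', 'SLA Status', 'Minimum']
```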
#5
@Larz60+ and @metulburr

I couldn't get either to work.

In Larz's solution, it complained about 'class'. When I typed 'cl', the closest keyword/method was classmethod, which just dumped the whole table.

In metulburr's solution, it would only return the first table and not all of them as a list.
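
On the 'class' complaint: class is a reserved word in Python, so it cannot be used as a keyword argument name. BeautifulSoup accepts class_ (with a trailing underscore) instead, or a dictionary of attributes. A small sketch on made-up HTML, showing three spellings that should select the same spans:

```python
from bs4 import BeautifulSoup

# Made-up snippet: one span with the class we want, one without.
html = '<p><span class="Verdana2">Pass&nbsp;</span><span class="other">x</span></p>'
soup = BeautifulSoup(html, "html.parser")

# soup.find_all("span", class="Verdana2")  # SyntaxError: class is a keyword

by_keyword = soup.find_all("span", class_="Verdana2")          # trailing underscore
by_dict = soup.find_all("span", {"class": "Verdana2"})         # positional attrs dict
by_attrs = soup.find_all("span", attrs={"class": "Verdana2"})  # explicit attrs=

print(len(by_keyword), len(by_dict), len(by_attrs))  # 1 1 1
```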

I was getting a bit closer and had a dirty solution where I could grab the whole table. I got that far and was then just going to split it using Verdana2 as a delimiter.

So far I had:
test1 = soup.find_all("table", attrs={"summary": "Transactions statistics summary table"})
print(test1)
I've just tried Snippsat's solution and it works a treat so I'm going to leave mine for now as I doubt it was the proper way of doing it.
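
For what it's worth, the find_all call on the summary attribute can probably be finished without any string splitting. The sketch below assumes a cut-down page with one unrelated table in front of the wanted one; the values list and the cleanup step are illustrative additions, not part of the original attempt.

```python
from bs4 import BeautifulSoup

# Cut-down, partly made-up page: an unrelated table first, then the wanted one.
html = '''<table summary="Some other table"><tr><td><span class="Verdana2">Ignore me</span></td></tr></table>
<table summary="Transactions statistics summary table">
  <tr>
    <td><span class="Verdana2">Average&nbsp;</span></td>
    <td><span class="Verdana2">Maximum&nbsp;</span></td>
  </tr>
</table>'''

soup = BeautifulSoup(html, "html.parser")
test1 = soup.find_all("table", attrs={"summary": "Transactions statistics summary table"})

values = []
# find_all returns a list-like ResultSet, so search inside each table in turn
# rather than calling find_all on the result itself.
for table in test1:
    for span in table.find_all("span", class_="Verdana2"):
        # &nbsp; arrives as \xa0; replace it with a space and trim the ends.
        values.append(span.get_text().replace("\xa0", " ").strip())

print(values)  # ['Average', 'Maximum']
```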

One thing I noticed is that Snippsat used a different parser from mine: I was using html.parser. I guess I should have made this clear, sorry. I didn't realise that it could make a difference, so the likelihood is that I was trying something with the wrong parser...doh! Sorry about that.
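
A quick way to check whether the parser matters for a given snippet is to run the same lookup through both. This is a sketch on a one-cell version of the table; header_texts and results are hypothetical names, and lxml is skipped if it is not installed. For well-formed markup like this, the two parsers would be expected to agree.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# One-cell version of the table from the question.
html = ('<table class="750WidthClass"><tr><td>'
        '<span class="Verdana2">Fail&nbsp;</span></td></tr></table>')

def header_texts(markup, parser):
    """Run the same lookup as the solution above with the given parser."""
    soup = BeautifulSoup(markup, parser)
    table = soup.find(class_="750WidthClass")
    return [item.text for item in table.select(".Verdana2")]

results = {}
for parser in ("html.parser", "lxml"):
    try:
        results[parser] = header_texts(html, parser)
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed here

print(results)
```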

This Python is slowly killing me one day at a time.... I come from a C++ background and I'm struggling, as there seem to be many ways to skin a cat with this. I guess it will make sense in due course.

Thanks for your help!

Cheers,
J

(Jan-04-2018, 11:40 AM)snippsat Wrote: [code and output quoted from post #4]

Thanks mate! Worked a treat!!! You've just saved what little hair I have left ha ha.

I really appreciate your help with this. I am struggling but getting there. Now, to go back to the documentation and fully understand your solution.

Many thanks!

J
#6
(Jan-04-2018, 05:05 AM)metulburr Wrote: [post #3 quoted in full]

Thanks for this solution! I hadn't figured out that tables sharing the same class name are returned as a list :)

Best Regards from Serbia my friend :)