BeautifulSoup - extract table but not using ID

jonesin1974 · Jan-04-2018, 02:49 AM

Hi,

I am scraping data from a web page but none of the items have an ID. I'm struggling to find an example so here goes...

The table looks like this:

Output:
<table style = " border-collapse: collapse;" border="0" cellPadding="0" summary="Transactions statistics summary table"  class="750WidthClass" >
<tr  bgcolor="330066" >
        <td id="LraTransaction Name&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Transaction Name&nbsp;</span></td>
        <td id="Status" class="table_header" vAlign="top" width="80"><span class="Verdana2">SLA Status</span></td>
        <td id="LraMinimum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Minimum&nbsp;</span></td>
        <td id="LraAverage&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Average&nbsp;</span></td>
        <td id="LraMaximum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Maximum&nbsp;</span></td>
        <td id="LraStd. Deviation&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Std. Deviation&nbsp;</span></td>
        <td id="Lra80 Percent&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">80 Percent&nbsp;</span></td>
        <td id="LraPass&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Pass&nbsp;</span></td>
        <td id="LraFail&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Fail&nbsp;</span></td>
        <td id="LraStop&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Stop&nbsp;</span></td>
</tr>
</table>

All examples online point to using ID. But that doesn't exist.

There are several tables on the page but to uniquely identify the one above, I'd want something like:

find/findall in table. Unique identifier: summary="Transactions statistics summary table"

for each row in the table, extract values. Unique identifier: class="Verdana2">

I want to retrieve the values:
Transaction Name 
Minimum
Average 
Maximum 
Std. Deviation 
80 Percent 
Pass 
Fail 
Stop 

Hope that makes sense?

Cheers,
J

**Larz60+** · Jan-04-2018, 03:12 AM

try:

find_all('span' class="Verdana2")
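As posted, this line will not run: the comma between the arguments is missing, and `class` is a reserved word in Python, so Beautiful Soup expects the keyword `class_` instead. A minimal sketch of the likely intent, assuming `soup` is built from a cut-down, hypothetical excerpt of the page:

```python
from bs4 import BeautifulSoup

# Hypothetical two-cell excerpt standing in for the real page.
html = """
<table><tr>
  <td><span class="Verdana2">Minimum</span></td>
  <td><span class="Verdana2">Average</span></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute.
spans = soup.find_all("span", class_="Verdana2")
labels = [span.get_text() for span in spans]
print(labels)
```

Note that this matches every `Verdana2` span on the page, so on a page with several tables the search would still need to be narrowed to one table first.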

***metulburr*** · (This post was last modified: Jan-04-2018, 05:05 AM by metulburr.)

Quote:There are several tables on the page but to uniquely identify the one above,

An ID is the only thing guaranteed to identify an element uniquely. Sometimes you get lucky and the class name is used by only one tag of that type on the page, and sometimes you just have to pick, say, the 4th table out of your results.

soup.find('table', {'class':'750WidthClass'})

but if the other tables have that same class, then you will need to get them all and pick the nth table from the results.

tables = soup.find_all('table', {'class':'750WidthClass'})
print(tables[3])

assuming it is the 4th table with that class
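When the tables share a class but differ in some other attribute, that attribute can stand in for the index. The sketch below assumes the `summary` text is unique on the page (the thread does not confirm this) and uses a made-up second table for contrast; it selects the table with a CSS attribute selector instead of by position:

```python
from bs4 import BeautifulSoup

# Hypothetical page: two tables share a class and differ only in summary.
html = """
<table class="750WidthClass" summary="Some other table">
  <tr><td><span class="Verdana2">Ignore me</span></td></tr>
</table>
<table class="750WidthClass" summary="Transactions statistics summary table">
  <tr><td><span class="Verdana2">Pass</span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# By position: works, but breaks if the page gains or loses a table.
by_index = soup.find_all("table", {"class": "750WidthClass"})[1]

# By attribute: select_one returns the first match, or None if there is none.
by_summary = soup.select_one('table[summary="Transactions statistics summary table"]')

print(by_summary is by_index)
print(by_summary.span.get_text())
```

Selecting by attribute tends to survive page changes better than a hard-coded index, at the cost of depending on the exact attribute text.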

***snippsat*** · Jan-04-2018, 11:40 AM

from bs4 import BeautifulSoup

html = '''\
<table style=" border-collapse: collapse;" border="0" cellPadding="0" summary="Transactions statistics summary table" class="750WidthClass">
  <tr bgcolor="330066">
    <td id="LraTransaction Name&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Transaction Name&nbsp;</span></td>
    <td id="Status" class="table_header" vAlign="top" width="80"><span class="Verdana2">SLA Status</span></td>
    <td id="LraMinimum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Minimum&nbsp;</span></td>
    <td id="LraAverage&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Average&nbsp;</span></td>
    <td id="LraMaximum&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Maximum&nbsp;</span></td>
    <td id="LraStd. Deviation&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Std. Deviation&nbsp;</span></td>
    <td id="Lra80 Percent&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">80 Percent&nbsp;</span></td>
    <td id="LraPass&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Pass&nbsp;</span></td>
    <td id="LraFail&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Fail&nbsp;</span></td>
    <td id="LraStop&nbsp;" class="table_header_for_html_report" vAlign="top"><span class="Verdana2">Stop&nbsp;</span></td>
  </tr>
</table>'''

soup = BeautifulSoup(html, 'lxml')
table = soup.find(class_="750WidthClass")
verdana = table.select('.Verdana2')
for item in verdana:
    print(item.text)

Output:
Transaction Name 
SLA Status
Minimum 
Average 
Maximum 
Std. Deviation 
80 Percent 
Pass 
Fail 
Stop

jonesin1974 · (This post was last modified: Jan-04-2018, 12:40 PM by jonesin1974.)

@Larz60+ and @metulburr

I couldn't get either to work.

In Larz's solution, it complained about 'class'. When I typed 'cl', the closest keyword/method was classmethod which just dumped the whole table.

In metulburr's solution, it would only return the first table and not all of them as a list.

I was getting a bit closer, had a dirty solution where I could grab the whole table. Got that far and was then just going to split it using Verdana2 as a delimiter.

So far I had:

test1 = soup.find_all("table", attrs={"summary": "Transactions statistics summary table"})
print(test1)
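This attempt is close to a working answer. `find_all` returns a list-like result set of whole tables, which is why printing it dumps all of the markup; looping over the matches, and then over the spans inside each one, yields just the header text. A sketch, assuming the page resembles the trimmed, hypothetical excerpt below:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical excerpt of the table from the first post.
html = """
<table summary="Transactions statistics summary table" class="750WidthClass">
  <tr>
    <td><span class="Verdana2">Transaction Name&nbsp;</span></td>
    <td><span class="Verdana2">Minimum&nbsp;</span></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs accepts a dict, so any attribute (not just id or class) can be matched.
matches = soup.find_all("table", attrs={"summary": "Transactions statistics summary table"})

headers = []
for table in matches:
    for span in table.find_all("span", class_="Verdana2"):
        # strip=True trims surrounding whitespace, including the trailing &nbsp;
        headers.append(span.get_text(strip=True))

print(headers)
```

This avoids splitting the raw markup on `Verdana2` as a delimiter, which would be fragile if the attribute order or quoting ever changed.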

I've just tried Snippsat's solution and it works a treat so I'm going to leave mine for now as I doubt it was the proper way of doing it.

One thing I noticed is that Snippsat used a different parser from mine: I was using html.parser. I guess I should have made this clear, sorry. I didn't realise that it made a difference, so the likelihood is that I was trying something with the wrong parser... doh! Sorry about that.
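For markup as regular as this table, the choice of parser is unlikely to have been the cause; the parsers differ mainly in how they repair broken HTML. One rough way to check, using a hypothetical helper named `header_labels` and a trimmed copy of the table, is to run the same lookup under each parser (lxml is skipped if it is not installed):

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Trimmed, hypothetical copy of the table from the thread.
html = (
    '<table summary="Transactions statistics summary table" class="750WidthClass">'
    '<tr><td><span class="Verdana2">Pass&nbsp;</span></td>'
    '<td><span class="Verdana2">Fail&nbsp;</span></td></tr></table>'
)


def header_labels(markup, parser):
    """Return the Verdana2 span texts from the 750WidthClass table."""
    soup = BeautifulSoup(markup, parser)
    table = soup.find(class_="750WidthClass")
    return [span.get_text(strip=True) for span in table.select(".Verdana2")]


results = {"html.parser": header_labels(html, "html.parser")}
try:
    results["lxml"] = header_labels(html, "lxml")
except FeatureNotFound:
    pass  # lxml is an optional dependency and may not be installed

for parser, labels in results.items():
    print(parser, labels)
```

If both parsers print the same labels for the real page as well, the earlier failures most likely came from the lookup calls themselves and not from the parser.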

Python is slowly killing me one day at a time... I come from a C++ background and am struggling, as there seem to be many ways to skin a cat with it. I guess it will make sense in due course.

Thanks for your help!

Cheers,
J

(In reply to snippsat's post of Jan-04-2018, 11:40 AM)

Thanks mate! Worked a treat!!! You've just saved what little hair I have left ha ha.

I really appreciate your help with this. I am struggling but getting there. Now, to go back to the documentation and fully understand your solution.

Many thanks!

J

NinoBaus · Apr-27-2018, 07:22 PM

(In reply to metulburr's post of Jan-04-2018, 05:05 AM)

Thanks for this solution! I hadn't figured out that tables sharing the same class name are returned as a list :)

Best Regards from Serbia my friend :)
