Easy HTML Parser: Validating trs by attributes several tags deep?

**Larz60+** · Aug-14-2020, 02:30 AM

Here's a bit of a parser that uses BeautifulSoup to get you started:
needs polishing.

from bs4 import BeautifulSoup

data = """
<table align="center" border="0" cellpadding="0" cellspacing="0" class="forum_header_border" width="950">
    <tr>
        <td class="section_post_header" colspan="12">
            <h1 style="display: inline;"><u>The String I have Searched for </u> Statictext1 </h1> -
            <h2 style="display: inline;"><i>Statictext2; Statictext7 The String I have Searched for Statictext3</i></h2>
        </td>
    </tr>
    <tr>
        <td class="forum_thread_header" title="Search Information" width="35">Show</td>
        <td class="forum_thread_header" style="text-align: left; padding-left: 10px;">Item Name</td>
        <td class="forum_thread_header">Column3</td>
        <td class="forum_thread_header">Column4</td>
        <td class="forum_thread_header">Column5</td>
        <td class="forum_thread_header_end">Column6</td>
    </tr>
    <tr class="forum_header_border" name="hover">
        <td align="center" class="forum_thread_post" width="35">
            <a href="/searches/103304/the-string-i-have-searched-for/" title="The String I have Searched for Statictext4"><img alt="Info" border="0" src="/images/sdfsdf_sdfdsfs_info3.png" title="The String I have searched for Statictext5" /></a>
        </td>
        <td class="forum_thread_post">
            <a alt="The String I have Searched for d1f4 [website] (50 MB)" class="searchinfo" href="/si/146wew1729/the-string-i-have-searched-for-sdfsdfs-asdad/" title="The String I have Searched for d1f4 [website] (50 MB)">The String I have Searched for d1f4 [website]</a>
        </td>
        <td align="center" class="forum_thread_post">
            <a class="customlink" href="https://sdfsdfs"></a>
        </td>
        <td align="center" class="forum_thread_post">50 MB</td>
        <td align="center" class="forum_thread_post">1 mo</td>
        <td align="center" class="forum_thread_post_end">
            <font color="green">6</font>
        </td>
    </tr>
"""

# This code is not my own, but can't remember where I found it
def prettify(soup, indent):
    pretty_soup = str()
    previous_indent = 0
    for line in soup.prettify().split("\n"):
        current_indent = str(line).find("<")
        if current_indent == -1 or current_indent > previous_indent + 2:
            current_indent = previous_indent + 1
        previous_indent = current_indent
        pretty_soup += write_new_line(line, current_indent, indent)
    return pretty_soup

def write_new_line(line, current_indent, desired_indent):
    new_line = ""
    spaces_to_add = (current_indent * desired_indent) - current_indent
    if spaces_to_add > 0:
        for i in range(spaces_to_add):
            new_line += " "		
    new_line += str(line) + "\n"
    return new_line

def parse_html():
    soup = BeautifulSoup(data, 'lxml')
    trs = soup.find_all('tr')
    for n, tr in enumerate(trs):
        tds = tr.find_all('td')
        for n1, td in enumerate(tds):
            # print(f"/n---------------------- tr{n}, td{n1} ----------------------")
            # print(f"{prettify(td, 2)}")
            if td.a:
                link = td.a.get('href')
                title = td.a.text.strip()
                print(f"{title}: {link}")
            elif td.h1:
                print(f"h1: {td.h1.text.strip()}")
            elif td.h2:
                print(f"h2: {td.h2.text.strip()}")
            elif td.font:
                print(f"font: {td.font.text.strip()}")
            else:
                print(f"td text: {td.text.strip()}")
            

parse_html()

Produces:

Output:h1: The String I have Searched for  Statictext1
td text: Show
td text: Item Name
td text: Column3
td text: Column4
td text: Column5
td text: Column6
: /searches/103304/the-string-i-have-searched-for/
The String I have Searched for d1f4 [website]: /si/146wew1729/the-string-i-have-searched-for-sdfsdfs-asdad/
: https://sdfsdfs
td text: 50 MB
td text: 1 mo
font: 6h1: The String I have Searched for  Statictext1
td text: Show
td text: Item Name
td text: Column3
td text: Column4
td text: Column5
td text: Column6
: /searches/103304/the-string-i-have-searched-for/
The String I have Searched for d1f4 [website]: /si/146wew1729/the-string-i-have-searched-for-sdfsdfs-asdad/
: https://sdfsdfs
td text: 50 MB
td text: 1 mo
font: 6

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Python Obstacles \| Jeet-Kune-Do \| BS4 (Tags > MariaDB) [URL/Local HTML]	BrandonKastning	0	1,470	Feb-08-2022, 08:55 PM Last Post: BrandonKastning
	HTML multi select HTML listbox with Flask/Python	rfeyer	0	4,825	Mar-14-2021, 12:23 PM Last Post: rfeyer
	Any way to remove HTML tags from scraped data? (I want text only)	SeBz2020uk	1	3,575	Nov-02-2020, 08:12 PM Last Post: Larz60+
	Jinja2 HTML <a> tags not rendering properly	ChaitanyaPy	4	3,363	Jun-28-2020, 06:12 PM Last Post: ChaitanyaPy
	Extracting html data using attributes	WiPi	14	5,705	May-04-2020, 02:04 PM Last Post: snippsat
	Python3 + BeautifulSoup4 + lxml (HTML -> CSV) - How to loop to next HTML/new CSV Row	BrandonKastning	0	2,444	Mar-22-2020, 06:10 AM Last Post: BrandonKastning
	How do I get rid of the HTML tags in my output?	glittergirl	1	3,806	Aug-05-2019, 08:30 PM Last Post: snippsat
	Beutifulsoup: how to pick text that's not in HTML tags?	pitonas	4	4,838	Oct-08-2018, 01:43 PM Last Post: pitonas
	html parser	tjnichols	9	44,427	Mar-17-2018, 11:00 PM Last Post: tjnichols
	How to read html tags dynamically generated?	amandacstr	5	7,708	Mar-05-2018, 06:07 AM Last Post: snippsat

Easy HTML Parser: Validating trs by attributes several tags deep?

User Panel Messages

Announcements