Python Forum
Easy HTML Parser: Validating trs by attributes several tags deep?
#1
I am using an Easy HTML Parser expression in a JSON file to select the table whose rows I need:

"row": "find_once('table', ('class', 'forum_hdr_bord'), order=3).find_all('tr')"
Sometimes this table can contain the wrong rows, but if I can check the attributes in the anchor tags against a {variable} I can validate the rows.

I have been looking at the various options which return parents, like:

find_with_root(name, *args)
take_with_root(*args)
match_with_root(*args)
but the issue is that I need to reach from the table through the tr tags, the td tags, and then the anchor tags to check the attribute, and at the end just return the rows.
Is this kind of validation possible?
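
To picture the check being asked for independently of any particular library: walk the table, note when a `tr` is open, and record the attributes of every `a` found beneath it. The sketch below uses only the standard library's `html.parser`. It is not Easy HTML Parser code, and the names `RowAnchorCollector` and `valid_rows` and the sample markup are made up for illustration. It also assumes the table contains no nested tables.

```python
from html.parser import HTMLParser


class RowAnchorCollector(HTMLParser):
    """Collect the title attribute of every <a> nested inside each <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of anchor titles per <tr>
        self._in_row = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_row = True
            self.rows.append([])
        elif tag == "a" and self._in_row:
            title = dict(attrs).get("title")
            if title is not None:
                self.rows[-1].append(title)

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False


def valid_rows(html, expected):
    """Return the indexes of rows where some anchor title contains `expected`."""
    collector = RowAnchorCollector()
    collector.feed(html)
    return [i for i, titles in enumerate(collector.rows)
            if any(expected in t for t in titles)]


# Hypothetical markup, loosely modelled on the kind of table described above.
sample = """
<table>
  <tr><td>Show</td><td>Item Name</td></tr>
  <tr><td><a href="/si/1/" title="The String I have Searched for d1f4">x</a></td></tr>
  <tr><td><a href="/si/2/" title="Something unrelated">y</a></td></tr>
</table>
"""

print(valid_rows(sample, "The String I have Searched for"))
```

The header row has no anchors and the last row's title does not match, so only the middle row's index is reported.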
#2
what is the URL?
#3
(Aug-13-2020, 08:38 PM)Larz60+ Wrote: what is the URL?

I hope you can understand: this is for something sensitive, so I cannot share the URL. What I can do is share an edited paste of the table with its first row (there are many tables, each with many rows): https://del.dog/unamogruna.txt

Also, I was able to do this when I tested with BeautifulSoup earlier today. The only issue is that, because I am editing someone else's code, I have to use Easy HTML Parser: https://pydoc.net/ehp/2.0.1/ehp/. The BeautifulSoup code that works is:

page_soup.select('table')[5].select('tr:contains("The String that I have Searched for")')
#4
Here's a bit of a parser that uses BeautifulSoup to get you started.
It needs polishing.
from bs4 import BeautifulSoup

data = """
<table align="center" border="0" cellpadding="0" cellspacing="0" class="forum_header_border" width="950">
    <tr>
        <td class="section_post_header" colspan="12">
            <h1 style="display: inline;"><u>The String I have Searched for </u> Statictext1 </h1> -
            <h2 style="display: inline;"><i>Statictext2; Statictext7 The String I have Searched for Statictext3</i></h2>
        </td>
    </tr>
    <tr>
        <td class="forum_thread_header" title="Search Information" width="35">Show</td>
        <td class="forum_thread_header" style="text-align: left; padding-left: 10px;">Item Name</td>
        <td class="forum_thread_header">Column3</td>
        <td class="forum_thread_header">Column4</td>
        <td class="forum_thread_header">Column5</td>
        <td class="forum_thread_header_end">Column6</td>
    </tr>
    <tr class="forum_header_border" name="hover">
        <td align="center" class="forum_thread_post" width="35">
            <a href="/searches/103304/the-string-i-have-searched-for/" title="The String I have Searched for Statictext4"><img alt="Info" border="0" src="/images/sdfsdf_sdfdsfs_info3.png" title="The String I have searched for Statictext5" /></a>
        </td>
        <td class="forum_thread_post">
            <a alt="The String I have Searched for d1f4 [website] (50 MB)" class="searchinfo" href="/si/146wew1729/the-string-i-have-searched-for-sdfsdfs-asdad/" title="The String I have Searched for d1f4 [website] (50 MB)">The String I have Searched for d1f4 [website]</a>
        </td>
        <td align="center" class="forum_thread_post">
            <a class="customlink" href="https://sdfsdfs"></a>
        </td>
        <td align="center" class="forum_thread_post">50 MB</td>
        <td align="center" class="forum_thread_post">1 mo</td>
        <td align="center" class="forum_thread_post_end">
            <font color="green">6</font>
        </td>
    </tr>
</table>
"""

# This code is not my own, but can't remember where I found it
def prettify(soup, indent):
    pretty_soup = str()
    previous_indent = 0
    for line in soup.prettify().split("\n"):
        current_indent = str(line).find("<")
        if current_indent == -1 or current_indent > previous_indent + 2:
            current_indent = previous_indent + 1
        previous_indent = current_indent
        pretty_soup += write_new_line(line, current_indent, indent)
    return pretty_soup

def write_new_line(line, current_indent, desired_indent):
    new_line = ""
    spaces_to_add = (current_indent * desired_indent) - current_indent
    if spaces_to_add > 0:
        new_line += " " * spaces_to_add
    new_line += str(line) + "\n"
    return new_line

def parse_html():
    soup = BeautifulSoup(data, 'lxml')
    trs = soup.find_all('tr')
    for n, tr in enumerate(trs):
        tds = tr.find_all('td')
        for n1, td in enumerate(tds):
            # print(f"\n---------------------- tr{n}, td{n1} ----------------------")
            # print(f"{prettify(td, 2)}")
            if td.a:
                link = td.a.get('href')
                title = td.a.text.strip()
                print(f"{title}: {link}")
            elif td.h1:
                print(f"h1: {td.h1.text.strip()}")
            elif td.h2:
                print(f"h2: {td.h2.text.strip()}")
            elif td.font:
                print(f"font: {td.font.text.strip()}")
            else:
                print(f"td text: {td.text.strip()}")
            

parse_html()
Produces:
Output:
h1: The String I have Searched for Statictext1
td text: Show
td text: Item Name
td text: Column3
td text: Column4
td text: Column5
td text: Column6
: /searches/103304/the-string-i-have-searched-for/
The String I have Searched for d1f4 [website]: /si/146wew1729/the-string-i-have-searched-for-sdfsdfs-asdad/
: https://sdfsdfs
td text: 50 MB
td text: 1 mo
font: 6
#5
(Aug-14-2020, 02:30 AM)Larz60+ Wrote: Here's a bit of a parser that uses BeautifulSoup to get you started:

Thank you, I appreciate that. I find BeautifulSoup very cool, and I was following some tutorials on it before I realised yesterday that the code I am editing uses Easy HTML Parser instead. This is only my third day of Python, so I should probably learn the basics before jumping into web scraping, but I was able to get this broken scraper to work. The only remaining issue is that the table displays random results when there are no matches.

I was hoping that, with a slight edit to the line I already posted, I could make it do something similar to what the single line of BeautifulSoup code I quoted does. Since you responded with a many-line BeautifulSoup solution, does this mean I am wasting my time trying to do it with a single concise line of Easy HTML Parser code?
#6
Single-line code, in my opinion, is good if it's easily understandable and efficient.
Otherwise, I personally would avoid it. The solution that you show, although it looks quite simple,
is already causing an implementation delay. Just keep that in mind.
#7
(Aug-14-2020, 08:43 AM)Larz60+ Wrote: Single-line code, in my opinion, is good if it's easily understandable and efficient.
Otherwise, I personally would avoid it. The solution that you show, although it looks quite simple,
is already causing an implementation delay. Just keep that in mind.

Noted, thank you. I think I understand what the BS line I quoted is doing: it selects the table, and then checks all attributes to see if any contain the text. I understand that this is not efficient, as really it should only be checking specific attributes. In fact, it would be most efficient to check the attribute of just one row because, the way this table works, if one row is valid, all of the rows will be valid.
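
The all-or-nothing idea above can be sketched in plain Python. The function below assumes the anchor titles have already been pulled out of each row; the list-of-lists shape and the name `rows_if_valid` are hypothetical. It inspects only the first row that has any anchor and returns either every row or none.

```python
def rows_if_valid(rows, expected):
    """Return all rows if the first row with an anchor title matches, else [].

    `rows` is a list of rows; each row is a list of anchor title strings.
    """
    for titles in rows:
        if titles:  # skip header rows that contain no anchors
            return rows if any(expected in t for t in titles) else []
    return []  # no row had an anchor to check


# Hypothetical data: one header row followed by two result rows.
table = [
    [],
    ["The String I have Searched for Statictext4",
     "The String I have Searched for d1f4 [website] (50 MB)"],
    ["The String I have Searched for e2g5 [website] (70 MB)"],
]

print(len(rows_if_valid(table, "The String I have Searched for")))
print(rows_if_valid(table, "A different search"))
```

Checking a single row keeps the cost constant regardless of table size, at the price of trusting that the table really is all-valid or all-invalid.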

With that in mind, considering the Easy HTML Parser documentation I linked that explains its version of methods like find_all etc, would you have any idea how I can edit the current line, to output rows after validating an anchor attribute within a cell of a single or multiple rows?

I think I need something like the following, which I know is wrong because I am using the wrong method, the wrong syntax, or the wrong order (probably all three):

"row": "find_with_root(('a', ('title', 'The String I have Searched for')).find_with_root('td').find_all('tr')"

This is a grep of the "row" lines from the JSON for other scrapers. I have gone through each one, trying to change them to fit what I want to do but been unsuccessful. Perhaps they would make more sense to you?

https://hastebin.com/esuheweguz.sql
#8
After looking at a non-zero line for another scraper:
"infohash": "item.take_with_root(('src', 'http://img.abcdefg.com/pic/magnet-icon-12w-12h.gif'))[0].attr['href'].replace('protocol:?xt=urn:btih:', '') if item.take_with_root(('src', 'http://img.abcdefg.com/pic/protocol-icon-49sw-12h.gif')) else ''",
I have been experimenting with conditionals:

"row": "find_once('table', ('class', 'forum_header_border'), order=3).find_all('tr') if 1<2 else ''",
It functions as normal depending on the condition. What I need now is for the condition to test whether the search string, which is the value of the input at:

/html/body/div[1]/table[1]/tbody/tr/td/form/div[1]/input
matches the same number of characters from the beginning of my table:
/html/body/div[1]/table[5]/tbody/tr[3]/td[2]/a
Is that something you are familiar with?
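
Comparing "the same number of characters from the beginning" amounts to a prefix test. A minimal sketch follows, assuming both values are already available as plain strings. How the search input and the anchor text reach the JSON expression depends on the scraper framework, which this thread does not show, and `starts_with_query` is an illustrative name.

```python
def starts_with_query(anchor_text, query):
    """True if anchor_text begins with query, ignoring case and outer spaces."""
    query = query.strip().casefold()
    if not query:
        return False  # an empty search should not validate anything
    return anchor_text.strip().casefold().startswith(query)


print(starts_with_query("The String I have Searched for d1f4 [website]",
                        "the string i have searched for"))
print(starts_with_query("Some random result",
                        "the string i have searched for"))
```

Case folding is a design choice here; drop it if the match should be case-sensitive.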

@Mod: Can you please edit my post as both gifs in the first line of code should read 'protocol-icon-49sw-12h.gif'

And infohash should read 'anotherline'.