Regular Expression

DeaD_EyE · Jul-05-2023, 01:18 PM

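Regex is not a good tool for parsing HTML. I can't find the source, but it has been proven mathematically that HTML cannot be parsed by regex in general: elements can nest arbitrarily deep, and regular expressions cannot match arbitrarily nested structures.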
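
To illustrate one way this goes wrong in practice, here is a minimal sketch. The pattern `NAIVE_IMG` and the sample string are made up for illustration; the point is only that a simple pattern stops at the first `>`, even when that character sits inside a quoted attribute value.

```python
import re

# Hypothetical naive pattern: "everything from <img up to the next >".
NAIVE_IMG = re.compile(r"<img[^>]*>")

# Made-up sample: the alt text legitimately contains a ">" character.
html = '<img src="a.png" alt="x > y"> trailing text'

match = NAIVE_IMG.search(html)
truncated = match.group(0)

# The match ends inside the alt attribute, so the tag is cut off.
print(truncated)  # <img src="a.png" alt="x >
```

A more elaborate pattern can handle this particular case, but each fix tends to uncover another edge case, which is why a real parser is the safer choice.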

You could use BeautifulSoup to parse HTML.

from bs4 import BeautifulSoup


def transform(element) -> dict[str, str | int]:
    """
    Transforms the attributes of the element.
    width and height are converted to int, if they exist.
    name of tag and stripped text of tag is also added to the dict.
    """
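    # Copy the attributes so the parsed element itself is not modified.
    attributes = dict(element.attrs)
    to_int = ("width", "height")
    for key in to_int:
        if key in attributes:
            # int() raises ValueError for non-numeric values such as "100%".
            attributes[key] = int(attributes[key])

    return attributes | {"text": element.text.strip(), "name": element.name}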


def get_img(html) -> list[dict[str, str | int]]:
    return [
        transform(element)
        for element in BeautifulSoup(html, "html.parser").find_all("img")
    ]


text = """
ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in"> The client has a few policies with SEFATE as supporting business. 100000COUTINHO AC Family Trust (501 Missouri)StaHorse Sel: 000 4633353000e pos: [email protected] Lid van:Quanta Primary Ltd NSB Nr: 7777 Quanta Primary Ltd is an Authorised Financial Service Provider.


ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="666" height="666" id="Picture_x0020_2" src="cid:[email protected]" alt="A different alternative description" />

Test after
"""

result = get_img(text)

import pprint

pprint.pprint(result, indent=4)

Output:

[   {   'alt': 'A picture containing text, screenshot, font, algebra\n'
               '\n'
               'Description automatically generated',
        'height': 162,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'style': 'width:9.1in; height:1.6833in',
        'text': '',
        'width': 874},
    {   'alt': 'A different alternative description',
        'height': 666,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'text': 'Test after',
        'width': 666}]
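
If installing BeautifulSoup is not an option, the standard library's `html.parser` module can probably do a similar job. The following is only a rough sketch: `ImgCollector` is a name made up for illustration, the sample HTML is invented, and the attribute values stay strings (no conversion of width and height to int).

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        # Self-closing tags like <img ... /> also end up here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgCollector()
collector.feed('<p>Before <img width="666" height="666" alt="x > y" /> after</p>')
images = collector.images

for image in images:
    print(image)
```

Unlike a simple regex, the parser keeps the `>` inside the quoted `alt` value as part of the attribute.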
