Python Forum
Regular Expression
#6
Regex is not a good tool for parsing HTML. I can't find the source, but it has been shown formally that HTML in general cannot be parsed with regular expressions: arbitrarily nested tags do not form a regular language.
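To make the problem concrete, here is a minimal sketch of one way a quick regex can go wrong. The sample HTML and the pattern are made up for illustration, and the `ImgCollector` class is a hypothetical helper; it uses the standard library's `html.parser` only to keep the example free of dependencies.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the alt text contains a ">" character.
html = '<p>Chart: <img alt="x > y" src="chart.png" width="10"></p>'

# Naive pattern (an assumption about what a quick regex might look like).
# It stops at the first ">", which here sits inside the alt attribute.
naive = re.search(r"<img[^>]*>", html).group(0)
print(naive)  # <img alt="x >


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgCollector()
collector.feed(html)
print(collector.images)
```

The regex returns a truncated tag, while the parser understands quoting and reports all three attributes. A smarter pattern could handle this particular case, but every such fix tends to uncover the next edge case.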

You could use BeautifulSoup to parse HTML.
from bs4 import BeautifulSoup


def transform(element) -> dict[str, str | int]:
    """
    Returns a copy of the element's attributes.
    width and height are converted to int, if they exist.
    The tag name and the stripped text of the tag are also added to the dict.
    """
    # Copy the attributes so the parsed tree itself is not modified.
    attributes = dict(element.attrs)
    to_int = ("width", "height")
    for key in to_int:
        if key in attributes:
            # int() raises ValueError for non-numeric values such as "100%".
            attributes[key] = int(attributes[key])

    return attributes | {"text": element.text.strip(), "name": element.name}


def get_img(html) -> list[dict[str, str | int]]:
    return [
        transform(element)
        for element in BeautifulSoup(html, "html.parser").find_all("img")
    ]


text = """
ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in"> The client has a few policies with SEFATE as supporting business. 100000COUTINHO AC Family Trust (501 Missouri)StaHorse Sel: 000 4633353000e pos: [email protected] Lid van:Quanta Primary Ltd NSB Nr: 7777 Quanta Primary Ltd is an Authorised Financial Service Provider.


ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="666" height="666" id="Picture_x0020_2" src="cid:[email protected]" alt="A different alternative description" />

Test after
"""

result = get_img(text)

import pprint

pprint.pprint(result, indent=4)
Output:
[   {   'alt': 'A picture containing text, screenshot, font, algebra\n'
               '\n'
               'Description automatically generated',
        'height': 162,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'style': 'width:9.1in; height:1.6833in',
        'text': '',
        'width': 874},
    {   'alt': 'A different alternative description',
        'height': 666,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'text': 'Test after',
        'width': 666}]
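One caveat: transform() passes width and height straight to int(), so a value like width="100%" would raise a ValueError. If your mails can contain such values, a more tolerant conversion might look like the sketch below. `convert_dimensions` is a hypothetical helper, not part of BeautifulSoup, and it works on a plain dict so it can be tried without any HTML.

```python
def convert_dimensions(attrs: dict, keys=("width", "height")) -> dict:
    """Return a copy of attrs with the given keys converted to int where possible."""
    result = dict(attrs)  # copy, so the caller's dict is left untouched
    for key in keys:
        if key not in result:
            continue
        try:
            result[key] = int(result[key])
        except (TypeError, ValueError):
            pass  # keep the original value, e.g. "100%"
    return result


# Made-up attribute values for illustration.
sample = {"width": "874", "height": "100%", "id": "Picture_x0020_2"}
converted = convert_dimensions(sample)
print(converted)
# {'width': 874, 'height': '100%', 'id': 'Picture_x0020_2'}
```

Whether to keep the raw string, drop the key, or raise is a design choice; keeping the original value just means no information is lost.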


Messages In This Thread
Regular Expression - by stahorse - Jul-05-2023, 06:28 AM
RE: Regular Expression - by Gribouillis - Jul-05-2023, 07:46 AM
RE: Regular Expression - by stahorse - Jul-05-2023, 09:21 AM
RE: Regular Expression - by Gribouillis - Jul-05-2023, 11:51 AM
RE: Regular Expression - by stahorse - Jul-05-2023, 12:10 PM
RE: Regular Expression - by DeaD_EyE - Jul-05-2023, 01:18 PM
RE: Regular Expression - by Will_Robertson - Jul-31-2023, 01:20 PM
RE: Regular Expression - by stahorse - Jul-05-2023, 02:46 PM
RE: Regular Expression - by stahorse - Jul-11-2023, 10:12 AM
RE: Regular Expression - by deanhystad - Jul-12-2023, 11:35 AM
