Jul-05-2023, 01:18 PM
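Regex is not a good tool for parsing HTML. I can't find the source, but it has been proven mathematically that HTML cannot be parsed by regular expressions: tags can nest to arbitrary depth, and a regular expression in the strict sense cannot keep track of that.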
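To make that concrete, here is a small sketch of how a naive pattern goes wrong even on a single tag. The HTML snippet and the class name ImgCollector are made up for illustration; the parser used is the standard library's html.parser module.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the alt text legitimately contains a '>' character.
html = '<img alt="a > b" src="pic.png" width="10">'

# Naive pattern: "an img tag is everything up to the first '>'".
naive = re.findall(r"<img[^>]*>", html)
print(naive)  # ['<img alt="a >']  -> the tag is cut off inside the alt text


class ImgCollector(HTMLParser):
    """Collects the attributes of every img tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgCollector()
collector.feed(html)
collector.close()
print(collector.images)  # [{'alt': 'a > b', 'src': 'pic.png', 'width': '10'}]
```

The pattern could be patched to handle quoted values, but every patch tends to expose the next special case, which is why a real parser is the safer choice.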
You could use BeautifulSoup to parse HTML.
import pprint

from bs4 import BeautifulSoup


def transform(element) -> dict[str, str | int]:
    """
    Transforms the attributes of the element.
    width and height are converted to int, if they exist.
    name of tag and stripped text of tag is also added to the dict.
    """
    attributes = element.attrs
    to_int = ("width", "height")

    for key in to_int:
        if key in attributes:
            attributes[key] = int(attributes[key])

    return attributes | {"text": element.text.strip(), "name": element.name}


def get_img(html) -> list[dict[str, str | int]]:
    return [
        transform(element)
        for element in BeautifulSoup(html, "html.parser").find_all("img")
    ]


text = """
ATT
Please have a look at this building’s premium. It looks to be a very high rate.
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in">
The client has a few policies with SEFATE as supporting business.
100000COUTINHO AC Family Trust (501 Missouri)StaHorse
Sel: 000 4633353000e pos: [email protected]
Lid van:Quanta Primary Ltd NSB Nr: 7777
Quanta Primary Ltd is an Authorised Financial Service Provider.
ATT
Please have a look at this building’s premium. It looks to be a very high rate.
<img width="666" height="666" id="Picture_x0020_2" src="cid:[email protected]" alt="A different alternative description" />
Test after
"""

result = get_img(text)
pprint.pprint(result, indent=4)
Output:
[   {   'alt': 'A picture containing text, screenshot, font, algebra\n'
               '\n'
               'Description automatically generated',
        'height': 162,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'style': 'width:9.1in; height:1.6833in',
        'text': '',
        'width': 874},
    {   'alt': 'A different alternative description',
        'height': 666,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'text': 'Test after',
        'width': 666}]
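One caveat about the int() conversion in transform: it raises ValueError if a width or height is not a plain number, for example width="100%". If your mails can contain such values, a more forgiving conversion might look like the sketch below. The helper names to_int_if_numeric and convert_dimensions are my own and not part of BeautifulSoup; the function works on any plain dict of attributes.

```python
def to_int_if_numeric(value):
    """Return int(value) if the conversion works, otherwise the value unchanged."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # e.g. "100%" or None: keep the original value
        return value


def convert_dimensions(attributes, keys=("width", "height")):
    """Return a new dict with the given keys converted to int where possible."""
    # A new dict is built, so the mapping passed in is not modified.
    return {
        key: to_int_if_numeric(value) if key in keys else value
        for key, value in attributes.items()
    }


example = {"width": "874", "height": "100%", "id": "Picture_x0020_2"}
print(convert_dimensions(example))
# {'width': 874, 'height': '100%', 'id': 'Picture_x0020_2'}
```

Inside transform you could then call convert_dimensions(element.attrs) instead of looping over the keys, with the side benefit that the attributes of the parsed element are left untouched.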
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!