Regular Expression

stahorse · Jul-05-2023, 06:28 AM

Hi,

I have this code below:

import re

text = """
ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in"> The client has a few policies with SEFATE as supporting business. 100000COUTINHO AC Family Trust (501 Missouri)StaHorse Sel: 000 4633353000e pos: [email protected] Lid van:Quanta Primary Ltd NSB Nr: 7777 Quanta Primary Ltd is an Authorised Financial Service Provider.
"""

pattern = re.compile(r'<img.+.+')
matches = pattern.finditer(text)

for match in matches:
    print(match)

Output:
<re.Match object; span=(90, 239), match='<img width="874" height="162" id="Picture_x0020_2>

I want to get this output:

Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in">

Something is missing in my regular expression but I can't get it right.

**Gribouillis** · Jul-05-2023, 07:46 AM

As per the documentation about match objects, try

print(match.group(0))
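
As a minimal sketch of the difference between printing the Match object and printing the matched text (the sample string below is made up and much shorter than the text in the question):

```python
import re

# Hypothetical sample text, much shorter than the e-mail body in the question.
sample = 'intro <img width="874" height="162"> outro'

match = re.search(r'<img[^>]*>', sample)

# Printing the Match object shows a (possibly truncated) repr, not the full text.
print(match)

# group(0) (or simply group()) returns the whole matched substring.
matched_text = match.group(0)
print(matched_text)

# span() gives the start and end indexes of the match in the original string.
start, end = match.span()
print(sample[start:end] == matched_text)
```

The repr is meant for debugging, which is why it cuts long matches short; group(0) is the value to work with.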

stahorse · Jul-05-2023, 09:21 AM

Thank you, I do get the first part of the output, but the second part is still outstanding which is on the second line:

Output:
Description automatically generated" style="width:9.1in; height:1.6833in">

**Gribouillis** · Jul-05-2023, 11:51 AM

(Jul-05-2023, 09:21 AM)stahorse Wrote: but the second part is still outstanding which is on the second line:

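That's because '.' in a regex matches every character except the newline, unless you pass the re.DOTALL flag. With a compiled pattern, the flag goes to re.compile(), because the second argument of pattern.finditer() is a start position, not flags:

pattern = re.compile(r'<img.+.+', re.DOTALL)
matches = pattern.finditer(text)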
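
A minimal sketch of what the flag changes, using a short made-up string rather than the text from the question. One caveat: for a compiled pattern the flag has to be given to re.compile(), because the second positional argument of pattern.search() and pattern.finditer() is a start position (pos), not flags.

```python
import re

# Hypothetical sample: an <img> tag whose alt text contains a newline.
sample = 'some text before <img alt="line one\nline two" width="10"> after'

# Without DOTALL, '.' stops at the newline, so the match ends mid-tag.
no_flag = re.compile(r'<img.+')
partial = no_flag.search(sample).group(0)
print(repr(partial))

# With DOTALL passed to re.compile(), '.' also matches the newline.
with_flag = re.compile(r'<img.+', re.DOTALL)
full = with_flag.search(sample).group(0)
print(repr(full))

# Pitfall: on a compiled pattern the second argument is `pos`, not flags.
# re.DOTALL has the integer value 16, so this only moves the search start
# to index 16 and leaves the behaviour of '.' unchanged.
misplaced = no_flag.search(sample, re.DOTALL).group(0)
print(misplaced == partial)
```

Note that the greedy pattern with DOTALL runs to the end of the string, so the flag alone is not enough to stop at the closing ">".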

stahorse · Jul-05-2023, 12:10 PM

pattern = re.compile(r'<img.+.+')
matches = pattern.finditer(text, re.DOTALL)

for match in matches:
    print(match.group(0))

Still no luck. I get the same partial output, not all of it.

I'm getting:

Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

And I'm looking for this:

Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in">

DeaD_EyE · Jul-05-2023, 01:18 PM

Regex is not a good tool to parse HTML. I can't find the source, but it has been shown mathematically that HTML cannot be parsed by regular expressions in general.

You could use BeautifulSoup to parse HTML.

from bs4 import BeautifulSoup


def transform(element) -> dict[str, str | int]:
    """
    Transforms the attributes of the element.
    width and height are converted to int, if they exist.
    name of tag and stripped text of tag is also added to the dict.
    """
    attributes = element.attrs
    to_int = ("width", "height")
    for key in to_int:
        if key in attributes:
            attributes[key] = int(attributes[key])

    return attributes | {"text": element.text.strip(), "name": element.name}


def get_img(html) -> list[dict[str, str | int]]:
    return [
        transform(element)
        for element in BeautifulSoup(html, "html.parser").find_all("img")
    ]


text = """
ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in"> The client has a few policies with SEFATE as supporting business. 100000COUTINHO AC Family Trust (501 Missouri)StaHorse Sel: 000 4633353000e pos: [email protected] Lid van:Quanta Primary Ltd NSB Nr: 7777 Quanta Primary Ltd is an Authorised Financial Service Provider.


ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="666" height="666" id="Picture_x0020_2" src="cid:[email protected]" alt="A different alternative description" />

Test after
"""

result = get_img(text)



import pprint
pprint.pprint(result, indent=4)

Output:
[   {   'alt': 'A picture containing text, screenshot, font, algebra\n'
               '\n'
               'Description automatically generated',
        'height': 162,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'style': 'width:9.1in; height:1.6833in',
        'text': '',
        'width': 874},
    {   'alt': 'A different alternative description',
        'height': 666,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'text': 'Test after',
        'width': 666}]

stahorse · (This post was last modified: Jul-05-2023, 02:46 PM by stahorse.)

Thank you for all the replies, I appreciate them. I will try BeautifulSoup too.

I got it by doing this:

pattern = re.compile(r'<(img.+.+)>', re.DOTALL)
matches = pattern.finditer(text)

for match in matches:
    print(match.group(0))

And this is the output I get, which is correct.

Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in">

stahorse · Jul-11-2023, 10:12 AM

Hi,

My code is now working:

pattern = re.compile(r'<(img.+.+)>', re.DOTALL)
matches = pattern.finditer(text)
 
for match in matches:
    print(match.group(0))

But I'm trying to replace the match.group() with '', but I get an error.

This is what I tried:

matches = pattern.finditer(new_file)
for match in matches:
  results = match.group().sub('', matches)
  print(results)

**deanhystad** · (This post was last modified: Jul-12-2023, 03:04 PM by deanhystad.)

Quote:My code is now working:

I don't think it is. Your pattern doesn't make sense. What do you think ".+.+" means? It is just a roundabout way of saying "two or more characters". Beyond that, the pattern is also wrong because it is greedy: ".+.+" will match all characters between the first "<img" and the last ">". Look what happens when there are two img tags in the text.

import re

text = """<img id="Picture_1"> <img id="Picture_2">"""
pattern = re.compile(r"<(img.+.+)>", re.DOTALL)

for match in pattern.finditer(text):
    print(match.group(0))

Output:
<img id="Picture_1"> <img id="Picture_2">

Only one match. If there were two matches, the output would have two lines. Both img tags were returned as a single match.

I think your pattern should be "<img.*?>". The "?" makes ".*" non-greedy, so it matches as little as possible and stops at the next ">", not the last.

import re

text = """<img id="Picture_1"> <img id="Picture_2">"""
pattern = re.compile(r"<(img.*?)>", re.DOTALL)

for match in pattern.finditer(text):
    print(match.group(0))

Output:
<img id="Picture_1">
<img id="Picture_2">

This returns two matches. Notice they are not printed on the same line like before.
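
Another common way to stop at the first ">" is a negated character class, "[^>]*". This is only a sketch on a made-up string, not something from the posts above. Since "[^>]" also matches newlines, re.DOTALL is not needed. Like the non-greedy version, it assumes that no attribute value contains a literal ">".

```python
import re

# Hypothetical sample: two tags, the second one spread over two lines.
sample = '<img id="Picture_1"> text\n<img id="Picture_2"\nalt="two lines">'

# Non-greedy version: needs DOTALL so that '.' can cross the newline.
lazy = re.compile(r'<img.*?>', re.DOTALL)

# Negated character class: "anything that is not '>'", newlines included.
negated = re.compile(r'<img[^>]*>')

lazy_matches = lazy.findall(sample)
negated_matches = negated.findall(sample)

for tag in negated_matches:
    print(repr(tag))

# Both patterns find the same two tags in this sample.
print(lazy_matches == negated_matches)
```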

What do you mean by this?

Quote:But I'm trying to replace the match.group() with '', but I get an error

Replace the match.group() where? In the match? In the file text?

If you want to replace <img whatever> with <>, you could do this:

text = re.sub("<img.*?>", "<>", new_file.read(), flags=re.DOTALL)
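
If the goal is to remove the tag completely, the replacement can be an empty string. Here is a minimal sketch, assuming the text is already a str (sample is a made-up stand-in for new_file). The attempt in the question fails because match.group() returns a plain str, and str has no sub() method. sub() belongs to the compiled pattern (or to the re module).

```python
import re

# Hypothetical stand-in for the real text; the tag spans two lines.
sample = 'Please look. <img width="1"\nalt="x"> The client has policies.'

pattern = re.compile(r'<img.*?>', re.DOTALL)

# sub() is called on the pattern and returns a new string; the input is unchanged.
cleaned = pattern.sub('', sample)
print(cleaned)

# subn() additionally reports how many replacements were made.
cleaned_again, count = pattern.subn('', sample)
print(count)
```

There is no need to loop over finditer() first, since sub() replaces every non-overlapping match in one call.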

Will_Robertson · Jul-31-2023, 01:20 PM

(Jul-05-2023, 01:18 PM)DeaD_EyE Wrote: Regex is not a good tool to parse HTML. I can't find the source, but it has been shown mathematically that HTML cannot be parsed by regular expressions in general.

You could use BeautifulSoup to parse HTML.

Yes, I definitely agree. BeautifulSoup is much better than regex for parsing HTML. In theory and in practice, attempting to parse HTML with regex tends to get bogged down in a lot of problems.
