Python Forum
Regular Expression
#1
Hi,

I have this code below:

import re

text = """
ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in"> The client has a few policies with SEFATE as supporting business. 100000COUTINHO AC Family Trust (501 Missouri)StaHorse Sel: 000 4633353000e pos: [email protected] Lid van:Quanta Primary Ltd NSB Nr: 7777 Quanta Primary Ltd is an Authorised Financial Service Provider.
"""

pattern = re.compile(r'<img.+.+')
matches = pattern.finditer(text)

for match in matches:
    print(match)
Output:
<re.Match object; span=(90, 239), match='<img width="874" height="162" id="Picture_x0020_2>
I want to get this output:

Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra Description automatically generated" style="width:9.1in; height:1.6833in">
Something is missing in my regular expression but I can't get it right.
Reply
#2
As per the documentation about match objects, try
print(match.group(0))
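A minimal sketch of the difference, using a made-up one-line sample string rather than the original email text: printing the match object shows its repr (which may truncate the matched text), while group(0) returns the whole matched string.

```python
import re

# Hypothetical sample text, shortened for illustration.
text = 'before <img id="Picture_1"> after'
match = re.search(r"<img.+>", text)

# Printing the object itself shows a summary such as <re.Match object; ...>.
print(match)

# group(0) is the complete text matched by the pattern.
whole = match.group(0)
print(whole)
```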
Reply
#3
Thank you. I now get the first part of the output, but the second part, which is on the second line, is still missing:
Output:
Description automatically generated" style="width:9.1in; height:1.6833in">
Reply
#4
(Jul-05-2023, 09:21 AM)stahorse Wrote: but the second part is still outstanding which is on the second line:
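That's because '.' in regexes matches every character except the newline, unless you pass the re.DOTALL flag. Pass the flag to re.compile(); the second argument of a compiled pattern's finditer() is a start position, not flags.
pattern = re.compile(r'<img.+.+', re.DOTALL)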
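A small sketch of what the flag changes, assuming a shortened sample string with a newline inside the alt attribute. Here the flag is given to the module-level re.search(), whose third argument really is flags.

```python
import re

# Hypothetical sample: the tag spans several lines, like the one in the email.
text = 'x <img alt="line one\n\nline two" id="p"> y'

# Without DOTALL, '.' stops at the newline, so no '>' can be reached.
without_flag = re.search(r"<img.+>", text)

# With DOTALL, '.' also matches newlines and the whole tag is found.
with_flag = re.search(r"<img.+>", text, re.DOTALL)

print(without_flag)
print(with_flag.group(0))
```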
Reply
#5
pattern = re.compile(r'<img.+.+')
matches = pattern.finditer(text, re.DOTALL)

for match in matches:
    print(match.group(0))
Still nothing. I get only the same part of the output, not all of it.

I'm getting:
Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra
And I'm looking for this:
Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra Description automatically generated" style="width:9.1in; height:1.6833in">
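A likely reason the call above changes nothing: on a compiled pattern, finditer(string, pos, endpos) takes a start position as its second argument, so re.DOTALL is read as the integer 16 and the search simply starts at index 16. The sketch below uses a made-up sample string to compare that call with compiling the flag into the pattern.

```python
import re

# Hypothetical sample; the tag starts after index 16 and contains a newline.
text = 'Some leading words here <img alt="line one\nline two">'

# Flag passed to finditer(): treated as pos=16, '.' still stops at newlines.
plain = re.compile(r"<img.+.+")
wrong = [m.group(0) for m in plain.finditer(text, re.DOTALL)]

# Flag passed to re.compile(): '.' now matches newlines as well.
dotall = re.compile(r"<img.+.+", re.DOTALL)
fixed = [m.group(0) for m in dotall.finditer(text)]

print(wrong)
print(fixed)
```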
Reply
#6
Regex is not a good tool for parsing HTML. I can't find the source, but someone proved mathematically that HTML cannot be parsed with regex.

You could use BeautifulSoup to parse HTML.
from bs4 import BeautifulSoup


def transform(element) -> dict[str, str | int]:
    """
    Transforms the attributes of the element.
    width and height are converted to int, if they exist.
    name of tag and stripped text of tag is also added to the dict.
    """
    attributes = element.attrs
    to_int = ("width", "height")
    for key in to_int:
        if key in attributes:
            attributes[key] = int(attributes[key])

    return attributes | {"text": element.text.strip(), "name": element.name}


def get_img(html) -> list[dict[str, str | int]]:
    return [
        transform(element)
        for element in BeautifulSoup(html, "html.parser").find_all("img")
    ]


text = """
ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra

Description automatically generated" style="width:9.1in; height:1.6833in"> The client has a few policies with SEFATE as supporting business. 100000COUTINHO AC Family Trust (501 Missouri)StaHorse Sel: 000 4633353000e pos: [email protected] Lid van:Quanta Primary Ltd NSB Nr: 7777 Quanta Primary Ltd is an Authorised Financial Service Provider.


ATT Please have a look at this building’s premium. It looks to be a very high rate. <img width="666" height="666" id="Picture_x0020_2" src="cid:[email protected]" alt="A different alternative description" />

Test after
"""

result = get_img(text)



import pprint
pprint.pprint(result, indent=4)
Output:
[   {   'alt': 'A picture containing text, screenshot, font, algebra\n'
               '\n'
               'Description automatically generated',
        'height': 162,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'style': 'width:9.1in; height:1.6833in',
        'text': '',
        'width': 874},
    {   'alt': 'A different alternative description',
        'height': 666,
        'id': 'Picture_x0020_2',
        'name': 'img',
        'src': 'cid:[email protected]',
        'text': 'Test after',
        'width': 666}]
Reply
#7
Thank you for all the replies, I appreciate them. I will try BeautifulSoup too.

I got it by doing this:

pattern = re.compile(r'<(img.+.+)>', re.DOTALL)
matches = pattern.finditer(text)

for match in matches:
    print(match.group(0))
And this is the output I get, which is correct.

Output:
<img width="874" height="162" id="Picture_x0020_2" src="cid:[email protected]" alt="A picture containing text, screenshot, font, algebra Description automatically generated" style="width:9.1in; height:1.6833in">
Reply
#8
Hi,

My code is now working:

pattern = re.compile(r'<(img.+.+)>', re.DOTALL)
matches = pattern.finditer(text)
 
for match in matches:
    print(match.group(0))
But I'm trying to replace the match.group() with '', but I get an error.

This is what I tried:
matches = pattern.finditer(new_file)
for match in matches:
  results = match.group().sub('', matches)
  print(results)
Reply
#9
Quote:My code is now working:
I don't think it is. Your pattern doesn't make sense. What do you think ".+.+" means? It is just "two or more of any character", so the second ".+" adds nothing. Beyond that, the pattern is wrong because it is greedy: ".+.+" matches everything between the first "<img" and the last ">". Look at what happens when there are two img tags in the text.
import re

text = """<img id="Picture_1"> <img id="Picture_2">"""
pattern = re.compile(r"<(img.+.+)>", re.DOTALL)

for match in pattern.finditer(text):
    print(match.group(0))
Output:
<img id="Picture_1"> <img id="Picture_2">
Only 1 match. If there were two matches the output would have two lines. Both of the "matches" were returned as a single match.

I think your pattern should be "<img.*?>". The "?" tells regex to match anything until the next ">", not the last.
import re

text = """<img id="Picture_1"> <img id="Picture_2">"""
pattern = re.compile(r"<(img.*?)>", re.DOTALL)

for match in pattern.finditer(text):
    print(match.group(0))
Output:
<img id="Picture_1">
<img id="Picture_2">
This returns two matches. Notice they are not printed on the same line like before.
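As a side note, a negated character class is a common alternative to the lazy quantifier; this is a different technique from the one above, shown only as a sketch on a made-up sample. Because [^>] also matches newlines, it does not need re.DOTALL. Both variants still break if an attribute value itself contains ">".

```python
import re

# Hypothetical sample with two tags, the second spanning two lines.
text = '<img id="Picture_1"> <img id="Picture_2"\nalt="two lines">'

# Lazy quantifier: stop at the next '>', needs DOTALL to cross newlines.
lazy = re.findall(r"<img.*?>", text, flags=re.DOTALL)

# Negated class: match anything that is not '>', newlines included.
negated = re.findall(r"<img[^>]*>", text)

for tag in negated:
    print(repr(tag))
```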

What do you mean by this?
Quote:But I'm trying to replace the match.group() with '', but I get an error
Replace the match.group() where? In the match? In the file text?

If you want to replace <img whatever> with <>, you could do this:
text = re.sub("<img.*?>", "<>", new_file.read(), flags=re.DOTALL)
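The attempt in the question most likely fails because match.group() returns a plain str, and str has no sub() method; substitution is done by the pattern (or by re.sub) on the whole text. A minimal sketch, assuming the goal is to delete the img tags entirely and using a made-up sample string in place of the file contents:

```python
import re

pattern = re.compile(r"<img.*?>", re.DOTALL)

# Hypothetical sample standing in for the text read from the file.
text = 'Intro <img id="Picture_1"> middle <img id="Picture_2"\nalt="x"> end'

# sub() returns a new string with every match replaced by ''.
cleaned = pattern.sub("", text)

# subn() does the same and also reports how many replacements were made.
cleaned_again, count = pattern.subn("", text)

print(cleaned)
print(count)
```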
Reply
#10
(Jul-05-2023, 01:18 PM)DeaD_EyE Wrote: Regex is not a good tool for parsing HTML. I can't find the source, but someone proved mathematically that HTML cannot be parsed with regex.

You could use BeautifulSoup to parse HTML.

Yes, I definitely agree. BeautifulSoup is much better than regex for parsing HTML. In theory and in practice, attempts to parse HTML with regex tend to get bogged down in a lot of problems.
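If installing BeautifulSoup is not an option, the standard library's html.parser can do a similar job. The sketch below is a swapped-in alternative, not the approach from the earlier post; the class name ImgCollector and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            self.images.append(dict(attrs))


# Hypothetical sample: one multi-line tag and one self-closing tag.
sample = '''Before <img width="874" id="Picture_1" alt="first line

second line" style="width:9.1in"> middle <img id="Picture_2" alt="other" /> after'''

collector = ImgCollector()
collector.feed(sample)
collector.close()
images = collector.images

for image in images:
    print(image)
```

The parser handles the newline inside the alt attribute without any flags, which is the case that caused trouble for the regex.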
Reply

