Python Forum

Regex: a string does not start and end with the same character
Hello. I have a small problem with Python and regex. Some of my regexes don't work when the string does not start and end with the same character. For example, I have patterns for these HTML tags:

pattern1 = r'<p class="text_obisnuit">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern2 = r'<p class="text_obisnuit2">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern3 = r'<title>.*((bebe|oana|mother|sun).*){3,}.*</title>'
pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'
The first 3 patterns work, because the string starts and ends with the same character.

But the 4th pattern, the one for the meta description, does not work. Python cannot find anything with that regex.
What are you trying to do here?
Regex and HTML are not best friends (see the famous post).
That's why parsers (BS, lxml) exist to deal with this.
Yes, I am using BS:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
from googletrans import Translator
import requests
import re
But the first 3 patterns work with regex, and the 4th doesn't... I don't know why.
(Jul-03-2021, 08:59 PM)Melcu54 Wrote: But the first 3 patterns work with regex, and the 4th doesn't... I don't know why.
You must explain more about the task, such as the input and the wanted output.
Code that can be run and tested always helps a lot.
So, I have an HTML file with these 4 HTML tags:

Quote:<p class="text_obisnuit">Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programatically?</p>,
<p class="text_obisnuit2">At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.</p>,
<title>It's true that when programming it's usually best to use dedicated parsers</title>
<meta name="description" content=" I only wrote bebe my class when the XML parsers proved unable to withstand real oana use. Religious downvoting just prevents useful answers from being posted - keep things within mother perspective of the question, please."/>

My code must find and translate only those tags that contain at least 3 of the keywords I put in the regex. In the example above, the meta description tag contains 3 keywords that are also in the regex: bebe, oana and mother. The first 3 regexes work, I tested them; only the 4th regex is skipped by Python. I don't know why, but I believe it is because the regex must start and end with the same string. For example, for the title tag, the regex starts with <title> and ends with </title>.
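The "at least 3 keywords" rule can also be checked without one large pattern. The following is only a sketch: the helper names `count_keywords` and `has_enough_keywords` are made up for illustration, and the `\b` word boundaries are an added assumption (the patterns in this thread do not use them), so that "sun" is not counted inside a longer word.

```python
import re

# Keyword list taken from the patterns in this thread
KEYWORDS = ("bebe", "oana", "mother", "sun")

# Assumption: whole-word, case-sensitive matching
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\b")


def count_keywords(text):
    """Return the number of keyword occurrences in text (repeats count)."""
    return len(KEYWORD_RE.findall(text))


def has_enough_keywords(text, minimum=3):
    """True when text contains at least `minimum` keyword occurrences."""
    return count_keywords(text) >= minimum


description = (
    " I only wrote bebe my class when the XML parsers proved unable to "
    "withstand real oana use. Religious downvoting just prevents useful "
    "answers from being posted - keep things within mother perspective "
    "of the question, please."
)
print(count_keywords(description), has_enough_keywords(description))
```

Like the `{3,}` quantifier in the original patterns, this counts occurrences, so the same keyword appearing three times would also qualify.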

But for my meta description tag, the regex starts with <meta... and ends with >. If it had ended with a closing meta tag it would have worked, but it cannot end with one.


import os
import re

import requests
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
from googletrans import Translator

translator = Translator()


class UnsortedAttributes(HTMLFormatter):
    # Output attributes in their original order instead of sorted order
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v


# Folder that contains the HTML files to process
files_from_folder = r"e:\Folder3"

use_translate_folder = False

# Target language code passed to the translator
destination_language = 'af'

extension_file = ".html"

# Each pattern matches one kind of tag containing at least 3 keyword occurrences
pattern1 = r'<p class="text_obisnuit">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern2 = r'<p class="text_obisnuit2">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern3 = r'<title>.*((bebe|oana|mother|sun).*){3,}.*</title>'
pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'

patterns = [pattern1, pattern2, pattern3, pattern4]
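A quick check, assuming the sample meta tag from the post sits on a single line in the file, suggests that pattern4 itself does match it, so the regex may not be the part that fails. This is only a sketch; `meta_tag` is a name introduced here for illustration.

```python
import re

# pattern4 copied from the code above
pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'

# Sample tag from the thread, assumed to be on one line
# ('.' does not match newlines unless re.DOTALL is passed).
meta_tag = (
    '<meta name="description" content=" I only wrote bebe my class when '
    'the XML parsers proved unable to withstand real oana use. Religious '
    'downvoting just prevents useful answers from being posted - keep '
    'things within mother perspective of the question, please."/>'
)

match = re.search(pattern4, meta_tag)
print(match is not None)
```

If the tag were split over several lines in the real file, the pattern would need `re.DOTALL` (or a parser) to match.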
I found the solution. The code was trying to translate the inner content of the matched tags, but the content attribute of the meta tag isn't inner content. I had to add a separate check to see whether the match was a meta tag, and then translate the content attribute specifically under that check.

So, after those regexes, I should add this code:

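# `page`, `updated` and `recursively_translate` come from the rest of the script (not shown here)
for pattern in patterns:
    for x in re.finditer(pattern, page):
        updated = True
        new = x.group(0)
        soup = BeautifulSoup(new, 'html.parser')
        if pattern != pattern4:
            # p and title tags: translate the text between the tags
            recursively_translate(soup)
        else:
            # meta tag: the text lives in the content attribute, not between tags
            meta = soup.find('meta')
            meta['content'] = translator.translate(meta['content'], dest=destination_language).text
        # Serialize without reordering the attributes
        soup = soup.encode(formatter=UnsortedAttributes()).decode('utf-8')
        page = page.replace(new, soup)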
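The difference between inner content and an attribute value can be seen in a small sketch. It uses the standard library's html.parser rather than BeautifulSoup, only to keep the example dependency-free; the class name `TextAndMetaCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextAndMetaCollector(HTMLParser):
    """Collect inner text and the meta description attribute separately."""

    def __init__(self):
        super().__init__()
        self.inner_text = []          # text found between tags
        self.meta_description = None  # value of content="..." on the meta tag

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.meta_description = attrs.get("content")

    def handle_data(self, data):
        # Only text between tags arrives here; attribute values never do
        if data.strip():
            self.inner_text.append(data.strip())


collector = TextAndMetaCollector()
collector.feed(
    '<p class="text_obisnuit">bebe oana sun</p>'
    '<meta name="description" content="bebe oana mother"/>'
)
collector.close()
print(collector.inner_text)
print(collector.meta_description)
```

The paragraph text shows up as inner text, while the meta description is only reachable through the tag's attributes, which is why code that translates inner text alone leaves the meta tag unchanged.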