Regex: a string that does not start and end with the same character

Melcu54 · (This post was last modified: Jul-04-2021, 05:00 AM by Melcu54.)

Hello. I have a small problem with Python and regex. Some of my patterns don't seem to work when the pattern does not start and end with the same string. For example, I have these patterns for HTML tags:

pattern1 = r'<p class="text_obisnuit">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern2 = r'<p class="text_obisnuit2">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern3 = r'<title>.*((bebe|oana|mother|sun).*){3,}.*</title>'
pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'

The first 3 patterns work, as far as I can tell because each one starts and ends with the same tag.

But the 4th pattern, the one for the meta description, does not work. Python cannot find anything with that regex.

***snippsat*** · (This post was last modified: Jul-03-2021, 08:53 PM by snippsat.)

What are you trying to do here?
Regex and HTML are not best friends (see the famous post).
That's why parsers (BS, lxml) exist to deal with this.
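To illustrate the parser route, here is a minimal sketch that counts keywords in tag text without any regex over the markup. It uses the standard library's `html.parser` as a dependency-free stand-in for BS or lxml. The class name `TextCollector`, the sample HTML, and the threshold of 3 are assumptions made for the example, not code from this thread.

```python
import re
from html.parser import HTMLParser

# The keywords come from the patterns in the question.
KEYWORDS = re.compile(r"bebe|oana|mother|sun")


class TextCollector(HTMLParser):
    """Collect (tag, text) pairs for <p>, <title> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.found = []          # (tag, text) pairs in document order
        self.current_tag = None  # <p> or <title> currently being read
        self.chunks = []         # text pieces of the current tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            # For <meta>, the text is stored in the content attribute.
            self.found.append((tag, attrs.get("content") or ""))
        elif tag in ("p", "title"):
            self.current_tag = tag
            self.chunks = []

    def handle_data(self, data):
        if self.current_tag:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.found.append((tag, "".join(self.chunks)))
            self.current_tag = None


# Hypothetical sample input, made up for this sketch.
html = (
    '<title>bebe and oana under the sun</title>\n'
    '<p class="text_obisnuit">Only mother is mentioned here.</p>\n'
    '<meta name="description" content="bebe, oana and mother"/>\n'
)

collector = TextCollector()
collector.feed(html)
collector.close()

# Keep only the tags whose text holds at least 3 keyword occurrences.
matches = [(tag, text) for tag, text in collector.found
           if len(KEYWORDS.findall(text)) >= 3]
print([tag for tag, _ in matches])
```

The same idea should carry over to BS: find the tags first, then run the keyword count on the extracted text only.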

Melcu54 · Jul-03-2021, 08:59 PM

Yes, I am using BS:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
from googletrans import Translator
import requests
import re

But the first 3 patterns work with regex, and the 4th one doesn't... I don't know why.

***snippsat*** · (This post was last modified: Jul-03-2021, 11:05 PM by snippsat.)

(Jul-03-2021, 08:59 PM)Melcu54 Wrote: But the first 3 patterns work with regex, and the 4th one doesn't... I don't know why.

You must explain more about what the task is, such as the input and the wanted output.
Code that can be run/tested always helps a lot.

Melcu54 · Jul-04-2021, 05:18 AM

So, I have an HTML file with these 4 HTML tags:

Quote:<p class="text_obisnuit">Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programatically?</p>,
<p class="text_obisnuit2">At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.</p>,
<title>It's true that when programming it's usually best to use dedicated parsers</title>
<meta name="description" content=" I only wrote bebe my class when the XML parsers proved unable to withstand real oana use. Religious downvoting just prevents useful answers from being posted - keep things within mother perspective of the question, please."/>

My code must find and translate only those tags that contain at least 3 of the keywords I put in the regex. In the example above, the meta description tag contains 3 keywords that are also in the regex: bebe|oana|mother. The first 3 patterns work, I tested them, but the 4th pattern is skipped by Python. I don't know why, but I believe it is because the regex must start and end with the same string. For example, for the title tag, the regex starts with <title> and ends with </title>.

But for my meta description tag, the regex starts with <meta... and ends with >. If it had also ended with meta it would have worked, but it cannot end with

import os
import re

import requests
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
from googletrans import Translator

translator = Translator()

# Formatter that writes attributes in their original order instead of sorting them.
class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v

files_from_folder = r"e:\Folder3"

use_translate_folder = False

destination_language = 'af'  # language code passed to the translator as dest

extension_file = ".html"

# Each pattern should match a tag that contains at least 3 keyword occurrences.
pattern1 = r'<p class="text_obisnuit">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern2 = r'<p class="text_obisnuit2">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern3 = r'<title>.*((bebe|oana|mother|sun).*){3,}.*</title>'
pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'

patterns = [pattern1, pattern2, pattern3, pattern4]
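One quick way to test the "same start and end" theory is to run pattern4 on its own against the meta tag quoted above. This is only a sketch of such a check. The variable name `tag` is an assumption, and the tag is written here as a single line, as in the quote.

```python
import re

pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'

# The meta tag from the quoted HTML, kept on a single line.
tag = ('<meta name="description" content=" I only wrote bebe my class when the XML '
       'parsers proved unable to withstand real oana use. Religious downvoting just '
       'prevents useful answers from being posted - keep things within mother '
       'perspective of the question, please."/>')

match = re.search(pattern4, tag)

# Did the pattern find the tag at all?
print(match is not None)
# Does the match cover the whole tag?
print(match is not None and match.group(0) == tag)
```

If both checks come out true, the different start and end of the pattern is not what blocks the match, and the cause is more likely in the code that processes the match afterwards. Note that `.` does not match a newline by default, so a tag spread over several lines would need `re.DOTALL`.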

Melcu54 · Jul-04-2021, 07:51 PM

I found the solution. The code was trying to translate all the inner contents of the tags, but the content part of the meta tag isn't inner content, it is an attribute. I had to add a separate check to see if the match was a meta tag, and then do the translation specifically for the meta tag under that check.

So, after those regexes, I should add this code:

for pattern in patterns:
    for x in re.finditer(pattern, page):
        updated = True
        new = x.group(0)
        soup = BeautifulSoup(new, 'html.parser')
        if pattern != pattern4:
            # <p> and <title>: translate the inner text of the tag
            recursively_translate(soup)
        else:
            # <meta>: the text is in the content attribute, not inside the tag
            meta = soup.find('meta')
            meta['content'] = translator.translate(meta['content'], dest=destination_language).text
        soup = soup.encode(formatter=UnsortedAttributes()).decode('utf-8')
        page = page.replace(new, soup)
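The difference between inner content and attribute content can be seen in a small sketch. It uses the standard library's `html.parser` instead of BS so that it runs without extra packages. The names `InnerTextAndAttrs` and `split_tag` and the sample tags are assumptions made for the example.

```python
from html.parser import HTMLParser


class InnerTextAndAttrs(HTMLParser):
    """Record the text between tags separately from <meta> attribute values."""

    def __init__(self):
        super().__init__()
        self.inner_text = []   # text found between tags
        self.attr_values = {}  # attributes of any <meta> tag

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.attr_values.update(dict(attrs))

    def handle_data(self, data):
        self.inner_text.append(data)


def split_tag(html):
    """Return (inner text, meta attributes) for a small HTML fragment."""
    parser = InnerTextAndAttrs()
    parser.feed(html)
    parser.close()
    return "".join(parser.inner_text), parser.attr_values


title_text, title_attrs = split_tag("<title>bebe oana mother</title>")
meta_text, meta_attrs = split_tag(
    '<meta name="description" content="bebe oana mother"/>'
)

print(repr(title_text), title_attrs)
print(repr(meta_text), meta_attrs)
```

A function that only walks the inner text has nothing to translate in the meta tag, which is consistent with the separate `meta['content']` branch above.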
