Python Forum
Regex: a string that does not start and end with the same character
#1
Hello. I have a little problem with Python and regex. Some of my regexes don't work if the string does not start and end with the same character. For example, I have these patterns for HTML tags:

pattern1 = r'<p class="text_obisnuit">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern2 = r'<p class="text_obisnuit2">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern3 = r'<title>.*((bebe|oana|mother|sun).*){3,}.*</title>'
pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'
The first 3 patterns work, because the string starts and ends with the same character.

But the 4th pattern, the one with the meta description, does not work. Python cannot find anything with that regex.
#2
What are you trying to do here?
Regex and HTML are not best friends (see the famous post).
That's why parsers (BS, lxml) exist to deal with this.
#3
Yes, I am using BS:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
from googletrans import Translator
import requests
import re
But the first 3 patterns work with regex, and the 4th doesn't... I don't know why.
#4
(Jul-03-2021, 08:59 PM)Melcu54 Wrote: But the first 3 tags are working with regex, and the 4 tags doesn't...I don't know why.
You must explain more about what the task is, such as the input and the wanted output.
Code that can be run and tested always helps a lot.
#5
So, I have an HTML file with these 4 HTML tags:

Quote:<p class="text_obisnuit">Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programatically?</p>,
<p class="text_obisnuit2">At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.</p>,
<title>It's true that when programming it's usually best to use dedicated parsers</title>
<meta name="description" content=" I only wrote bebe my class when the XML parsers proved unable to withstand real oana use. Religious downvoting just prevents useful answers from being posted - keep things within mother perspective of the question, please."/>

My code must find and translate only those tags that contain at least 3 of the keywords I put in the regex. In the example above, the meta description tag contains 3 keywords that are also in the regex: bebe, oana, mother. The first 3 patterns work, I tested them, but the 4th one is skipped by Python. I don't know why, but I believe it is because the regex must start and end with the same string. For example, for the title tag, the regex starts with <title> and ends with </title>.

But for my meta description tag, the regex starts with <meta... and ends with >. If it had ended with a closing meta tag, I think it would have worked, but it cannot end that way.
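The "at least 3 keywords" requirement described above can also be checked by counting matches directly, without the nested `.*` quantifiers. This is only a minimal sketch: the function name `has_enough_keywords` and the `minimum` parameter are hypothetical and not part of the original script. Like the patterns in this thread, it counts occurrences (the same keyword three times also qualifies) and it matches keywords inside longer words.

```python
import re

# Same keyword alternation as in the thread's patterns.
KEYWORDS = re.compile(r'bebe|oana|mother|sun')


def has_enough_keywords(text, minimum=3):
    """Return True if text contains at least `minimum` keyword occurrences.

    Hypothetical helper: counts occurrences, not distinct keywords.
    """
    return len(KEYWORDS.findall(text)) >= minimum


sample = "I only wrote bebe my class for real oana use, within mother perspective."
print(has_enough_keywords(sample))
print(has_enough_keywords("only bebe and oana here"))
```

A check like this could be applied to a tag's text or to an attribute value once the tag has been located by a parser.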


import os
import re

import requests
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
from googletrans import Translator

translator = Translator()


class UnsortedAttributes(HTMLFormatter):
    # Keep attributes in their original order instead of sorting them on output.
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v


files_from_folder = r"e:\Folder3"
use_translate_folder = False
destination_language = 'af'
extension_file = ".html"

# Each pattern matches one kind of tag containing at least 3 keyword occurrences.
pattern1 = r'<p class="text_obisnuit">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern2 = r'<p class="text_obisnuit2">.*((bebe|oana|mother|sun).*){3,}.*</p>'
pattern3 = r'<title>.*((bebe|oana|mother|sun).*){3,}.*</title>'
pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'

patterns = [pattern1, pattern2, pattern3, pattern4]
#6
I found the solution. The code was trying to translate the inner contents of the matched tags, but the content part of the meta tag isn't inner content; it is an attribute. I had to add a separate check to see whether the match was a meta tag, and then translate the content attribute specifically under that check.

So, after those regex patterns, I should add this code:

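# Note: page, updated and recursively_translate come from the rest of the script (not shown here).
for pattern in patterns:
    for x in re.finditer(pattern, page):
        updated = True
        new = x.group(0)
        soup = BeautifulSoup(new, 'html.parser')
        if pattern != pattern4:
            # Regular tags: translate the inner text.
            recursively_translate(soup)
        else:
            # Meta tag: the text lives in the content attribute, not inside the tag.
            meta = soup.find('meta')
            meta['content'] = translator.translate(meta['content'], dest=destination_language).text
        # Serialize back to a string, keeping the original attribute order.
        soup = soup.encode(formatter=UnsortedAttributes()).decode('utf-8')
        page = page.replace(new, soup)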
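To illustrate why the meta tag needed separate handling, here is a small check using only the standard library (`re` and `html.parser`), so it runs without BeautifulSoup or googletrans. It is a sketch under assumptions, not part of the original script: the class name `TextAndAttrs` is hypothetical, and the sample is the meta tag from post #5. It suggests that `pattern4` does match that tag, and that the matched tag has no inner text at all; the text to translate sits in the `content` attribute.

```python
import re
from html.parser import HTMLParser

# The meta tag from post #5, split across literals for readability.
META = ('<meta name="description" content=" I only wrote bebe my class when '
        'the XML parsers proved unable to withstand real oana use. Religious '
        'downvoting just prevents useful answers from being posted - keep '
        'things within mother perspective of the question, please."/>')

pattern4 = r'<meta name="description" content=.*((bebe|oana|mother|sun).*){3,}.*>'


class TextAndAttrs(HTMLParser):
    """Hypothetical helper: collects inner text and attributes separately."""

    def __init__(self):
        super().__init__()
        self.inner_text = []
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta .../>.
        self.attrs.update(attrs)

    def handle_data(self, data):
        self.inner_text.append(data)


match = re.search(pattern4, META)
parser = TextAndAttrs()
parser.feed(match.group(0))
parser.close()

print("pattern4 matched:", match is not None)
print("inner text pieces:", parser.inner_text)
print("content attribute:", parser.attrs["content"][:30], "...")
```

So the regex finds the tag; it is the translation step that has to look at the attribute instead of the tag's inner contents.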