Python Forum

Hi, I've just started to learn regular expressions and cannot figure out what I am doing wrong...

For example, I want to retrieve only the words written in capital letters AND located between the <p> symbols:

import re

s = re.findall(r'<p>([A-Z]+)<p>','<p> BLABLABLA BLABLAksdjf 123 <p> BLA')

The output I would like to see is BLABLABLA and BLABLA

But I get an empty list []

I would appreciate your help/explanation.

If you're trying to parse HTML, I would suggest using an HTML/XML parser instead. In BeautifulSoup the check would be soup.find('p').text.isupper():

from bs4 import BeautifulSoup

html = '''
<html>
<p>one</p>
<p>TWO</p>
<p>three</p>
<p>FOUR</p>
</html>
'''

soup = BeautifulSoup(html, 'lxml')
ps = soup.find_all('p')
for element in ps:
    if element.text.isupper():
        print(element)

Output:
<p>TWO</p>
<p>FOUR</p>
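If installing bs4 and lxml is not an option, roughly the same filtering can be sketched with the standard library's html.parser. The class name UpperParagraphs and its attributes are made up for this illustration, and the sketch assumes every <p> is closed by a matching </p>:

```python
from html.parser import HTMLParser


class UpperParagraphs(HTMLParser):
    """Collect the text of <p> elements whose text is all upper case."""

    def __init__(self):
        super().__init__()
        self.in_p = False    # True while the parser is inside a <p> element
        self.buffer = []     # text chunks of the current <p>
        self.matches = []    # texts of the upper-case paragraphs found so far

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            text = ''.join(self.buffer)
            if text.isupper():
                self.matches.append(text)
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


html = '''
<html>
<p>one</p>
<p>TWO</p>
<p>three</p>
<p>FOUR</p>
</html>
'''

parser = UpperParagraphs()
parser.feed(html)
parser.close()
print(parser.matches)
```

This is more code than the BeautifulSoup version, so it is probably only worth it when third-party packages cannot be used.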

EDIT:
Just saw that you were extracting from a single tag instead of multiple ones. I would still use a parser to extract the text from the HTML and then a simple regex to get the capital letters. That way you can extend the HTML parsing without writing a nasty regex, and keep the regex as simple as possible. The Pythonic approaches I tried that avoid manually looping over the string removed the whitespace between the capitalized words; otherwise I would suggest not using regex at all.

import re
from bs4 import BeautifulSoup

s = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'
soup = BeautifulSoup(s, 'lxml')

caps = re.findall('[A-Z]+',soup.p.text)
print(caps)

Output:
['BLABLABLA', 'BLABLA']
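For a variant without regex, one option that keeps the words separate is itertools.groupby, which splits the string into runs of upper-case and non-upper-case characters. This is only a sketch: the text value is typed in by hand here, on the assumption that it is what the parser returns for the first tag. Note that str.isupper is Unicode-aware, so unlike [A-Z] it would also accept capitals such as 'É'.

```python
from itertools import groupby

# Assumed to be the text extracted from the first <p> tag.
text = ' BLABLABLA BLABLAksdjf 123 '

# Group consecutive characters by whether they are upper-case letters,
# then keep only the upper-case runs.
caps = [''.join(run) for is_upper, run in groupby(text, key=str.isupper) if is_upper]
print(caps)
```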

Thanks, I'll try BeautifulSoup then

Of course, the canonical answer is this.

(Jun-04-2017, 10:24 AM) Ofnuts wrote: Of course, the canonical answer is this.

This snippet reminds me of playing the game "pony island".

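The issue you're facing comes partly from a small mistake in your regular expression pattern: the closing tag should be </p> instead of <p>. The other part is that [A-Z]+ has to match everything between the two tags, and your text also contains spaces, lowercase letters and digits there, so the pattern finds nothing.

Here I have updated your code:

import re

# Capture everything between the tags, then pull out the runs of capitals.
inner = re.search(r'<p>(.*?)</p>', '<p> BLABLABLA BLABLAksdjf 123 </p> BLA').group(1)
s = re.findall(r'[A-Z]+', inner)

print(s)

In the above code I changed the closing tag to </p>, both in the pattern and in the sample string. Instead of requiring [A-Z]+ to fill the whole space between the tags, the code first captures that text with (.*?) and then searches it for runs of capital letters, so it prints ['BLABLABLA', 'BLABLA'].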

(Aug-21-2023, 12:03 PM) Gaurav_Kumar wrote: The issue you're facing is because of a small mistake

Let's hope @rakhmadiev finally solved the issue after 6 years.

rakhmadiev

metulburr

rakhmadiev

Ofnuts

metulburr

Gaurav_Kumar

Gribouillis