Python Forum

Full Version: Regular Expression
Hi, I've just started to learn regular expressions and cannot figure out what I am doing wrong...

For example, I want to retrieve only the words written in capital letters AND between <p> symbols:

import re

s = re.findall(r'<p>([A-Z]+)<p>','<p> BLABLABLA BLABLAksdjf 123 <p> BLA')
The output I would like to see is BLABLABLA and BLABLA

But I get an empty list []

I would appreciate your help/explanation
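A quick way to see where the empty list comes from is to run the same pattern against a second string. This is only a small diagnostic sketch: the second input is a hypothetical string made up for illustration, and the variable names are not from the original post. The group ([A-Z]+) has to cover everything between the two <p> markers, so the space right after the first <p> (and the lowercase letters, digits and spaces further on) already stops the match.

```python
import re

pattern = r'<p>([A-Z]+)<p>'

# The string from the question: a space follows '<p>', and lowercase
# letters, digits and spaces sit between the two markers.
question_result = re.findall(pattern, '<p> BLABLABLA BLABLAksdjf 123 <p> BLA')
print(question_result)  # []

# Hypothetical input where ONLY capital letters sit between the markers.
clean_result = re.findall(pattern, '<p>BLABLABLA<p> BLA')
print(clean_result)  # ['BLABLABLA']
```

So the pattern itself is valid; it just describes a much stricter string than the one being searched.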
If you're trying to parse HTML, I would suggest using an HTML/XML parser instead. In BeautifulSoup it would be soup.find('p').text.isupper()
from bs4 import BeautifulSoup

html = '''
<html>
<p>one</p>
<p>TWO</p>
<p>three</p>
<p>FOUR</p>
</html>
'''

soup = BeautifulSoup(html, 'lxml')
ps = soup.find_all('p')
for element in ps:
   if element.text.isupper():
       print(element)
Output:
<p>TWO</p>
<p>FOUR</p>
EDIT:
Just saw that you were extracting from a single tag instead of multiple ones. I would still use a parser to extract the text from the HTML and just use a simple regex to get the capital letters. That way you can extend the HTML parsing without making a nasty regex, and keep the regex as simple as possible. The Pythonic approaches I tried without manually looping over the string removed the whitespace between the capitalized words; otherwise I would suggest not using regex at all.
import re
from bs4 import BeautifulSoup

s = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'
soup = BeautifulSoup(s, 'lxml')

caps = re.findall('[A-Z]+',soup.p.text)
print(caps)
Output:
['BLABLABLA', 'BLABLA']
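If installing bs4 and lxml is not an option, the same two-step idea (get the text first, then apply a simple regex) can probably be sketched with only the standard library. The ParagraphText class below is a hypothetical helper written for this example, not part of any library; it assumes that "the text of a paragraph" means everything from one <p> start tag up to the next one.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text that follows each <p> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # html.parser does not track nesting, so each <p> simply
        # starts a new text buffer.
        if tag == 'p':
            self.paragraphs.append('')

    def handle_data(self, data):
        # Text seen before the first <p> is ignored.
        if self.paragraphs:
            self.paragraphs[-1] += data


s = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'
parser = ParagraphText()
parser.feed(s)
parser.close()  # flush any text still buffered at the end of the input

first_p = parser.paragraphs[0] if parser.paragraphs else ''
caps = re.findall(r'[A-Z]+', first_p)
print(caps)  # ['BLABLABLA', 'BLABLA']
```

For anything beyond a toy string like this one, a real parser such as BeautifulSoup is likely the more robust choice, since it repairs broken markup for you.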
Thanks, I'll try BeautifulSoup then
Of course, the canonical answer is this.
(Jun-04-2017, 10:24 AM)Ofnuts Wrote: Of course, the canonical answer is this.
This snippet reminds me of playing the game "pony island".
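The issue you're facing comes from two things in your regular expression pattern. First, a paragraph is closed with </p>, not with a second <p>. Second, the group ([A-Z]+) has to match everything between the tags, so the spaces, lowercase letters and digits in your string make the match fail.

Here I have updated your code:

import re

text = '<p> BLABLABLA BLABLAksdjf 123 </p> BLA'

# Capture everything between <p> and </p>, then pick out the runs of capitals.
inner = re.search(r'<p>(.*?)</p>', text)
s = re.findall(r'[A-Z]+', inner.group(1)) if inner else []

print(s)
In the above code I changed the pattern to </p> to properly match the closing tag, and I capture the whole content between the tags with (.*?) instead of requiring it to consist only of capital letters. A second findall with [A-Z]+ then extracts the runs of capital letters from that content, which gives ['BLABLABLA', 'BLABLA'].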
(Aug-21-2023, 12:03 PM)Gaurav_Kumar Wrote: The issue you're facing is because of a small mistake
Let's hope @rakhmadiev finally solved the issue after 6 years. :D