Hi, I've just started to learn regular expressions and cannot figure out what I am doing wrong...
For example, I want to retrieve only the words written in capital letters AND between <p> tags:
import re
s = re.findall(r'<p>([A-Z]+)<p>','<p> BLABLABLA BLABLAksdjf 123 <p> BLA')
The output I would like to see is BLABLABLA and BLABLA.
But I get an empty list: []
I would appreciate your help/explanation.
If you're trying to parse HTML, I would suggest using an HTML/XML parser instead. In BeautifulSoup it would be soup.find('p').text.isupper()
from bs4 import BeautifulSoup
html = '''
<html>
<p>one</p>
<p>TWO</p>
<p>three</p>
<p>FOUR</p>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
ps = soup.find_all('p')
for element in ps:
    if element.text.isupper():
        print(element)
Output:
<p>TWO</p>
<p>FOUR</p>
EDIT:
Just saw that you were extracting from a single tag instead of multiple ones. I would still use a parser to extract the text from the HTML and then a simple regex to get the capital letters. That way you can extend the HTML parsing without writing a nasty regex, and keep the regex as simple as possible. The Pythonic approaches I tried that avoid manually looping over the string removed the whitespace between the capitalized words; otherwise I would suggest not using regex at all.
import re
from bs4 import BeautifulSoup
s = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'
soup = BeautifulSoup(s, 'lxml')
caps = re.findall('[A-Z]+',soup.p.text)
print(caps)
Output:
['BLABLABLA', 'BLABLA']
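For what it's worth, a regex-free variant seems possible with itertools.groupby, which groups consecutive characters by whether they are uppercase. This is only a sketch: the function name upper_runs is made up for illustration, the literal string stands in for soup.p.text, and str.isupper also accepts non-ASCII capitals, unlike [A-Z].

```python
from itertools import groupby


def upper_runs(text):
    """Return each run of consecutive uppercase characters in text."""
    # groupby starts a new group whenever str.isupper flips between True and False
    return [
        ''.join(chars)
        for is_upper, chars in groupby(text, key=str.isupper)
        if is_upper
    ]


# Stand-in for the text extracted from the <p> tag
print(upper_runs(' BLABLABLA BLABLAksdjf 123 '))  # ['BLABLABLA', 'BLABLA']
```

Spaces, digits and lowercase letters all end a run, so the result is the same as with the [A-Z]+ regex for plain ASCII input.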
Thanks, I'll try BeautifulSoup then
Of course, the canonical answer is this.
(Jun-04-2017, 10:24 AM) Ofnuts wrote: Of course, the canonical answer is this.
This snippet reminds me of playing the game "Pony Island".
The issue you're facing is because of a small mistake in your regular expression pattern. [A-Z]+ has to match everything between the two tags, but that text also contains spaces, lowercase letters and digits, so nothing matches. The closing tag should also be </p> instead of <p>.
Here is an updated version of your code:
import re
s = '<p> BLABLABLA BLABLAksdjf 123 </p> BLA'
# Grab the text between the tags first, then pick out the runs of capitals
inner = re.search(r'<p>(.*?)</p>', s).group(1)
print(re.findall(r'[A-Z]+', inner))
In the above code I changed the closing tag to </p> and split the work into two steps: re.search captures the text between the tags, and re.findall then extracts the runs of capital letters from it. This prints ['BLABLABLA', 'BLABLA'].
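To see why the pattern from the original question returns an empty list, it may help to try it on smaller strings. A minimal sketch (the two short sample strings are made up for illustration): [A-Z]+ has to cover everything between the two tags, so a single space is already enough to prevent a match.

```python
import re

pattern = r'<p>([A-Z]+)<p>'  # pattern from the original question

# Only capitals between the tags: the group can span the whole gap
only_caps = re.findall(pattern, '<p>BLA<p>')

# A space after the tag: [A-Z]+ cannot match it, so there is no match
with_space = re.findall(pattern, '<p> BLA<p>')

# The string from the question has spaces, lowercase letters and digits
question = re.findall(pattern, '<p> BLABLABLA BLABLAksdjf 123 <p> BLA')

print(only_caps)   # ['BLA']
print(with_space)  # []
print(question)    # []
```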
(Aug-21-2023, 12:03 PM) Gaurav_Kumar wrote: The issue you're facing is because of a small mistake
Let's hope @rakhmadiev finally solved the issue after 6 years.
