Python Forum
Regular Expression
#1
Hi, I've just started learning regular expressions and cannot figure out what I am doing wrong...

For example, I want to retrieve only the words written in capital letters AND located between <p> symbols:

import re

s = re.findall(r'<p>([A-Z]+)<p>','<p> BLABLABLA BLABLAksdjf 123 <p> BLA')
The output I would like to see is BLABLABLA and BLABLA.

But I get an empty list: []

I would appreciate your help/explanation.
Reply
#2
If you're trying to parse HTML, I would suggest using an HTML/XML parser instead. In BeautifulSoup it would be soup.find('p').text.isupper()
from bs4 import BeautifulSoup

html = '''
<html>
<p>one</p>
<p>TWO</p>
<p>three</p>
<p>FOUR</p>
</html>
'''

soup = BeautifulSoup(html, 'lxml')
ps = soup.find_all('p')
for element in ps:
    if element.text.isupper():
        print(element)
Output:
<p>TWO</p>
<p>FOUR</p>
EDIT:
Just saw that you were extracting from a single tag instead of multiple ones. I would still use a parser to extract the text from the HTML and then use a simple regex to get the capital letters. That way you can extend the HTML parsing without writing a nasty regex, and keep the regex as simple as possible. The Pythonic approaches I tried that avoid manually looping over the string removed the whitespace between the capitalized words; otherwise I would suggest not using regex at all.
import re
from bs4 import BeautifulSoup

s = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'
soup = BeautifulSoup(s, 'lxml')

caps = re.findall('[A-Z]+',soup.p.text)
print(caps)
Output:
['BLABLABLA', 'BLABLA']
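For comparison, here is a minimal regex-only sketch using just the standard library. It assumes the input really uses <p> as both the opening and the closing marker, exactly as in the first post, and the variable names are only illustrative. It also shows why the original pattern returns []: [A-Z]+ has to cover everything between the two markers, and that text contains spaces, lowercase letters and digits.

```python
import re

text = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'

# Original pattern: [A-Z]+ must match the *whole* text between the markers,
# which fails because that text also holds spaces, lowercase letters and digits.
original = re.findall(r'<p>([A-Z]+)<p>', text)

# Step 1: capture whatever sits between the first pair of <p> markers.
match = re.search(r'<p>(.*?)<p>', text)
inner = match.group(1) if match else ''

# Step 2: pull the runs of capital letters out of the captured text.
caps = re.findall(r'[A-Z]+', inner)

print(original)  # []
print(caps)      # ['BLABLABLA', 'BLABLA']
```

Splitting the job into "find the region" and "find the capitals" keeps both patterns short. This is fine for a fixed, known string, but for real HTML a parser is still the safer choice.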
Reply
#3
Thanks, I'll try BeautifulSoup then
Reply
#4
Of course, the canonical answer is this.
Reply
#5
(Jun-04-2017, 10:24 AM)Ofnuts Wrote: Of course, the canonical answer is this.
This snippet reminds me of playing the game "pony island".
Reply
#6
The issue you're facing is because of a small mistake in your regular expression pattern: [A-Z]+ has to match everything between the two tags, but that text also contains spaces, lowercase letters and digits, so nothing matches. In addition, an HTML paragraph is closed with </p>, not with a second <p>.

Here I have updated your code:

import re

text = '<p> BLABLABLA BLABLAksdjf 123 </p> BLA'
inner = re.search(r'<p>(.*?)</p>', text).group(1)  # text between the tags
s = re.findall(r'[A-Z]+', inner)  # runs of capital letters only

print(s)
In the above code I changed the closing tag to </p>, captured the text between the tags first, and then searched that text for runs of capital letters. It prints ['BLABLABLA', 'BLABLA'], and the BLA after the closing tag is ignored.
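If the text contains several paragraphs, the same idea can be wrapped in a small helper. This is only a sketch, assuming well-formed <p>...</p> pairs without attributes or nesting; the function name caps_in_paragraphs and the sample string are made up for illustration.

```python
import re

def caps_in_paragraphs(html):
    """Return one list of capital-letter runs per <p>...</p> block."""
    # re.DOTALL lets '.' cross line breaks; the non-greedy .*? stops at
    # the first closing </p> instead of running on to the last one.
    paragraphs = re.findall(r'<p>(.*?)</p>', html, flags=re.DOTALL)
    return [re.findall(r'[A-Z]+', p) for p in paragraphs]

# Hypothetical sample: two paragraphs with stray text between them.
sample = '<p> BLABLABLA BLABLAksdjf 123 </p> BLA <p>one TWO</p>'
print(caps_in_paragraphs(sample))
# [['BLABLABLA', 'BLABLA'], ['TWO']]
```

Note that [A-Z]+ also picks up the capital prefix of a mixed-case word (BLABLA from BLABLAksdjf), which is what the original question asked for. Using \b[A-Z]+\b instead would keep only words made up entirely of capitals.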
Reply
#7
(Aug-21-2023, 12:03 PM)Gaurav_Kumar Wrote: The issue you're facing is because of a small mistake
Let's hope @rakhmadiev finally solved the issue after 6 years. :D
Reply

