Regular Expression

rakhmadiev · (This post was last modified: May-31-2017, 10:48 PM by rakhmadiev.)

Hi, I've just started to learn regular expressions and cannot figure out what I am doing wrong...

For example, I want to retrieve only the words written with capital letters AND between the <p> symbols:

import re

s = re.findall(r'<p>([A-Z]+)<p>','<p> BLABLABLA BLABLAksdjf 123 <p> BLA')

The output I would like to see is BLABLABLA and BLABLA

But I get an empty list []

I would appreciate your help/explanation.

***metulburr*** · (This post was last modified: Jun-01-2017, 01:39 AM by metulburr.)

If you're trying to parse HTML, I would suggest using an HTML/XML parser instead. In BeautifulSoup it would be soup.find('p').text.isupper()

from bs4 import BeautifulSoup

html = '''
<html>
<p>one</p>
<p>TWO</p>
<p>three</p>
<p>FOUR</p>
</html>
'''

soup = BeautifulSoup(html, 'lxml')
ps = soup.find_all('p')
for element in ps:
    if element.text.isupper():
        print(element)

Output:
<p>TWO</p>
<p>FOUR</p>

EDIT:
Just saw that you were extracting from a single tag instead of multiple ones. I would still use a parser to extract the text from the HTML and then a simple regex to get the capital letters. That way you can extend the HTML parsing without writing a nasty regex, and keep the regex as simple as possible. The Pythonic approaches I tried that avoid manually looping over the string removed the whitespace between the capitalized words; otherwise I would suggest not using regex at all.

import re
from bs4 import BeautifulSoup

s = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'
soup = BeautifulSoup(s, 'lxml')

caps = re.findall(r'[A-Z]+', soup.p.text)
print(caps)

Output:
['BLABLABLA', 'BLABLA']
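
For reference, the regex-free route mentioned in the EDIT can be sketched with itertools.groupby, which groups consecutive characters by whether they are capital letters and so never touches the whitespace between words. This is only an illustration: capital_runs is a made-up helper name, and the input string is the text inside the first <p>, typed in by hand rather than taken from a parser.

```python
from itertools import groupby


def capital_runs(text):
    """Return each maximal run of consecutive capital ASCII letters in text."""
    runs = []
    # groupby yields (key, group) pairs for consecutive characters sharing a key
    for is_cap, chars in groupby(text, key=lambda ch: "A" <= ch <= "Z"):
        if is_cap:
            runs.append("".join(chars))
    return runs


# Text between the <p> markers of the original string (hard-coded assumption)
inner = " BLABLABLA BLABLAksdjf 123 "
print(capital_runs(inner))  # ['BLABLABLA', 'BLABLA']
```

Whether this reads better than re.findall(r'[A-Z]+', ...) is a matter of taste; the regex version is shorter.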

rakhmadiev · Jun-03-2017, 11:05 PM

Thanks, I'll try BeautifulSoup then

***Ofnuts*** · Jun-04-2017, 10:24 AM

Of course, the canonical answer is this.

***metulburr*** · Jun-04-2017, 05:47 PM

(Jun-04-2017, 10:24 AM)Ofnuts Wrote: Of course, the canonical answer is this.

This snippet reminds me of playing the game "pony island".

Gaurav_Kumar · Aug-21-2023, 12:03 PM

The issue you're facing is because of a small mistake in your regular expression pattern. The closing tag in the pattern should be </p> instead of <p>.

Here I have updated your code:

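import re

# Capture the text between <p> and </p>, then pick out the runs of capital letters
inner = re.search(r'<p>(.*?)</p>', '<p> BLABLABLA BLABLAksdjf 123 </p> BLA').group(1)
s = re.findall(r'[A-Z]+', inner)

print(s)

In the above code I changed the pattern to use </p> so that it properly matches the closing tag. I also replaced the group ([A-Z]+) with (.*?), because the text between the tags contains spaces, digits, and lowercase letters that [A-Z]+ alone cannot match, and then applied [A-Z]+ to the captured text. Now the code extracts the words written with capital letters between the tags: ['BLABLABLA', 'BLABLA'].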
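
A short sketch of why the pattern from the first post finds nothing, using only the re module. The capture group ([A-Z]+) has to account for every character between the two <p> markers, so a single space, digit, or lowercase letter in between rules out a match. The variable names below are mine and purely illustrative.

```python
import re

original = '<p> BLABLABLA BLABLAksdjf 123 <p> BLA'

# ([A-Z]+) must cover everything between the two <p> markers,
# but that stretch contains spaces, lowercase letters and digits.
strict = re.findall(r'<p>([A-Z]+)<p>', original)

# The same pattern succeeds only when capitals alone sit between the markers.
only_caps = re.findall(r'<p>([A-Z]+)<p>', '<p>BLABLABLA<p>')

# Loosening the group to "anything, as little as possible" captures the whole stretch.
loose = re.findall(r'<p>(.*?)<p>', original)

print(strict)     # []
print(only_caps)  # ['BLABLABLA']
print(loose)      # [' BLABLABLA BLABLAksdjf 123 ']
```

Once the whole stretch is captured, a second pass with [A-Z]+ over it is enough to pull out the capital-letter words.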

**Gribouillis** · (This post was last modified: Aug-21-2023, 01:53 PM by Gribouillis.)

(Aug-21-2023, 12:03 PM)Gaurav_Kumar Wrote: The issue you're facing is because of a small mistake

Let's hope @rakhmadiev finally solved the issue after 6 years. :D
