Python Forum

Full Version: '|' character within Regex returns a tuple?
Pages: 1 2
Hi,
Using the '|' character within a Regex is giving me an undesirable result that I have been unable to avoid. For example, consider a 2-page file with the following text in each page:

Page 1:
111A111 #red.

Page 2:
AAA1AAA #green.

for i in range(0,2):
    text = doc.getPage(i).extract_text()

    color_re = re.compile(r'#\w+\.')
    color = color_re.findall(text)
    print(color)
Output:
['red.'] ['green.']
    pattern_re = re.compile(r'(\w+\d+\w+)|(\d+\w+\d+)')
    pattern = pattern_re.findall(text)
    print(pattern)
Output:
('', 'AAA1AAA') ('111A111', '')

If I do:
color =[item.strip('.') for item in color]
I get rid of '.' so, all is good.

But if I do:
pattern = [item.strip(' , ') for item in pattern]
I get the error:
Output:
AttributeError: 'tuple' object has no attribute 'strip'
Is there a way to avoid this error? I need to get rid of the spaces and commas in 'pattern'.
Thanks and apologies in advance if the question is not properly formulated. I'm a beginner.
Hi pprod,

You might be getting that error because of the spaces inside the quotes around the comma in item.strip. Try:

pattern = [item.strip(',') for item in pattern]
Best Regards

Eddie Winch
(Feb-19-2021, 04:33 PM) eddywinch82 wrote: try pattern = [item.strip(',') for item in pattern]

Thanks, Eddie. I've tried your suggestion but it still doesn't work. Please note that I updated the post and amended the output of print(pattern). Apologies I can't provide the full code and the file I'm using as it is confidential.
For removing the spaces try :-

pattern = [item.replace(" ", "") for item in pattern]
I hope that works for you.

Regards

Eddie Winch
For the removal of commas, maybe try :-

pattern =[item.strip(',') for item in pattern]
Regards

Eddie Winch
And for the spaces removal, if the following doesn't work :-

pattern = [item.replace(" ", "") for item in pattern]
Try :-

pattern =[item.replace(" ", "") for item in pattern]
Eddie Winch
Still no luck. I keep getting the error:
Output:
AttributeError: 'tuple' object has no attribute 'strip'
Output:
AttributeError: 'tuple' object has no attribute 'replace'
I suspect it has to do with the character '|' within the Regex as I don't get this error for the variable 'color'. Maybe if I convert the tuple to a list then I can use strip() or replace()? Thanks for your time.
Oops. double posted.
When you set up a capturing regex, it numbers the capturing parentheses from left to right. So in a pattern like this:
>>> re.findall(r"(\w+\d+\w+)|(\d+\w+\d+)", "AAA1AAA")
[('AAA1AAA', '')]
you get a tuple with each element being the capture from each capture group.

It's not the pipe character, it's the parentheses.
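
To check that claim, here is a small sketch that tries the pipe without parentheses and the parentheses without a pipe. The second regex is made up purely for the comparison and is not from the original question.

```python
import re

text = "AAA1AAA"

# Alternation with no parentheses at all: findall returns plain strings.
no_groups = re.findall(r"\w+\d+\w+|\d+\w+\d+", text)
print(no_groups)   # ['AAA1AAA']

# Two capture groups and no pipe (hypothetical pattern): findall returns tuples.
two_groups = re.findall(r"(\w+)(\d+\w+)", text)
print(two_groups)  # [('AAA', '1AAA')]
```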

Even if only one can match, they're still numbered and set from left to right. So each match you get back is a tuple with one element per capture group, and a group that didn't take part in the match comes back as an empty string. To find what's in there, you can either loop through the elements of the tuple, or you can rewrite the regex so there's only one capture group (or none).
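
A minimal sketch of the looping option, keeping the original regex and dropping the empty strings from each tuple. The helper name flatten_matches and the sample strings (taken from the question) are just for illustration.

```python
import re

pattern_re = re.compile(r"(\w+\d+\w+)|(\d+\w+\d+)")

def flatten_matches(text):
    """Return the non-empty captures from every match as one flat list."""
    matches = pattern_re.findall(text)  # list of tuples, one tuple per match
    return [group for groups in matches for group in groups if group]

for text in ["111A111 #red.", "AAA1AAA #green."]:
    print(flatten_matches(text))
```

The result is a list of ordinary strings, so str methods like strip() or replace() work on each item.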

If the parenthesis starts with ?:, then it won't be a capture group. That allows the pattern match to go back to "the entire pattern" and you don't have a tuple any longer.

>>> re.findall(r"(?:\w+\d+\w+)|(?:\d+\w+\d+)", "AAA1AAA")
['AAA1AAA']
>>> re.findall(r"(?:\w+\d+\w+)|(?:\d+\w+\d+)", "111A111")
['111A111']
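
Putting that together for the two pages in the question, a rough sketch might look like the following. The pages list is a stand-in for the text extracted from the PDF, and the capture group in color_re is my own tweak (not in the original code) so the '#' and '.' don't need stripping afterwards.

```python
import re

# One non-capturing group around the whole alternation: with no capture
# groups, findall returns the full matched strings.
pattern_re = re.compile(r"(?:\w+\d+\w+|\d+\w+\d+)")

# Exactly one capture group: findall returns just that group's text.
color_re = re.compile(r"#(\w+)\.")

# Stand-ins for the text of each page (assumed, taken from the question).
pages = ["111A111 #red.", "AAA1AAA #green."]

results = [(pattern_re.findall(text), color_re.findall(text)) for text in pages]
for patterns, colors in results:
    print(patterns, colors)
```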
(Feb-19-2021, 05:15 PM) bowlofred wrote: Oops. double posted.

Thanks, bowlofred. That worked fine. I don't think I'd have figured that out any time soon.
Thanks guys!