Python Forum
pandas str.extract multiple regex groups with OR - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: pandas str.extract multiple regex groups with OR (/thread-23272.html)



pandas str.extract multiple regex groups with OR - pythonidae - Dec-19-2019

Hi. I have a column of text with inputs like, 3" deep, 4 inches deep, 5" depth. I wanted to extract just the value and put this in a new column.

From regex101.com, I confirm the following captures any of these for all my examples:
r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep'
However, when I try to use this regex pattern with str.extract as follows
df['depth']=df['text'].str.extract(r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep')
Error:
ValueError: Wrong number of items passed 3, placement implies 1
The docs say it should just return the first match. I tried str.extractall just in case, but this produces
Error:
TypeError: incompatible index of inserted column with frame index
How do I pass regex OR patterns to str.extract? Or, how should I match each of these cases to create a new column?

Thanks!


RE: pandas str.extract multiple regex groups with OR - perfringo - Dec-19-2019

With sample dataset one can achieve desired result as follows:

>>> df = pd.DataFrame('3" deep, 4 inches deep, 5" depth'.split(','), columns=['Depth'])
>>> df
          Depth
0         3" deep
1   4 inches deep
2        5" depth
>>> df.Depth.str.extract('(\d+)')                                     
   0
0  3
1  4
2  5
EDIT: if numeric value is needed then it probably should be int (or float) datatype. So one can do:

>>> df['Depth_number'] = df.Depth.str.extract('(\d+)').astype(int)
>>> df
            Depth  Depth_number
0         3" deep             3
1   4 inches deep             4
2        5" depth             5
>>> df.dtypes
Depth           object
Depth_number     int64
dtype: object



RE: pandas str.extract multiple regex groups with OR - pythonidae - Dec-19-2019

Unfortunately the text contains other unrelated numbers, such as 25 items, 2" long, 4 inches deep so I only want the values when they match the regex I provided. I also seem to have a common use case for "OR" regex group matching for extracting other data (e.g. extracting an ID from a text field when it takes one or another discreet pattern). The other way I see to achieve it is to run str.extract for each group creating as many new columns as match groups, and then combine these afterwards. This just seemed inefficient, but perhaps this is the only way possible with str.extract.

regex = r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep')
df['depth1']=df['text'].str.extract(r'(\d)\"\s*deep')
df['depth2']=df['text'].str.extract(r'(\d)\"\s*depth')
df['depth3']=df['text'].str.extract(r'(\d)\sinches\sdeep')

df['depth_final'] = df['depth1'].where(df['depth1'].notnull(), df['depth2'])
df['depth_final'] = df['depth_final'].where(df['depth_final'].notnull(), df['depth3'])
df = df.drop(['depth1','depth2', 'depth3'],axis=1)