pandas str.extract multiple regex groups with OR

pythonidae · Dec-19-2019, 09:38 AM

Hi. I have a column of text with inputs like, 3" deep, 4 inches deep, 5" depth. I wanted to extract just the value and put this in a new column.

From regex101.com, I confirm the following captures any of these for all my examples:

r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep'

However, when I try to use this regex pattern with str.extract as follows

df['depth']=df['text'].str.extract(r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep')

Error:
ValueError: Wrong number of items passed 3, placement implies 1

The docs say it should just return the first match. I tried str.extractall just in case, but this produces

Error:
TypeError: incompatible index of inserted column with frame index

How do I pass regex OR patterns to str.extract? Or, how should I match each of these cases to create a new column?

Thanks!

**perfringo** · (This post was last modified: Dec-19-2019, 02:17 PM by perfringo.)

With sample dataset one can achieve desired result as follows:

>>> df = pd.DataFrame('3" deep, 4 inches deep, 5" depth'.split(','), columns=['Depth'])
>>> df
          Depth
0         3" deep
1   4 inches deep
2        5" depth
>>> df.Depth.str.extract('(\d+)')                                     
   0
0  3
1  4
2  5

EDIT: if numeric value is needed then it probably should be int (or float) datatype. So one can do:

>>> df['Depth_number'] = df.Depth.str.extract('(\d+)').astype(int)
>>> df
            Depth  Depth_number
0         3" deep             3
1   4 inches deep             4
2        5" depth             5
>>> df.dtypes
Depth           object
Depth_number     int64
dtype: object

pythonidae · (This post was last modified: Dec-19-2019, 05:43 PM by pythonidae.)

Unfortunately the text contains other unrelated numbers, such as 25 items, 2" long, 4 inches deep so I only want the values when they match the regex I provided. I also seem to have a common use case for "OR" regex group matching for extracting other data (e.g. extracting an ID from a text field when it takes one or another discreet pattern). The other way I see to achieve it is to run str.extract for each group creating as many new columns as match groups, and then combine these afterwards. This just seemed inefficient, but perhaps this is the only way possible with str.extract.

regex = r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep')
df['depth1']=df['text'].str.extract(r'(\d)\"\s*deep')
df['depth2']=df['text'].str.extract(r'(\d)\"\s*depth')
df['depth3']=df['text'].str.extract(r'(\d)\sinches\sdeep')

df['depth_final'] = df['depth1'].where(df['depth1'].notnull(), df['depth2'])
df['depth_final'] = df['depth_final'].where(df['depth_final'].notnull(), df['depth3'])
df = df.drop(['depth1','depth2', 'depth3'],axis=1)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Import multiple CSV files into pandas	Krayna	0	1,717	May-20-2021, 04:56 PM Last Post: Krayna
	Weighted average with multiple weights and groups	amyd	0	2,110	Oct-11-2019, 10:30 AM Last Post: amyd
	Reading Multiple Sheets using Pandas	dhiliptcs	1	4,045	Sep-30-2019, 11:26 PM Last Post: scidam
	Handling multiple errors when using datafiles in Pandas	alphanov	1	1,845	Jul-16-2019, 03:17 AM Last Post: scidam
	How to extract different data groups from multiple CSV files using python	Rafiz	3	3,242	Jun-04-2019, 05:20 PM Last Post: jefsummers
	extract specific content in a pandas dataframe with a regex?	steve1040	0	13,538	Oct-05-2017, 03:17 AM Last Post: steve1040

pandas str.extract multiple regex groups with OR

User Panel Messages

Announcements