Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 pandas str.extract multiple regex groups with OR
#1
Hi. I have a column of text with inputs like, 3" deep, 4 inches deep, 5" depth. I wanted to extract just the value and put this in a new column.

From regex101.com, I confirm the following captures any of these for all my examples:
r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep'
However, when I try to use this regex pattern with str.extract as follows
df['depth']=df['text'].str.extract(r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep')
Error:
ValueError: Wrong number of items passed 3, placement implies 1
The docs say it should just return the first match. I tried str.extractall just in case, but this produces
Error:
TypeError: incompatible index of inserted column with frame index
How do I pass regex OR patterns to str.extract? Or, how should I match each of these cases to create a new column?

Thanks!
Quote
#2
With sample dataset one can achieve desired result as follows:

>>> df = pd.DataFrame('3" deep, 4 inches deep, 5" depth'.split(','), columns=['Depth'])
>>> df
          Depth
0         3" deep
1   4 inches deep
2        5" depth
>>> df.Depth.str.extract('(\d+)')                                     
   0
0  3
1  4
2  5
EDIT: if numeric value is needed then it probably should be int (or float) datatype. So one can do:

>>> df['Depth_number'] = df.Depth.str.extract('(\d+)').astype(int)
>>> df
            Depth  Depth_number
0         3" deep             3
1   4 inches deep             4
2        5" depth             5
>>> df.dtypes
Depth           object
Depth_number     int64
dtype: object
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Quote
#3
Unfortunately the text contains other unrelated numbers, such as 25 items, 2" long, 4 inches deep so I only want the values when they match the regex I provided. I also seem to have a common use case for "OR" regex group matching for extracting other data (e.g. extracting an ID from a text field when it takes one or another discreet pattern). The other way I see to achieve it is to run str.extract for each group creating as many new columns as match groups, and then combine these afterwards. This just seemed inefficient, but perhaps this is the only way possible with str.extract.

regex = r'(\d)\"\s*deep|(\d)\"\s*depth|(\d)\sinches\sdeep')
df['depth1']=df['text'].str.extract(r'(\d)\"\s*deep')
df['depth2']=df['text'].str.extract(r'(\d)\"\s*depth')
df['depth3']=df['text'].str.extract(r'(\d)\sinches\sdeep')

df['depth_final'] = df['depth1'].where(df['depth1'].notnull(), df['depth2'])
df['depth_final'] = df['depth_final'].where(df['depth_final'].notnull(), df['depth3'])
df = df.drop(['depth1','depth2', 'depth3'],axis=1)
Quote

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  Weighted average with multiple weights and groups amyd 0 262 Oct-11-2019, 10:30 AM
Last Post: amyd
  Reading Multiple Sheets using Pandas dhiliptcs 1 502 Sep-30-2019, 11:26 PM
Last Post: scidam
  Handling multiple errors when using datafiles in Pandas alphanov 1 279 Jul-16-2019, 03:17 AM
Last Post: scidam
  How to extract different data groups from multiple CSV files using python Rafiz 3 581 Jun-04-2019, 05:20 PM
Last Post: jefsummers
  extract specific content in a pandas dataframe with a regex? steve1040 0 9,075 Oct-05-2017, 03:17 AM
Last Post: steve1040

Forum Jump:


Users browsing this thread: 1 Guest(s)