Python Forum

Full Version: Pandas/Dataframes, Strings and Regular Expressions...
Hi all

I am new to the Python world (20 years ago I did some C/C++). For a newcomer, I have been able to achieve quite a lot so far. I successfully avoided regex all these years, but that seems to be changing now...

I haven't found a solution to this problem so far, and I also think I have not yet fully understood the indexing/selecting mechanism.

I have a data frame 'data_total' (what a thrilling name...) with the column INFO. It contains strings like 'X-Z-34567A' or 'X-Y-123456'.
I'd like to extract the numbers into a new column INFO_NR. A trailing letter, if present, should be replaced with a '0'.
In the end, the data should read '345670' and '123456'.

First I tried a slightly different approach: I extracted the number part, converted it to int and multiplied it by 10.

See the following code snippet:

# this processes the X-Z-34567A correctly, fills the fields of the other rows with nan
# (raw string so that \w and \d are not treated as string escapes; [A-Z] covers every capital letter)
data_total['INFO_NR'] = data_total['INFO'].str.extract(r'^X-\w-(\d*)[A-Z]$', expand=False).str.strip()
data_total['INFO_NR'] = data_total['INFO_NR'].fillna('0')
data_total['INFO_NR'] = data_total['INFO_NR'].astype(np.int64)*10

# this processes the X-Y-123456 correctly, but fills the previously processed fields with nan!!
data_total['INFO_NR'] = data_total['INFO'].str.extract(r'^X-\w-(\d*)$', expand=False).str.strip()
data_total['INFO_NR'] = data_total['INFO_NR'].astype(np.int64)*10
Both regexes work, but the second one wipes out the results of the first. How can I apply the second regex only to the rows with INFO_NR == 0, without deleting the first results?
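
For reference, here is a minimal sketch of what "only the rows with INFO_NR == 0" could look like with a boolean mask and .loc. The two sample rows are made up from the examples above, the names first, todo and second are just illustrative, and it assumes that 0 never occurs as a real value in INFO_NR. Note that it produces integers rather than strings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data built from the two example strings in the post.
data_total = pd.DataFrame({"INFO": ["X-Z-34567A", "X-Y-123456"]})

# Pass 1: digits followed by one capital letter; non-matching rows become 0.
first = data_total["INFO"].str.extract(r"^X-\w-(\d+)[A-Z]$", expand=False)
data_total["INFO_NR"] = first.fillna("0").astype(np.int64) * 10

# Pass 2: build a boolean mask and write only into the rows it selects.
todo = data_total["INFO_NR"] == 0
second = data_total.loc[todo, "INFO"].str.extract(r"^X-\w-(\d+)$", expand=False)
data_total.loc[todo, "INFO_NR"] = second.fillna("0").astype(np.int64)

print(data_total)
```

The point is that data_total.loc[todo, 'INFO_NR'] = ... assigns to the masked rows only, whereas data_total['INFO_NR'] = ... replaces the whole column.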

And from what I have learned about Python so far, there should be a much more elegant solution out there :)
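
One candidate for a shorter version, as a sketch only: a single pattern with an optional trailing letter, so that no second pass is needed. The sample rows (including the 'unrelated' one) and the name parts are assumptions for illustration, and the result is kept as strings, as in '345670' and '123456'.

```python
import pandas as pd

# Hypothetical sample data; the third row shows what happens without a match.
data_total = pd.DataFrame({"INFO": ["X-Z-34567A", "X-Y-123456", "unrelated"]})

# One pattern for both shapes:
# column 0 = the digits, column 1 = the optional trailing capital letter.
parts = data_total["INFO"].str.extract(r"^X-\w-(\d+)([A-Z]?)$")

# Turn a trailing letter into '0' and append it to the digits.
# Rows that did not match the pattern stay NaN.
data_total["INFO_NR"] = parts[0] + parts[1].str.replace(r"[A-Z]", "0", regex=True)

print(data_total)
```

If integers are needed afterwards, the NaN rows would have to be filled or dropped before calling astype.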

Looking forward to your input.
Thank you
Stephan