Substr on Pandas Dataframe - Printable Version

Substr on Pandas Dataframe - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Substr on Pandas Dataframe (/thread-20818.html)

Substr on Pandas Dataframe - Scott - Sep-01-2019

Hi everyone,

I have a DF and I want to set an if statement in a function to sum a value if the first part of a field = '10'. This would be easy in SAS with the substr function. Can I do it in a dataframe or do I need to put it into an array and slice?

I have pasted the DF below, the column headers don't align well but you can make it out.

Output:	HSC	Country	Month	Imports_(NZD)	Harmonised System Description
0	101210015	New Zealand	201903	191,550	Horses; live, pure-bred breeding animals, thor...
1	101210015	New Zealand	201904	190,550	Horses; live, pure-bred breeding animals, thor...
2	101290010	New Zealand	201903	76,660	Horses; live, other than pure-bred breeding an...
3	101290010	New Zealand	201904	1,187,430	Horses; live, other than pure-bred breeding an...
4	101290013	New Zealand	201904	1,257,700	Horses; live, other than pure-bred breeding an...

What i want is an output with month as the index and then a new variable summed Import by is substr(hsc,0,2) = '01' which is grouped by month. I just want help with first variable and then I am going to create a few more summs based on the HSC that are grouped by month and have them as the new columns.

I hope that makes sense. Please let me know if you need more info.

Thanks

RE: Substr on Pandas Dataframe - scidam - Sep-02-2019

(Sep-01-2019, 06:21 AM)Scott Wrote: I have a DF and I want to set an if statement in a function to sum a value if the first part of a field = '10'.

You need to convert values to strings first and use .str.startswith method.

Take a look at the following minimal example I just wrote:

import pandas as pd
df = pd.DataFrame({"x": [100, 1000, 1000, 1919, 124], "y": [1, 2, 3, 4, 5]})
df.loc[df.x.astype(str).str.startswith('10'), 'y'].sum()