Python Forum
DataFrame.astype('category') duplicates column
#1
Hi. I have a problem converting an object column to category.
My data shape is (1000000, 6) [Date, object, object, object, int64, column_1].
When I use the code below, it duplicates the last column, column_1.

df.column_1 = df.column_1.astype('category')

Before the conversion the column has object dtype; after the conversion it shows as category, but it is already duplicated.

One more point: the label of the duplicated column has whitespace at the end.

Thanks in advance.
Reply
#2
(Apr-18-2018, 05:31 AM)garikhgh0 Wrote: before conversion it is in object type, after conversion it shows category but already duplicated.

You can get unique values of the categorical column as follows:

df.column_1 = df.column_1.astype('category')
df.column_1.cat.categories  # unique categories
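
As a quick sanity check that the conversion by itself leaves the frame alone, here is a minimal sketch on a made-up frame (the `amount` column and all values are invented for illustration). It compares the shape and column labels before and after `astype('category')`:

```python
import pandas as pd

# Toy frame standing in for the real data; all values are made up.
df = pd.DataFrame({"amount": [1, 2, 3, 4], "column_1": ["a", "b", "a", "c"]})

shape_before = df.shape
labels_before = list(df.columns)

# Bracket assignment replaces the existing column under the same label.
df["column_1"] = df["column_1"].astype("category")

print(df.dtypes)
print(list(df["column_1"].cat.categories))  # unique categories
print(df.shape == shape_before, list(df.columns) == labels_before)
```

If the shape or the labels differ on the real data, the extra column probably comes from somewhere other than the dtype conversion.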
(Apr-18-2018, 05:31 AM)garikhgh0 Wrote: the label of the duplicated column contains whitespace in the end of it.
I didn't understand this part, but if you want to remove duplicates from the original data frame, you can use the drop_duplicates method.
For example:
df = df.drop_duplicates(['column_1'])  # or append .reset_index(drop=True) if needed 
# removes rows with duplicated values in column_1
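
One possible reading of the trailing-whitespace remark is that the frame holds two distinct labels that differ only by a space, so they look like duplicates when printed. That is an assumption about the data, not something confirmed in this thread. A small sketch with a hypothetical frame shows how to make such labels visible:

```python
import pandas as pd

# Hypothetical frame with two labels that differ only by a trailing space.
df = pd.DataFrame({"column_1": ["a", "b", "a"], "column_1 ": ["a", "b", "a"]})

# repr() makes trailing whitespace visible in each label.
print([repr(c) for c in df.columns])

# Labels that change when stripped carry stray whitespace.
suspect = [c for c in df.columns if isinstance(c, str) and c != c.strip()]
print(suspect)

# Labels that collide once stripped; these look like duplicates when printed.
stripped = df.columns.str.strip()
collisions = sorted(set(stripped[stripped.duplicated()]))
print(collisions)
```

If `suspect` is non-empty on the real data, it is worth checking where the padded label was created before deciding which column to keep.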
Reply
#3
Thanks a lot. I would also mention that after converting object columns to category, DataFrame.pivot_table does not work correctly: it creates duplicates.
Reply
#4
(Apr-18-2018, 07:33 AM)garikhgh0 Wrote: the DataFrame.pivot_table does not work correctly

I tried to reproduce this, but couldn't trigger the problem:

import numpy as np  # pd.np was removed in pandas 2.0, so import numpy directly
import pandas as pd
data = pd.DataFrame({'x': np.random.randint(0, 100, 1000), 'y': np.random.choice(['a', 'b', 'c'], 1000)})
pd.pivot_table(data, aggfunc='sum', values='x', columns=['y'])
Output:
y      a      b      c
x  16924  16650  16377
# change the column type
data['y'] = data['y'].astype('category')
pd.pivot_table(data, aggfunc='sum', values='x', columns=['y'])
# the result is the same...
Output:
y      a      b      c
x  16924  16650  16377
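
The comparison above can also be made repeatable and checked in code instead of by eye. This is only a sketch: the seed, the variable names, and the explicit `observed=True` argument (which limits the result to categories that actually occur) are my own choices, not part of the original test.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random data is the same on every run.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "x": rng.integers(0, 100, 1000),
    "y": rng.choice(["a", "b", "c"], 1000),
})

# Pivot while 'y' still holds plain strings.
before = pd.pivot_table(data, values="x", columns=["y"], aggfunc="sum")

# Convert to category and pivot again.
data["y"] = data["y"].astype("category")
after = pd.pivot_table(data, values="x", columns=["y"], aggfunc="sum",
                       observed=True)

# Same column labels and same sums in both results.
same_labels = list(before.columns) == list(after.columns)
same_values = bool((before.to_numpy() == after.to_numpy()).all())
print(same_labels, same_values)
```

Running the same two checks on a small sample of the real data should show whether the pivot actually changes after the conversion.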
Reply

