Python Forum
Does the order of columns in the DataFrame matter?
#1
Hi, I noticed that sometimes, when doing certain operations, the order of the columns in a DataFrame gets changed automatically. The same thing happened when I tried some examples from books: using the same commands, my DataFrames show the columns in a different order from the ones shown in the books. Does the order of DataFrame columns matter in Python/pandas/NumPy?
Reply
#2
Yes, it does.
Look at the following example

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5)) 
print(df.values)  # print corresponding numpy array
print(df[[2,1,3,4,0]].values) # reorder columns and print
The answer depends on how you access the data in the DataFrame. If you access columns by name, e.g. df.loc[:, 'some_name'], and never use position-based access such as df.iloc[:, 4], you do not need to worry about the order of the columns.
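A minimal sketch of the difference described above. The column names a, b and c and the np.arange data are illustrative assumptions, chosen so the result is deterministic; they are not from the original post.

```python
import numpy as np
import pandas as pd

# Illustrative frame: three named columns with deterministic values.
df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=['a', 'b', 'c'])

# Same data, columns shown in a different order.
reordered = df[['c', 'a', 'b']]

# Label-based access ignores column order: both return column 'a'.
same_by_label = df.loc[:, 'a'].equals(reordered.loc[:, 'a'])

# Position-based access does not: position 0 is 'a' in one frame, 'c' in the other.
same_by_position = df.iloc[:, 0].equals(reordered.iloc[:, 0])

print(same_by_label, same_by_position)  # True False
```

The same applies to df.values: the NumPy array follows whatever column order the frame currently has.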
Reply
#3
Thanks. I came across the following example:

# Example 2

In [194]: lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
     ...:                       'key2': [2000, 2001, 2002, 2001, 2002], 
     ...:                       'data': np.arange(5.)})  
In [196]: lefth                                                                                           
Out[196]: 
     key1  key2  data
0    Ohio  2000   0.0
1    Ohio  2001   1.0
2    Ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0
As shown above, on my machine the columns are listed as key1, key2 and data, which seems to follow the order in which I entered them in the pd.DataFrame command. However, the person who made this example has the columns displayed as data followed by key1 and key2, using the same command. How come? I don't remember exactly, but I think somebody mentioned that, depending on which version of Python is used, the columns could be arranged differently. Is this true?

Does that mean it is always better to access columns by name, because the column order could differ for unknown reasons and people could get different results or even errors when using index-based access?
Reply
#4
(Feb-13-2020, 05:12 PM)new_to_python Wrote: As shown above, on my machine the columns are listed as key1, key2 and data, which seems to follow the order in which I entered them in the pd.DataFrame command. However, the person who made this example has the columns displayed as data followed by key1 and key2, using the same command. How come? I don't remember exactly, but I think somebody mentioned that, depending on which version of Python is used, the columns could be arranged differently. Is this true?

This depends on the implementation of the dict data structure in Python. Prior to Python 3.6, dict was an unordered collection of (key, value) pairs.
So, when iterating over a dict, you could get the items in a different order (at least across Python versions and implementations), which leads to a different order of columns in the pandas DataFrame. However, since CPython 3.6 (and Python 3.7 for every other implementation), dict preserves the order in which items were inserted.

In general, to be sure the order of columns is correct, you can always do:
df = df.loc[:, ['col_1', 'col_2', 'col_3']]
After that, you can rely on that particular order of columns and access them by integer indices.
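As a sketch of the same idea applied at construction time: the columns argument of pd.DataFrame fixes the order no matter how the dict is iterated. The variable names data, wanted and book_order are illustrative assumptions; the values come from the lefth example in post #3.

```python
import numpy as np
import pandas as pd

data = {
    'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'key2': [2000, 2001, 2002, 2001, 2002],
    'data': np.arange(5.),
}

# State the column order explicitly instead of relying on dict ordering.
wanted = ['key1', 'key2', 'data']
lefth = pd.DataFrame(data, columns=wanted)
print(list(lefth.columns))  # ['key1', 'key2', 'data']

# The ordering shown in the book can be requested the same way.
book_order = pd.DataFrame(data, columns=['data', 'key1', 'key2'])
print(list(book_order.columns))  # ['data', 'key1', 'key2']

# With the order pinned, position-based access is predictable.
first_column = lefth.iloc[:, 0]  # always key1
```

With the order stated explicitly, the result should not depend on the Python version in use.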
Reply
#5
Thank you very much.

By the way, what do you think is the cause of the re-ordering between Ohio and Colorado from Step 235 to 236?

In [234]: df                                                                                              
Out[234]: 
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

In [235]: df.unstack('state')                                                                                                                                                        
Out[235]: 
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

In [236]: df.unstack('state').stack('side')                                                               
Out[236]: 
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
Reply
#6
I cannot say exactly, but I think this is because a sorting operation is applied to the index used in .unstack or .stack.
If you look at the _Unstacker implementation, you will find that sorting operations are applied in several places in the code.
This is likely the cause of the reordering.
Reply
#7
Thanks scidam. In this case, what is the best way to change the order back to the original? So as long as I refer to the columns by name (the preferred method) or, for index-based access, first use something like:
df = df.loc[:, ['col_1', 'col_2', 'col_3']]
, I will not need to worry about python doing strange things automatically and unexpectedly behind my back?
Reply
#8
(Feb-14-2020, 02:21 PM)new_to_python Wrote: I will not need to worry about python doing strange things automatically and unexpectedly behind my back?
I don't think you should treat this as mysterious Python behavior. Before Python 3.7, dictionaries were considered unordered key/value containers, so you could not rely on item order in them. The stack and unstack reordering concerns pandas and how those methods are implemented. As you noted above, if you really need a particular order, you can always reorder the columns manually.
Reply
#9
Thanks. I am trying to manually reorder the columns of the DataFrame from In [236] of post #5. I did:

A = df.unstack('state').stack('side') 
A = A[['Ohio', 'Colorado']]
But I got ['Colorado'] not in index error. Could you please tell me how to fix it?
Reply
#10
(Feb-15-2020, 02:05 PM)new_to_python Wrote: But I got ['Colorado'] not in index error. Could you please tell me how to fix it?
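In the output of post #5, A has a MultiIndex on the rows (number, side), but its columns look like a single-level Index named state, so reindexing the columns by label should be enough:

A_new = A.reindex(columns=['Ohio', 'Colorado'])

If the columns of your frame really are a MultiIndex, pass the level to match on; note that the keyword is level, not levels:

A_new = A.reindex(['Ohio', 'Colorado'], axis=1, level=1)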
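A sketch that rebuilds a frame shaped like the one in post #5 and restores the column order explicitly. The way df is built here (MultiIndex.from_product, np.arange) and the names wide, target and wide_new are assumptions for illustration. The order that stack and unstack produce may differ between pandas versions, so the sketch does not rely on it.

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one in post #5 (this construction is an assumption).
idx = pd.MultiIndex.from_product(
    [['Ohio', 'Colorado'], ['one', 'two', 'three']],
    names=['state', 'number'],
)
df = pd.DataFrame(
    {'left': np.arange(6), 'right': np.arange(5, 11)},
    index=idx,
)
df.columns.name = 'side'

A = df.unstack('state').stack('side')

# A's columns are a single-level Index named 'state'; ask for the order explicitly.
A_new = A.reindex(columns=['Ohio', 'Colorado'])
print(list(A_new.columns))  # ['Ohio', 'Colorado']

# When the columns are a MultiIndex (as after unstack alone), one unambiguous
# option is to spell out the complete target column index.
wide = df.unstack('state')
target = pd.MultiIndex.from_product(
    [['left', 'right'], ['Ohio', 'Colorado']],
    names=['side', 'state'],
)
wide_new = wide.reindex(columns=target)
```

Reindexing only changes the order of the columns; the values stay attached to their labels, so A_new.loc[('one', 'left'), 'Ohio'] is still 0.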
Reply

