 Does the order of columns in the DataFrame matter?
#1
Hi, I noticed that certain operations sometimes change the order of the columns in a DataFrame automatically. The same thing happened when I tried some examples from books: using the same commands, my DataFrames show their columns in a different order from the one printed in the book. Does the order of the columns in a DataFrame matter in Python/pandas/NumPy?
#2
Yes, it does.
Look at the following example

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5)) 
print(df.values)  # print corresponding numpy array
print(df[[2,1,3,4,0]].values) # reorder columns and print
The answer depends on how you access the data in the DataFrame. If you always access columns by name, e.g. df.loc[:, 'some_name'], and never by position, e.g. df.iloc[:, 4], you do not need to worry about the order of the columns.
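To make the difference concrete, here is a minimal sketch (the column names a, b and c are made up for illustration). The same column selected by name is identical before and after a reorder, while the same integer position points at a different column:

```python
import pandas as pd

# Toy frame; the column names are hypothetical.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Same data, columns in a different order.
shuffled = df[['c', 'a', 'b']]

# Name-based access ignores the column order.
by_name_same = df.loc[:, 'b'].equals(shuffled.loc[:, 'b'])

# Position 1 is 'b' in df but 'a' in shuffled.
by_pos_same = df.iloc[:, 1].equals(shuffled.iloc[:, 1])

print(by_name_same, by_pos_same)  # True False
```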
#3
Thanks. I came across the following example:

# Example 2

In [194]: lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
     ...:                       'key2': [2000, 2001, 2002, 2001, 2002], 
     ...:                       'data': np.arange(5.)})  
In [196]: lefth                                                                                           
Out[196]: 
     key1  key2  data
0    Ohio  2000   0.0
1    Ohio  2001   1.0
2    Ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0
As shown above, on my machine the columns are listed as key1, key2 and data, which matches the order in which I entered them in the pd.DataFrame call. However, the author of this example has the columns displayed as data, followed by key1 and key2, using the same command. How come? I don't remember exactly, but I think somebody mentioned that the columns can be arranged differently depending on the Python version used. Is this true?

Does that mean it is always better to access columns by name, because the column order could differ for reasons outside my control and people could get different results or even errors with index-based access?
#4
(Feb-13-2020, 05:12 PM)new_to_python Wrote: As shown above, on my machine the columns are listed as key1, key2 and data, which matches the order in which I entered them in the pd.DataFrame call. However, the author of this example has the columns displayed as data, followed by key1 and key2, using the same command. How come? I don't remember exactly, but I think somebody mentioned that the columns can be arranged differently depending on the Python version used. Is this true?

This depends on the implementation of the dict data structure in Python. Prior to Python 3.6, a dict was an unordered collection of (key, value) pairs.
So, when iterating over a dict, you could get the items in a different order (at least across Python versions and implementations), and this led to a different order of columns in the pandas DataFrame. Since CPython 3.6 (as an implementation detail) and Python 3.7 (as a language guarantee for every implementation), a dict preserves the insertion order of its items. As far as I know, pandas versions before 0.23 also sorted the dict keys when building a DataFrame, which would explain the alphabetical order data, key1, key2 in the book.
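A small sketch of this point, assuming Python 3.7+ and a reasonably recent pandas (the shortened data below is only illustrative). The columns follow the insertion order of the dict, and passing columns= explicitly pins the order so it does not depend on the versions installed:

```python
import pandas as pd

# Shortened version of the example data from post #3.
data = {
    'key1': ['Ohio', 'Nevada'],
    'key2': [2000, 2001],
    'data': [0.0, 1.0],
}

# Columns follow the dict's insertion order.
df = pd.DataFrame(data)
insertion_order = list(df.columns)

# An explicit columns= argument fixes the order regardless of the dict.
pinned = pd.DataFrame(data, columns=['data', 'key1', 'key2'])
pinned_order = list(pinned.columns)

print(insertion_order)
print(pinned_order)
```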

In general, to be sure the order of columns is correct, you can always do:
df = df.loc[:, ['col_1', 'col_2', 'col_3']]
After that, you can rely on this particular column order and access the columns by integer position.
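For example, a minimal sketch with made-up data (only the col_1, col_2, col_3 names come from the line above). It fixes the order first and then uses positions; df.columns.get_loc is an alternative when you would rather look the position up by name than hard-code it:

```python
import pandas as pd

# Hypothetical frame whose columns arrive in an unexpected order.
df = pd.DataFrame({'col_3': [7, 8], 'col_1': [1, 2], 'col_2': [4, 5]})

# Fix the order explicitly.
df = df.loc[:, ['col_1', 'col_2', 'col_3']]

# Position 0 is now guaranteed to be col_1.
first = df.iloc[:, 0].tolist()

# Look up a position by name instead of hard-coding it.
pos = df.columns.get_loc('col_3')
last = df.iloc[:, pos].tolist()

print(first, pos, last)  # [1, 2] 2 [7, 8]
```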
#5
Thank you very much.

By the way, what do you think causes the reordering of Ohio and Colorado between steps 235 and 236?

In [234]: df                                                                                              
Out[234]: 
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

In [235]: df.unstack('state')                                                                                                                                                        
Out[235]: 
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

In [236]: df.unstack('state').stack('side')                                                               
Out[236]: 
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
#6
I cannot answer exactly, but I think this is because a sorting operation is applied to the index inside .unstack or .stack.
If you look at the _Unstacker implementation, you will find sorting operations applied in several places in the code.
This is likely the cause of the reordering.
#7
Thanks scidam. In this case, what is the best way to change the order back to the original? So, as long as I refer to the columns by name (the preferred method) or, for index-based access, first use something like:
df = df.loc[:, ['col_1', 'col_2', 'col_3']]
I will not need to worry about Python doing strange things automatically and unexpectedly behind my back?
#8
(Feb-14-2020, 02:21 PM)new_to_python Wrote: I will not need to worry about python doing strange things automatically and unexpectedly behind my back?
I don't think you should consider this mysterious Python behavior. Dictionaries were treated as unordered key-value containers, so you cannot rely on item order in dictionaries before Python 3.7. The stack and unstack behavior comes from pandas and how these methods are implemented. As you noted above, if you really need a particular order, you can always reorder the columns manually.
#9
Thanks. I am trying to reorder the columns of DataFrame in line[236] of Post #5 manually. I did:

A = df.unstack('state').stack('side') 
A = A[['Ohio', 'Colorado']]
But I got a "['Colorado'] not in index" error. Could you please tell me how to fix it?
#10
(Feb-15-2020, 02:05 PM)new_to_python Wrote: But I got ['Colorado'] not in index error. Could you please tell me how to fix it?
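In the output from post #5, the MultiIndex of A is on the rows (number, side), while the columns are a single-level index named state. So you can reindex along the column axis. Note that the reindex keyword for a MultiIndex level is level, not levels, and it is not needed here. If the selection still fails, check A.columns for a label that does not match exactly.

A_new = A.reindex(['Ohio', 'Colorado'], axis=1)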
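Here is a self-contained sketch of that reordering. The way df is built is an assumption, since the thread only shows its printed output; the values and names are taken from post #5. The column order that stack produces may vary between pandas versions, so the sketch does not rely on it and simply sets the order it wants:

```python
import pandas as pd

# Assumed reconstruction of the frame printed in post #5.
index = pd.MultiIndex(
    levels=[['Ohio', 'Colorado'], ['one', 'two', 'three']],
    codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
    names=['state', 'number'],
)
df = pd.DataFrame(
    {'left': list(range(6)), 'right': list(range(5, 11))},
    index=index,
)
df.columns.name = 'side'

A = df.unstack('state').stack('side')

# The columns of A have a single level (state).
n_levels = A.columns.nlevels

# Put the states back in the original order.
A_new = A.reindex(columns=['Ohio', 'Colorado'])

# Label-based lookups give the same values whatever the column order.
ohio_one_left = A_new.loc[('one', 'left'), 'Ohio']
colorado_three_right = A_new.loc[('three', 'right'), 'Colorado']

print(n_levels, list(A_new.columns), ohio_one_left, colorado_three_right)
```

Since the columns have only one level, A[['Ohio', 'Colorado']] should give the same result as the reindex call here.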