Pandas question

DPaul · Mar-27-2023, 03:32 PM

Hi,
I now have been handed a bunch of legacy excel files.
The author had the good idea to put two header rows, but both have column names i want.
So I do:

xls = panda.read_excel(exel,header = 1,sheet_name = 'Sheet1')

But: some headers take up the 2 rows (0 and 1), others have something in row zero, and something in row 1 , like this:
..... | Family Name and firstnames..... |
NR | Name | Firstname | second name | etc...

When I specify header = 0, I can't get the second row names , when header = 1 the "NR" field is not recognised.
This happens 4 times. in the header rows. All excels are like this.
Remedy ?
Or skip the headers?
thx,
Paul

farshid · (This post was last modified: Apr-20-2023, 06:06 PM by buran.)

It sounds like the headers in your Excel files have some inconsistencies in their formatting, which is causing issues when trying to read them into Pandas. One possible solution is to use the skiprows parameter in pandas.read_excel() to skip the first row of headers, and then manually add the second row of headers using the columns parameter.

Here's an example:

import pandas as pd

# Read in the Excel file, skipping the first row of headers
df = pd.read_excel(excel_file, sheet_name='Sheet1', skiprows=[0])

# Manually specify the second row of headers using the 'columns' parameter
df.columns = ['NR', 'Name', 'Firstname', 'Second Name', 'etc...']

# Do whatever data cleaning or analysis you need to do on the resulting DataFrame

This should allow you to read in the Excel files while handling the inconsistencies in the header formatting. If you encounter additional issues with the headers, you may need to adjust the skiprows and columns parameters accordingly.

Alternatively, if you don't need the header information for your analysis, you could simply skip the headers altogether by setting header=None in pandas.read_excel(). This will result in Pandas automatically generating column names for you (e.g. "Unnamed: 0", "Unnamed: 1", etc.), which you can then rename as needed using df.rename().

**deanhystad** · (This post was last modified: Apr-20-2023, 08:42 PM by deanhystad.)

Is your question that you have multi-indexed columns and you don't know how to use them? To index multi-indexed columns, provide a value for each index. Thes example has two index values per column.

import pandas as pd
import numpy as np

col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"], ["a", "b", "c", "a", "b", "c"]]
)
data = np.array(range(2 * 3 * 4)).reshape((4, 2 * 3))

df = pd.DataFrame(data, columns=col)
print(df)

print(df[("two", "a")])

Output:  one         two
    a   b   c   a   b   c
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
0     3
1     9
2    15
3    21
Name: (two, a), dtype: int32

To verify this works when reading from an excel file, I wrote the dataframe to excel and read the file using this progrram.

import pandas as pd

df = pd.read_excel("test.xlsx", header=(0, 1))
print(df)

print(df[("two", "a")])

Output:  one         two
    a   b   c   a   b   c
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
0     3
1     9
2    15
3    21
Name: (two, a), dtype: int64

You could also collapse the columns to be single-indexed.

def collapse_header(columns):
    """Collapse multi-index column indices to a list of strings."""
    return [" ".join(column).strip() for column in columns]


df = pd.read_excel("test.xlsx", header=(0, 1))
print(df)

df.columns = collapse_header(df.columns)
print(df)

print(df["two a"])

Output:  one         two
    a   b   c   a   b   c
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
   one a  one b  one c  two a  two b  two c
0      0      1      2      3      4      5
1      6      7      8      9     10     11
2     12     13     14     15     16     17
3     18     19     20     21     22     23
0     3
1     9
2    15
3    21
Name: two a, dtype: int64

DPaul · Apr-22-2023, 05:51 AM

This is an old question, the problem was:
- The original author created "good looking" column headers: 2 rows with nice colors.
- But columns need only 1 header name:
So: sometimes it was in row 0, sometimes in row 1.
Sometimes in both rows, when the top row title "groups" a few columns from row 1. (Like "name", groups "first name, family name" below)

I had speculated you could explain to pandas which header name to pick alternatively from row 0 or 1, but that does not work.
As suggested manual correction is always possible, but not for a few hundred excels.

I skipped the headers, (re)named them programmatically, and did not look back.
Paul

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	How is pandas modifying all rows in an assignment - python-newbie question	markm74	1	707	Nov-28-2023, 10:36 PM Last Post: deanhystad
	Question on pandas.dataframe merging two colums	shomikc	4	849	Jun-29-2023, 11:30 AM Last Post: snippsat

Pandas question

User Panel Messages

Announcements