Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Pandas question
#1
Hi,
I now have been handed a bunch of legacy excel files.
The author had the good idea to put two header rows, but both have column names i want.
So I do:
xls = panda.read_excel(exel,header = 1,sheet_name = 'Sheet1')
But: some headers take up the 2 rows (0 and 1), others have something in row zero, and something in row 1 , like this:
..... | Family Name and firstnames..... |
NR | Name | Firstname | second name | etc...

When I specify header = 0, I can't get the second row names , when header = 1 the "NR" field is not recognised.
This happens 4 times. in the header rows. All excels are like this.
Remedy ?
Or skip the headers?
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#2
It sounds like the headers in your Excel files have some inconsistencies in their formatting, which is causing issues when trying to read them into Pandas. One possible solution is to use the skiprows parameter in pandas.read_excel() to skip the first row of headers, and then manually add the second row of headers using the columns parameter.

Here's an example:

import pandas as pd

# Read in the Excel file, skipping the first row of headers
df = pd.read_excel(excel_file, sheet_name='Sheet1', skiprows=[0])

# Manually specify the second row of headers using the 'columns' parameter
df.columns = ['NR', 'Name', 'Firstname', 'Second Name', 'etc...']

# Do whatever data cleaning or analysis you need to do on the resulting DataFrame
This should allow you to read in the Excel files while handling the inconsistencies in the header formatting. If you encounter additional issues with the headers, you may need to adjust the skiprows and columns parameters accordingly.

Alternatively, if you don't need the header information for your analysis, you could simply skip the headers altogether by setting header=None in pandas.read_excel(). This will result in Pandas automatically generating column names for you (e.g. "Unnamed: 0", "Unnamed: 1", etc.), which you can then rename as needed using df.rename().
Reply
#3
Is your question that you have multi-indexed columns and you don't know how to use them? To index multi-indexed columns, provide a value for each index. Thes example has two index values per column.
import pandas as pd
import numpy as np

col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"], ["a", "b", "c", "a", "b", "c"]]
)
data = np.array(range(2 * 3 * 4)).reshape((4, 2 * 3))

df = pd.DataFrame(data, columns=col)
print(df)

print(df[("two", "a")])
Output:
one two a b c a b c 0 0 1 2 3 4 5 1 6 7 8 9 10 11 2 12 13 14 15 16 17 3 18 19 20 21 22 23 0 3 1 9 2 15 3 21 Name: (two, a), dtype: int32
To verify this works when reading from an excel file, I wrote the dataframe to excel and read the file using this progrram.
import pandas as pd

df = pd.read_excel("test.xlsx", header=(0, 1))
print(df)

print(df[("two", "a")])
Output:
one two a b c a b c 0 0 1 2 3 4 5 1 6 7 8 9 10 11 2 12 13 14 15 16 17 3 18 19 20 21 22 23 0 3 1 9 2 15 3 21 Name: (two, a), dtype: int64
You could also collapse the columns to be single-indexed.
def collapse_header(columns):
    """Collapse multi-index column indices to a list of strings."""
    return [" ".join(column).strip() for column in columns]


df = pd.read_excel("test.xlsx", header=(0, 1))
print(df)

df.columns = collapse_header(df.columns)
print(df)

print(df["two a"])
Output:
one two a b c a b c 0 0 1 2 3 4 5 1 6 7 8 9 10 11 2 12 13 14 15 16 17 3 18 19 20 21 22 23 one a one b one c two a two b two c 0 0 1 2 3 4 5 1 6 7 8 9 10 11 2 12 13 14 15 16 17 3 18 19 20 21 22 23 0 3 1 9 2 15 3 21 Name: two a, dtype: int64
Reply
#4
This is an old question, the problem was:
- The original author created "good looking" column headers: 2 rows with nice colors.
- But columns need only 1 header name:
So: sometimes it was in row 0, sometimes in row 1.
Sometimes in both rows, when the top row title "groups" a few columns from row 1. (Like "name", groups "first name, family name" below)

I had speculated you could explain to pandas which header name to pick alternatively from row 0 or 1, but that does not work.
As suggested manual correction is always possible, but not for a few hundred excels.

I skipped the headers, (re)named them programmatically, and did not look back.
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  How is pandas modifying all rows in an assignment - python-newbie question markm74 1 707 Nov-28-2023, 10:36 PM
Last Post: deanhystad
  Question on pandas.dataframe merging two colums shomikc 4 849 Jun-29-2023, 11:30 AM
Last Post: snippsat

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020