Camelot not able to change column header

shortmeister1 · Mar-21-2025, 06:36 AM

Good morning clever people.

Firstly, I'm a newbie to Python, with a background in VBA, Power Query, SQL etc. etc.

I've read a few books and am trying to put it into practice with a real-world personal project that I wrote a couple of years ago in DAX to extract data from my unhelpfully variable pdf bank statements. After various shouting matches with the drunken fool called ChatGPT, I got the program up and running on a single pdf. Here's where it starts to unravel. Each time I run the initial extraction subroutine on a second pdf, it gives me the following column headers:

['Date', 'TransType', 'Description', 'Paid In', 'Paid Out', '7'] rather than ['Date', 'TransType', 'Description', 'Paid In', 'Paid Out', 'Balance']

I do not know how to fix this, although I think I know the cause. The bank statement has several tables ranging from small tables Name, Address, Previous Balance etc. (all stuff that I successfully drop in subsequent cleaning routines) to the actual statement with the columns as listed above. I think it's setting the columns from the first smaller table that it hits and then Camelot gets confused and won't let me change the df headers.

This is what the "working" extraction routine looks like:

# Extract PDF tables using Camelot
def extract_from_pdfs(pdf_folder):
    combined_df = pd.DataFrame()
    pattern = re.compile(r'NatWestCurrent\d+\.pdf', re.IGNORECASE)

    for filename in os.listdir(pdf_folder):
        if pattern.match(filename):
            pdf_path = os.path.join(pdf_folder, filename)
            print(f'Processing {filename}')

            tables = camelot.read_pdf(pdf_path, pages='all', strip_text='\n', flavor='stream')

            for table in tables:
                temp_df = table.df
                # Add source filename to dataframe
                temp_df.insert(0, 'Source', filename)
                # Append to master dataframe
                combined_df = pd.concat([combined_df, temp_df], ignore_index=True)

            # Overwrite column names with integer ordinals
            combined_df.columns = range(temp_df.shape[1])
            #Rename ordinals to proper column names
            combined_df = combined_df.rename(columns={0: 'Source', 1: 'Date', 2: 'TransType', 3: 'Description', 4: 'Paid In', 5: 'Paid Out', 6: 'Balance'})

    return combined_df

In order to accommodate the second pdf, ChatGPT took several hours to change this to:

def extract_from_pdfs(pdf_folder):
    combined_df = pd.DataFrame()
    pattern = re.compile(r'NatWestCurrent\d+\.pdf', re.IGNORECASE)

    for filename in os.listdir(pdf_folder):
        if pattern.match(filename):
            pdf_path = os.path.join(pdf_folder, filename)
            print(f'Processing {filename}')

            tables = camelot.read_pdf(pdf_path, pages='all', strip_text='\n', flavor='stream')

            for table in tables:
                temp_df = table.df

                # Add blank columns if fewer than 6 columns
                while temp_df.shape[1] < 6:
                    temp_df[temp_df.shape[1]] = ''

                # Force to exactly 6 columns
                temp_df = temp_df.iloc[:, :6]

                # Add source filename column (now total 7 columns)
                temp_df.insert(0, 'Source', filename)

                # Drop junk rows (TransType = 'Description' and Balance = '7')
                temp_df = temp_df[
                    ~((temp_df.iloc[:, 2] == 'Description') & (temp_df.iloc[:, 6] == '7'))
                ].reset_index(drop=True)

                # Append to combined DataFrame
                combined_df = pd.concat([combined_df, temp_df], ignore_index=True)

    # Set final column names
    combined_df.columns = ['Source', 'Date', 'TransType', 'Description', 'Paid In', 'Paid Out', 'Balance']

    return combined_df

Much to the drunken AI's surprise, well over 4 hours of asking it to fix its code has failed to produce any improvement.

I've ground to a halt here and would really appreciate some advice on how to change the df column headers.

Thanks
Martin

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Python create a spreadsheet with column and row header	ouruslife	4	4,052	Jul-09-2022, 11:01 AM Last Post: Pedroski55
	How to keep columns header on excel without change after export data to excel file?	ahmedbarbary	0	1,742	May-03-2022, 05:46 PM Last Post: ahmedbarbary
	Openpyxl-change value of cells in column based on value that currently occupies cells	phillipaj1391	5	15,707	Mar-30-2022, 11:05 PM Last Post: Pedroski55
	Want to replace last column header value	shantanu97	0	2,125	Mar-25-2021, 06:35 AM Last Post: shantanu97
	How to change row 2 to column header within a dataframe	sparkt	2	2,843	Aug-20-2020, 05:12 PM Last Post: sparkt
	How to change a dataframe column to lower case	zhujp98	5	9,628	Oct-29-2019, 06:42 PM Last Post: zhujp98
	How to switch table area coordinates in Python Camelot and Tabula-Py	john5	0	5,107	May-08-2019, 04:31 PM Last Post: john5
	change array column values without loop	khalidreemy	2	5,339	May-05-2019, 09:05 AM Last Post: DeaD_EyE
	Change Column Type	Talch	0	2,597	Aug-16-2018, 03:02 PM Last Post: Talch
	Change type of elements in a list by column	tkj80	7	8,358	Jan-04-2017, 11:15 PM Last Post: tkj80

Camelot not able to change column header

User Panel Messages

Announcements