Camelot not able to change column header

Camelot not able to change column header - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Camelot not able to change column header (/thread-44147.html)

Camelot not able to change column header - shortmeister1 - Mar-21-2025

Good morning clever people.

Firstly, I'm a newbie to Python, with a background in VBA, Power Query, SQL etc. etc.

I've read a few books and am trying to put it into practice with a real-world personal project that I wrote a couple of years ago in DAX to extract data from my unhelpfully variable pdf bank statements. After various shouting matches with the drunken fool called ChatGPT, I got the program up and running on a single pdf. Here's where it starts to unravel. Each time I run the initial extraction subroutine on a second pdf, it gives me the following column headers:

['Date', 'TransType', 'Description', 'Paid In', 'Paid Out', '7'] rather than ['Date', 'TransType', 'Description', 'Paid In', 'Paid Out', 'Balance']

I do not know how to fix this, although I think I know the cause. The bank statement has several tables ranging from small tables Name, Address, Previous Balance etc. (all stuff that I successfully drop in subsequent cleaning routines) to the actual statement with the columns as listed above. I think it's setting the columns from the first smaller table that it hits and then Camelot gets confused and won't let me change the df headers.

This is what the "working" extraction routine looks like:

# Extract PDF tables using Camelot
def extract_from_pdfs(pdf_folder):
    combined_df = pd.DataFrame()
    pattern = re.compile(r'NatWestCurrent\d+\.pdf', re.IGNORECASE)

    for filename in os.listdir(pdf_folder):
        if pattern.match(filename):
            pdf_path = os.path.join(pdf_folder, filename)
            print(f'Processing {filename}')

            tables = camelot.read_pdf(pdf_path, pages='all', strip_text='\n', flavor='stream')

            for table in tables:
                temp_df = table.df
                # Add source filename to dataframe
                temp_df.insert(0, 'Source', filename)
                # Append to master dataframe
                combined_df = pd.concat([combined_df, temp_df], ignore_index=True)

            # Overwrite column names with integer ordinals
            combined_df.columns = range(temp_df.shape[1])
            #Rename ordinals to proper column names
            combined_df = combined_df.rename(columns={0: 'Source', 1: 'Date', 2: 'TransType', 3: 'Description', 4: 'Paid In', 5: 'Paid Out', 6: 'Balance'})

    return combined_df

In order to accommodate the second pdf, ChatGPT took several hours to change this to:

def extract_from_pdfs(pdf_folder):
    combined_df = pd.DataFrame()
    pattern = re.compile(r'NatWestCurrent\d+\.pdf', re.IGNORECASE)

    for filename in os.listdir(pdf_folder):
        if pattern.match(filename):
            pdf_path = os.path.join(pdf_folder, filename)
            print(f'Processing {filename}')

            tables = camelot.read_pdf(pdf_path, pages='all', strip_text='\n', flavor='stream')

            for table in tables:
                temp_df = table.df

                # Add blank columns if fewer than 6 columns
                while temp_df.shape[1] < 6:
                    temp_df[temp_df.shape[1]] = ''

                # Force to exactly 6 columns
                temp_df = temp_df.iloc[:, :6]

                # Add source filename column (now total 7 columns)
                temp_df.insert(0, 'Source', filename)

                # Drop junk rows (TransType = 'Description' and Balance = '7')
                temp_df = temp_df[
                    ~((temp_df.iloc[:, 2] == 'Description') & (temp_df.iloc[:, 6] == '7'))
                ].reset_index(drop=True)

                # Append to combined DataFrame
                combined_df = pd.concat([combined_df, temp_df], ignore_index=True)

    # Set final column names
    combined_df.columns = ['Source', 'Date', 'TransType', 'Description', 'Paid In', 'Paid Out', 'Balance']

    return combined_df

Much to the drunken AI's surprise, well over 4 hours of asking it to fix its code has failed to produce any improvement.

I've ground to a halt here and would really appreciate some advice on how to change the df column headers.

Thanks
Martin

RE: Camelot not able to change column header - deanhystad - Mar-21-2025

AI's expect us to demonstrate intelligence too. Garbage in, garbage out as the old adage says.

Can you describe what it is that you are trying to do? The code says you want to make a dataframe from information in multiple PDF files. Is that correct? If so, how are these PDF files related?

If a PDF has multiple tables, do you want information from each of the tables or only a particular table. If the latter, how can that table be identified? If the prior, how are the different tables related?

RE: Camelot not able to change column header - Pedroski55 - Mar-22-2025

Post a couple of example pdfs. We can try to do what you want.

I have never tried using camelot.

pymupdf, aka fitz is very good at getting tables from pdfs, you might try that.

pdfs are tricky, as you have discovered!

An example of pymupdf at work from an old question here:

import pymupdf

# a 1 page pdf in Cyrillic text
path2pdf = '/home/pedro/Downloads/a.pdf'

doc = pymupdf.open(path2pdf)
for page in doc:
    # tabs will be a list of lists representing the table each row is a list
    tabs = page.find_tables()
    if tabs.tables:
        print(tabs[0].extract())

tab = tabs[0]
# never did get this to display the Cyrillic text correctly as you see it in the pdf!
for line in tab.extract(encoding=pymupdf.TEXT_ENCODING_CYRILLIC):  # print cell text for each row
    print(f'This line has {len(line)} cells')
    print(line)

For example, this was the first line:

Output:This line has 3 cells
['01 О ɮиɪɦ еɧɧɨɦ ɧа иɦе ɧɨва ɧии (ɧа иɦе ɧɨва ɧии) заɫ ɬɪɨй ɳи ка, ɦеɫ ɬе ɧа ɯɨж де ɧиɹ заɫ ɬɪɨй ɳи ка, ɪе жиɦе\nегɨ ɪа бɨɬɵ, ɧɨɦ еɪе ɬе леɮɨ ɧа, ад ɪе ɫе ɨɮи ɰи алɶ ɧɨ гɨ ɫай ɬа заɫ ɬɪɨй ɳи ка и ад ɪе ɫе ɷлекɬ ɪɨɧ ɧɨй ɩɨɱ ɬɵ (ɩɪи\nɧа лиɱии), ɮа ɦилии, иɦе ɧи, ɨɬ ɱеɫ ɬве (еɫ ли иɦе еɬ ɫɹ) ли ɰа, иɫ ɩɨлɧɹ ɸɳе гɨ ɮɭɧ кɰии еди ɧɨлиɱ ɧɨ гɨ иɫ -\nɩɨлɧи ɬелɶ ɧɨ гɨ ɨɪ га ɧа заɫ ɬɪɨй ɳи ка, а ɬак же ɨб иɧ ди видɭа ли зиɪɭ ɸɳеɦ заɫ ɬɪɨй ɳи ка кɨɦ ɦеɪɱеɫ кɨɦ ɨбɨз -\nɧаɱ еɧии, еɫ ли заɫ ɬɪɨй ɳик ɩла ɧиɪɭе ɬ иɫ ɩɨлɶ зɨ ваɬɶ ɬа кɨе кɨɦ ɦеɪɱеɫ кɨе ɨбɨз ɧаɱ еɧие в ɪек ла ɦе, ɫвɹ заɧ -\nɧɨй ɫ ɩɪивл еɱ еɧи еɦ де ɧеж ɧɵɯ ɫɪедɫɬв ɭɱаɫ ɬɧи кɨв дɨл евɨ гɨ ɫɬɪɨ иɬелɶ ɫɬва', None, None]