Python Forum

Extracting Data from tables
Hi

I have code that extracts data from tables in a PDF, puts the data into columns and writes it to a CSV file. The code works, but I have a few problems I need some help with. From the first column in the table, I need to create a hierarchy so I can filter the data to find specific items. I have attached a photo.

I have a couple of problems with my code:

1. The level 1 logic treats any item in UPPERCASE as a new level 1 entry and copies it into a new column, but it is also returning items that contain numbers. How can I disregard numbers when using .isupper()?

2. I need a level 2, but I am finding it difficult to write code that can recognise bold text in the table and split that data into its own column. Any ideas what I could use?
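On question 2, bold is normally not stored in the extracted text itself, so one idea is to look at the font name attached to each character. The sketch below is only an assumption about how that could work: it expects character records shaped like dictionaries with "text" and "fontname" keys (as far as I know, this is what pdfplumber exposes through page.chars), and it guesses that a bold font has "bold" somewhere in its name. The helper names, the threshold and the sample font names are hypothetical and would need checking against the real PDF.

```python
def bold_fraction(chars):
    """Return the fraction of non-space characters whose font name suggests bold."""
    # Ignore whitespace so spaces set in a regular font do not dilute the result.
    visible = [c for c in chars if c["text"].strip()]
    if not visible:
        return 0.0
    # Assumption: bold fonts carry "bold" in their name, e.g. "ABCDEF+Arial-BoldMT".
    bold = [c for c in visible if "bold" in c["fontname"].lower()]
    return len(bold) / len(visible)


def is_bold(chars, threshold=0.9):
    """Treat a cell as bold when most of its visible characters use a bold font."""
    return bold_fraction(chars) >= threshold


# Hypothetical character records for two table cells.
heading_chars = [
    {"text": "D", "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": "r", "fontname": "ABCDEF+Arial-BoldMT"},
    {"text": " ", "fontname": "ABCDEF+ArialMT"},
    {"text": "y", "fontname": "ABCDEF+Arial-BoldMT"},
]
body_chars = [
    {"text": "p", "fontname": "ABCDEF+ArialMT"},
    {"text": "i", "fontname": "ABCDEF+ArialMT"},
]

print(is_bold(heading_chars), is_bold(body_chars))
```

Rows whose 'Item' cell passes is_bold could then be copied into a Level2 column in the same way the uppercase rows feed Level1. Some PDFs fake bold by drawing the text twice or use font names without "bold" in them, in which case this heuristic would miss them.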

import pandas as pd

# Determine hierarchy
# Assumes df_combine has the default 0..n-1 integer index and every 'Item' is a non-empty string
def is_continuation(j):
    # True if row j exists, does not start with an uppercase character and has no rate
    if j not in df_combine.index:
        return False
    return (not df_combine.at[j, 'Item'][0].isupper()) and pd.isna(df_combine.at[j, 'Total Rate£'])

for i, row in df_combine.iterrows():
    item = row['Item']
    # Level 1: if it's all in uppercase it is a new level 1 hierarchy
    if item.isupper():
        df_combine.loc[i, 'Level1'] = item
    # Otherwise use the previous level 1 hierarchy
    elif i > 0:
        df_combine.loc[i, 'Level1'] = df_combine.at[i - 1, 'Level1']

    # Future development: logic to determine level 2 hierarchy

    # Level 3: if it's not all uppercase, but the first character is, it is a level 3 hierarchy
    if not item.isupper() and item[0].isupper():
        # If the next 2 rows are continuation rows without a rate: join both to this row
        if is_continuation(i + 1) and is_continuation(i + 2):
            df_combine.loc[i, 'Level3'] = item + ' ' + df_combine.at[i + 1, 'Item'] + ' ' + df_combine.at[i + 2, 'Item']
        # Else if only the next row is a continuation row without a rate: join it to this row
        elif is_continuation(i + 1):
            df_combine.loc[i, 'Level3'] = item + ' ' + df_combine.at[i + 1, 'Item']
        # Else level 3 is just a one-liner
        else:
            df_combine.loc[i, 'Level3'] = item

# If it doesn't have a level 3, use the one above
df_combine['Level3'] = df_combine['Level3'].ffill()
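
On question 1, it may help to know that str.isupper() skips digits, spaces and punctuation and only checks the cased letters, so a string like "A1" counts as uppercase. If the aim is for items containing numbers never to become level 1, one possible sketch is below. It assumes a genuine level 1 heading never contains a digit, and is_level1_heading is a hypothetical helper name, not something from the code above.

```python
def is_level1_heading(text):
    """Return True for an all-uppercase item that contains no digits."""
    # Assumption: a real level 1 heading never contains a digit,
    # so any digit rules the item out straight away.
    if any(ch.isdigit() for ch in text):
        return False
    # str.isupper() ignores uncased characters such as spaces and "&",
    # and needs at least one cased character to return True.
    return text.isupper()


# Hypothetical sample items to show the difference from plain .isupper().
for sample in ["EXTERNAL WORKS", "A1", "Excavation", "M2"]:
    print(sample, sample.isupper(), is_level1_heading(sample))
```

The level 1 branch of the loop could then test is_level1_heading(...) on the 'Item' value instead of calling .isupper() directly.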