Nov-02-2021, 12:24 PM
Hi
I have code which works to extract data from tables in a PDF. The code puts the data into columns and transfers to a CSV file. The code works but I have a few problems I need some help with. From the first column in the table I needed to create a hierarchical system so I can filter the data to find specific items. I have attached a photo
I have a couple of problems with my code:
1. The level 1 data is using any data in UPPERCASE and splitting into a new column but returning items with numbers, how can i disregard numbers when using .isupper()
2. I need a level 2 but finding it difficult to get a code which can recognise bold text in the table and split that data into a column. Any ideas what i could use?
I have code which works to extract data from tables in a PDF. The code puts the data into columns and transfers to a CSV file. The code works but I have a few problems I need some help with. From the first column in the table I needed to create a hierarchical system so I can filter the data to find specific items. I have attached a photo
I have a couple of problems with my code:
1. The level 1 data is using any data in UPPERCASE and splitting into a new column but returning items with numbers, how can i disregard numbers when using .isupper()
2. I need a level 2 but finding it difficult to get a code which can recognise bold text in the table and split that data into a column. Any ideas what i could use?
# Determine hierarchy for i, row in df_combine.iterrows(): # Level 1: if its all in uppercase it is a new level 1 hierarchy if df_combine['Item'][i].isupper(): df_combine.loc[i, 'Level1'] = df_combine['Item'][i] # Otherwise use the previous level 1 heirarchy elif i>0: df_combine.loc[i, 'Level1'] = df_combine['Level1'][i-1] # Future development: logic to determine level 2 heirarchy # Level 3: If it's not all uppercase, but the first character is it is a level 3 heirarchy if (not df_combine['Item'][i].isupper()) & (df_combine['Item'][i][0].isupper()): try: # If the next 2 rows are all lower, but it doesn't have a rate: join it to the first row above if (not df_combine['Item'][i+1][0].isupper()) & (not df_combine['Total Rate£'][i+1]==df_combine['Total Rate£'][i+1]) & (not df_combine['Item'][i+2][0].isupper()) & (not df_combine['Total Rate£'][i+2]==df_combine['Total Rate£'][i+2]): df_combine.loc[i, 'Level3'] = df_combine['Item'][i] + ' ' + df_combine['Item'][i+1]+ ' ' + df_combine['Item'][i+2] # else if the next row is all lower, but it doesn't have a rate: join it to the row above elif (not df_combine['Item'][i+1][0].isupper()) & (not df_combine['Total Rate£'][i+1]==df_combine['Total Rate£'][i+1]): df_combine.loc[i, 'Level3'] = df_combine['Item'][i] + ' ' + df_combine['Item'][i+1] # else level 3 is just a one-liner else: df_combine.loc[i, 'Level3'] = df_combine['Item'][i] except: pass # If it doesn't have a level 3, use the one above for i, row in df_combine.iterrows(): if (not df_combine['Level3'][i]==df_combine['Level3'][i]) & (i>0): df_combine.loc[i, 'Level3'] = df_combine['Level3'][i-1]