extract table from multiple pages - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: extract table from multiple pages (/thread-37870.html) |
extract table from multiple pages - sshree43 - Jul-31-2022 Hi Expert, I am trying to get table extract from multiple pdf pages but i am getting only 2 pages and page header currently(Source PDF(test.pdf),output.csv file, codetext.txt are added as attachment Expectation: it should read the entire data from PDF. Currently it is reading only partial data Here is my code import tabula import requests import csv import pandas as pd import re import parse import pdfplumber from collections import namedtuple import datetime from datetime import date import os import glob import shutil from os import path # using pdminer i am extracting all the post name , grade name and month repporting to add to this cleaned data frame. # ------------------------------------File name file = "C:\\Users\\xxx\\Downloads\\test.pdf" lines = [] pnames = [] gnames = [] mreports = [] with pdfplumber.open(file) as pdf: for page in pdf.pages: try: text = page.extract_text() except: text = '' if text is not None: liness = text.split('\n') lines += liness for li in lines: if "Port:" in li: li = li.replace("Port:", "").strip() li_new = li.split("Month Reporting:")[-0].strip() m_repor = li.split("Month Reporting:")[-1].strip() if "Grade Name:" in li_new: g_name = li_new.split("Grade Name:")[-1].strip() p_name = li_new.split("Grade Name:")[0].strip() print(li_new) else: g_name = li_new.split()[1:] g_name = ' '.join(g_name).strip() p_name = li_new.split()[0].strip() pnames.append(p_name) gnames.append(g_name) mreports.append(m_repor) print("PortName: ", len(pnames)) print("GradeName: ", len(gnames)) print("MonthReporting: ", len(mreports)) # i am using tabula to extract all the tables from pdf and this table is cleaned for final joining. df = tabula.read_pdf(file, pages='all') final_list = [ ["PORT NAME", "GRADE NAME", "MONTH REPORTING", "BL DATE", "VESSEL", "DESTINATION", "CHARTERERS", "API"]] # final_list=[] print(final_list) last_df = len(df) print("Length of tables: ", last_df) for i in range(0, len(pnames)): op_df = df[i] op_df = op_df.dropna(how='all') op_df_list = op_df.values.tolist() for li in op_df_list: if str(li[0]) == "nan": li = li[1:] else: print("check this case") print(li) li.insert(0, pnames[i]) li.insert(1, gnames[i]) li.insert(2, mreports[i]) print(li) if "BL Date" in li: pass else: final_list.append(li) df_2 = pd.DataFrame(final_list) df_2.columns = df_2.iloc[0] df_2 = df_2[1:] max_row=len(df_2) curr_date = datetime.datetime.now() created_date = curr_date.strftime('%d-%b-%y') for row in range(max_row): df_2['created_by'] = 'created by' df_2['created_date'] = created_date print(df_2) df_2.rename( columns={'PORT NAME': 'port_name', 'GRADE NAME': 'crude', 'MONTH REPORTING': 'reporting_month', 'BL DATE': 'bl_date', 'VESSEL': 'vessel', 'DESTINATION': 'destination', 'CHARTERERS': 'charterer', 'API': 'api'}, inplace=True) df_2 = df_2.reindex( columns=["port_name", "crude", "reporting_month", "bl_date", "vessel", "destination", "Charterer", "api"]) # return df_2 df_2.to_csv('Outputfile.csv', index=False) print("Sucessfully generated output CSV") RE: extract table from multiple pages - sshree43 - Jul-31-2022 yes the complete code is posted till output RE: extract table from multiple pages - snippsat - Jul-31-2022 Maybe you do this more complicated than it need to be. Here a test and i put 3 first pages into a Pandas DataFrame. The last tree pages are different,so i would add API later if needed.If change lst[:3] to lst than all pages will there but API column will be under Date.import pdfplumber import pandas as pd pdf_file = "test.pdf" with pdfplumber.open(pdf_file) as pdf: lst = [p.extract_table() for p in pdf.pages] flat_list = [item for sublist in lst[:3] for item in sublist] df = pd.DataFrame(flat_list) df.columns = df.iloc[0] df = df[1:] >>> df 0 BL Date Vessel Destination CHARTERERS 1 6/Jan/2022 Test1 Test2 Test3 2 10/Jan/2022 Test2 Test3 Test4 3 18/Jan/2022 Test3 Test4 Test5 4 23/Jan/2022 Test4 Test5 Test6 5 28/Jan/2022 Test5 Test6 Test7 .. ... ... ... ... 139 6/May/2022 Test139 Test140 Test141 140 6/May/2022 Test140 Test141 Test142 141 14/May/2022 Test141 Test142 Test143 142 23/May/2022 Test142 Test143 Test144 143 29/May/2022 Test143 Test144 Test145 [143 rows x 4 columns] >>> df.dtypes 0 BL Date object # Need to change to Pandas datetime64 Vessel object Destination object CHARTERERS object dtype: object >>> df['BL Date'].head() 1 6/Jan/2022 2 10/Jan/2022 3 18/Jan/2022 4 23/Jan/2022 5 28/Jan/2022 Name: BL Date, dtype: objectLast tree pages would be. Change to lst[3:] .>>> df 0 API 1 10.00 2 10.00 3 10.00 4 10.00 5 10.00 .. ... 139 10.00 140 10.00 141 10.00 142 10.00 143 10.00 [143 rows x 1 columns] RE: extract table from multiple pages - sshree43 - Jul-31-2022 Hi Expert, when i added some more tables in pdf then its error out and also headers are missing into it newpdf file attached with few more tables newinputdata.pdf RE: extract table from multiple pages - sshree43 - Aug-01-2022 suggestion please RE: extract table from multiple pages - snippsat - Aug-01-2022 (Jul-31-2022, 08:58 PM)sshree43 Wrote: when i added some more tables in pdf then its error out and also headers are missing into itIf you make the pdf should try to keep all info in the tables then it easier to parse. Now is text header between tables,then need to use .extract_text() to get header info as .extract_table() will not get that info.Just running code i posted before with newinputdata.pdf get all tables put not header text between >>> df 0 BL Date Vessel Destination CHARTERERS API 1 6-Jan-2022 Test1 Test2 Test3 10.00 2 6-Jan-2022 Test1 Test2 Test3 10.00 3 6-Jan-2022 Test1 Test2 Test3 10.00 4 6-Jan-2022 Test1 Test2 Test3 10.00 5 6-Jan-2022 Test1 Test2 Test3 10.00 .. ... ... ... ... ... 70 6-Jan-2022 Test1 Test2 Test3 10.00 71 6-Jan-2022 Test1 Test2 Test3 10.00 72 6-Jan-2022 Test1 Test2 Test3 10.00 73 6-Jan-2022 Test1 Test2 Test3 10.00 74 6-Jan-2022 Test1 Test2 Test3 10.00 [74 rows x 5 columns]If use . extract_text() will get all,but have to clean up that i will not look into now.import pdfplumber pdf_file = "newinputdata.pdf" with pdfplumber.open(pdf_file) as pdf: pages = pdf.pagesTest usage of code over. >>> pages [<Page:1>, <Page:2>] >>> pages[0] <Page:1> >>> pages[0].extract_text() ('Port: Test Grade Name: ' 'Testnew Month ' 'Reporting: Jan-22\n' 'BL Date Vessel Destination CHARTERERS API\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '10-Jan-2022 Test2 Test3 Test4 10.00\n' '18-Jan-2022 Test3 Test4 Test5 10.00\n' '23-Jan-2022 Test4 Test5 Test6 10.00\n' 'Port: Test 1333 2 Grade Name: Testnew ' '1 Month ' 'Reporting: Jan-24\n' 'BL Date Vessel Destination API CHARTERERS\n' '6-Jan-2022 Test1 Test2 10.00 Test3\n' 'Port: Test 1333 2 Grade Name: Testnew ' '1 Month ' 'Reporting: Jan-24\n' 'BL Date Vessel Destination API CHARTERERS\n' '6-Jan-2022 Test1 Test2 10.00 Test3\n' 'Port: Test 1333 2 Grade Name: Testnew ' '1 Month ' 'Reporting: Jan-24\n' 'BL Date Vessel Destination API CHARTERERS\n' '6-Jan-2022 Test1 Test2 10.00 Test3\n' '6-Jan-2022 Test1 Test2 10.00 Test3\n' 'Port: Test 1333 2 Grade Name: Testnew ' '1 Month ' 'Reporting: Jan-24\n' 'BL Date Vessel Destination API CHARTERERS\n' '6-Jan-2022 Test1 Test2 10.00 Test3\n' 'Port: Test Grade Name: ' 'Testnew Month ' 'Reporting: Jan-22\n' 'BL Date Vessel Destination CHARTERERS API\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '10-Jan-2022 Test2 Test3 Test4 10.00\n' 'Port: Test Grade Name: ' 'Testnew Month ' 'Reporting: Jan-22\n' 'BL Date Vessel Destination CHARTERERS API\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00\n' '6-Jan-2022 Test1 Test2 Test3 10.00') RE: extract table from multiple pages - sshree43 - Aug-01-2022 Hi Expert, Sorry, but i do not understand your code. Can you please write exact code for header and multiple table available in pdf. that is in newinputdata.pdf that sit RE: extract table from multiple pages - sshree43 - Aug-02-2022 Any other suggestion please RE: extract table from multiple pages - arvin - Dec-12-2022 Hello. I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data, all the data comes in one cell only and I want in separate cells. How could I do that? For your reference: "15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr "11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr "11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr Thanks in advance. |