extract table from multiple pages

extract table from multiple pages - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: extract table from multiple pages (/thread-37870.html)

extract table from multiple pages - sshree43 - Jul-31-2022

Hi Expert,

I am trying to get table extract from multiple pdf pages but i am getting only 2 pages and page header currently(Source PDF(test.pdf),output.csv file, codetext.txt are added as attachment

Expectation: it should read the entire data from PDF. Currently it is reading only partial data

Here is my code

import tabula
import requests
import csv
import pandas as pd

import re
import parse
import pdfplumber
from collections import namedtuple
import datetime
from datetime import date
import os
import glob
import shutil
from os import path

# using pdminer i am extracting all the post name , grade name and month repporting to add to this cleaned data frame.


# ------------------------------------File name
file = "C:\\Users\\xxx\\Downloads\\test.pdf"

lines = []

pnames = []
gnames = []
mreports = []
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        try:
            text = page.extract_text()
        except:
            text = ''
        if text is not None:
            liness = text.split('\n')
            lines += liness

for li in lines:
    if "Port:" in li:
        li = li.replace("Port:", "").strip()
        li_new = li.split("Month Reporting:")[-0].strip()
        m_repor = li.split("Month Reporting:")[-1].strip()

        if "Grade Name:" in li_new:
            g_name = li_new.split("Grade Name:")[-1].strip()
            p_name = li_new.split("Grade Name:")[0].strip()
            print(li_new)
        else:
            g_name = li_new.split()[1:]
            g_name = ' '.join(g_name).strip()
            p_name = li_new.split()[0].strip()
        pnames.append(p_name)
        gnames.append(g_name)
        mreports.append(m_repor)
print("PortName: ", len(pnames))
print("GradeName: ", len(gnames))
print("MonthReporting: ", len(mreports))

# i am using tabula to extract all the tables from pdf and this table is cleaned for final joining.
df = tabula.read_pdf(file, pages='all')
final_list = [
    ["PORT NAME", "GRADE NAME", "MONTH REPORTING", "BL DATE", "VESSEL", "DESTINATION", "CHARTERERS", "API"]]
# final_list=[]
print(final_list)
last_df = len(df)
print("Length of tables: ", last_df)

for i in range(0, len(pnames)):
    op_df = df[i]
    op_df = op_df.dropna(how='all')
    op_df_list = op_df.values.tolist()

    for li in op_df_list:
        if str(li[0]) == "nan":
            li = li[1:]
        else:
            print("check this case")
            print(li)
        li.insert(0, pnames[i])
        li.insert(1, gnames[i])
        li.insert(2, mreports[i])
        print(li)
        if "BL Date" in li:
            pass
        else:
            final_list.append(li)
    df_2 = pd.DataFrame(final_list)
    df_2.columns = df_2.iloc[0]
    df_2 = df_2[1:]
    max_row=len(df_2)
    curr_date = datetime.datetime.now()
    created_date = curr_date.strftime('%d-%b-%y')
    for row in range(max_row):
        df_2['created_by'] = 'created by'
        df_2['created_date'] = created_date

    print(df_2)
    df_2.rename(
        columns={'PORT NAME': 'port_name', 'GRADE NAME': 'crude', 'MONTH REPORTING': 'reporting_month', 'BL DATE': 'bl_date',
                 'VESSEL': 'vessel', 'DESTINATION': 'destination',
                 'CHARTERERS': 'charterer', 'API': 'api'}, inplace=True)

    df_2 = df_2.reindex(
        columns=["port_name", "crude", "reporting_month", "bl_date", "vessel", "destination", "Charterer",
                 "api"])

    # return df_2


df_2.to_csv('Outputfile.csv', index=False)
print("Sucessfully generated output CSV")

RE: extract table from multiple pages - sshree43 - Jul-31-2022

yes the complete code is posted till output

RE: extract table from multiple pages - snippsat - Jul-31-2022

Maybe you do this more complicated than it need to be.
Here a test and i put 3 first pages into a Pandas DataFrame.
The last tree pages are different,so i would add API later if needed.
If change lst[:3] to lst than all pages will there but API column will be under Date.

import pdfplumber
import pandas as pd

pdf_file = "test.pdf"
with pdfplumber.open(pdf_file) as pdf:
    lst = [p.extract_table() for p in pdf.pages]

flat_list = [item for sublist in lst[:3] for item in sublist]
df = pd.DataFrame(flat_list)
df.columns = df.iloc[0]
df = df[1:]

>>> df
0        BL Date   Vessel Destination CHARTERERS
1     6/Jan/2022    Test1       Test2      Test3
2    10/Jan/2022    Test2       Test3      Test4
3    18/Jan/2022    Test3       Test4      Test5
4    23/Jan/2022    Test4       Test5      Test6
5    28/Jan/2022    Test5       Test6      Test7
..           ...      ...         ...        ...
139   6/May/2022  Test139     Test140    Test141
140   6/May/2022  Test140     Test141    Test142
141  14/May/2022  Test141     Test142    Test143
142  23/May/2022  Test142     Test143    Test144
143  29/May/2022  Test143     Test144    Test145

[143 rows x 4 columns]

>>> df.dtypes
0
BL Date        object # Need to change to Pandas datetime64
Vessel         object
Destination    object
CHARTERERS     object
dtype: object

>>> df['BL Date'].head()
1     6/Jan/2022
2    10/Jan/2022
3    18/Jan/2022
4    23/Jan/2022
5    28/Jan/2022
Name: BL Date, dtype: object

Last tree pages would be.
Change to lst[3:].

>>> df
0      API
1    10.00
2    10.00
3    10.00
4    10.00
5    10.00
..     ...
139  10.00
140  10.00
141  10.00
142  10.00
143  10.00

[143 rows x 1 columns]

RE: extract table from multiple pages - sshree43 - Jul-31-2022

Hi Expert,

when i added some more tables in pdf then its error out and also headers are missing into it

newpdf file attached with few more tables

newinputdata.pdf

RE: extract table from multiple pages - sshree43 - Aug-01-2022

suggestion please

RE: extract table from multiple pages - snippsat - Aug-01-2022

(Jul-31-2022, 08:58 PM)sshree43 Wrote: when i added some more tables in pdf then its error out and also headers are missing into it

If you make the pdf should try to keep all info in the tables then it easier to parse.
Now is text header between tables,then need to use .extract_text() to get header info as .extract_table() will not get that info.
Just running code i posted before with newinputdata.pdf get all tables put not header text between

>>> df
0      BL Date Vessel Destination CHARTERERS    API
1   6-Jan-2022  Test1       Test2      Test3  10.00
2   6-Jan-2022  Test1       Test2      Test3  10.00
3   6-Jan-2022  Test1       Test2      Test3  10.00
4   6-Jan-2022  Test1       Test2      Test3  10.00
5   6-Jan-2022  Test1       Test2      Test3  10.00
..         ...    ...         ...        ...    ...
70  6-Jan-2022  Test1       Test2      Test3  10.00
71  6-Jan-2022  Test1       Test2      Test3  10.00
72  6-Jan-2022  Test1       Test2      Test3  10.00
73  6-Jan-2022  Test1       Test2      Test3  10.00
74  6-Jan-2022  Test1       Test2      Test3  10.00

[74 rows x 5 columns]

If use .extract_text() will get all,but have to clean up that i will not look into now.

import pdfplumber

pdf_file = "newinputdata.pdf"
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages

Test usage of code over.

>>> pages
[<Page:1>, <Page:2>]
>>> pages[0]
<Page:1>
>>> pages[0].extract_text()
('Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '10-Jan-2022 Test2 Test3 Test4 10.00\n'
 '18-Jan-2022 Test3 Test4 Test5 10.00\n'
 '23-Jan-2022 Test4 Test5 Test6 10.00\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '10-Jan-2022 Test2 Test3 Test4 10.00\n'
 'Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00')

RE: extract table from multiple pages - sshree43 - Aug-01-2022

Hi Expert,
Sorry, but i do not understand your code. Can you please write exact code for header and multiple table available in pdf. that is in newinputdata.pdf that sit

RE: extract table from multiple pages - sshree43 - Aug-02-2022

Any other suggestion please

RE: extract table from multiple pages - arvin - Dec-12-2022

Hello.
I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data, all the data comes in one cell only and I want in separate cells.

How could I do that?

For your reference:
"15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
"11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
"11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr

Thanks in advance.