extract table from multiple pages

sshree43 · (This post was last modified: Jul-31-2022, 08:14 PM by sshree43.)

Hi Expert,

I am trying to get table extract from multiple pdf pages but i am getting only 2 pages and page header currently(Source PDF(test.pdf),output.csv file, codetext.txt are added as attachment

Expectation: it should read the entire data from PDF. Currently it is reading only partial data

Here is my code

import tabula
import requests
import csv
import pandas as pd

import re
import parse
import pdfplumber
from collections import namedtuple
import datetime
from datetime import date
import os
import glob
import shutil
from os import path

# using pdminer i am extracting all the post name , grade name and month repporting to add to this cleaned data frame.


# ------------------------------------File name
file = "C:\\Users\\xxx\\Downloads\\test.pdf"

lines = []

pnames = []
gnames = []
mreports = []
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        try:
            text = page.extract_text()
        except:
            text = ''
        if text is not None:
            liness = text.split('\n')
            lines += liness

for li in lines:
    if "Port:" in li:
        li = li.replace("Port:", "").strip()
        li_new = li.split("Month Reporting:")[-0].strip()
        m_repor = li.split("Month Reporting:")[-1].strip()

        if "Grade Name:" in li_new:
            g_name = li_new.split("Grade Name:")[-1].strip()
            p_name = li_new.split("Grade Name:")[0].strip()
            print(li_new)
        else:
            g_name = li_new.split()[1:]
            g_name = ' '.join(g_name).strip()
            p_name = li_new.split()[0].strip()
        pnames.append(p_name)
        gnames.append(g_name)
        mreports.append(m_repor)
print("PortName: ", len(pnames))
print("GradeName: ", len(gnames))
print("MonthReporting: ", len(mreports))

# i am using tabula to extract all the tables from pdf and this table is cleaned for final joining.
df = tabula.read_pdf(file, pages='all')
final_list = [
    ["PORT NAME", "GRADE NAME", "MONTH REPORTING", "BL DATE", "VESSEL", "DESTINATION", "CHARTERERS", "API"]]
# final_list=[]
print(final_list)
last_df = len(df)
print("Length of tables: ", last_df)

for i in range(0, len(pnames)):
    op_df = df[i]
    op_df = op_df.dropna(how='all')
    op_df_list = op_df.values.tolist()

    for li in op_df_list:
        if str(li[0]) == "nan":
            li = li[1:]
        else:
            print("check this case")
            print(li)
        li.insert(0, pnames[i])
        li.insert(1, gnames[i])
        li.insert(2, mreports[i])
        print(li)
        if "BL Date" in li:
            pass
        else:
            final_list.append(li)
    df_2 = pd.DataFrame(final_list)
    df_2.columns = df_2.iloc[0]
    df_2 = df_2[1:]
    max_row=len(df_2)
    curr_date = datetime.datetime.now()
    created_date = curr_date.strftime('%d-%b-%y')
    for row in range(max_row):
        df_2['created_by'] = 'created by'
        df_2['created_date'] = created_date

    print(df_2)
    df_2.rename(
        columns={'PORT NAME': 'port_name', 'GRADE NAME': 'crude', 'MONTH REPORTING': 'reporting_month', 'BL DATE': 'bl_date',
                 'VESSEL': 'vessel', 'DESTINATION': 'destination',
                 'CHARTERERS': 'charterer', 'API': 'api'}, inplace=True)

    df_2 = df_2.reindex(
        columns=["port_name", "crude", "reporting_month", "bl_date", "vessel", "destination", "Charterer",
                 "api"])

    # return df_2


df_2.to_csv('Outputfile.csv', index=False)
print("Sucessfully generated output CSV")

Larz60+ write Jul-31-2022, 12:52 PM:
Please post all code, output and errors (it it's entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Fixed for you this time. Please use bbcode tags on future posts.

sshree43 · Jul-31-2022, 12:59 PM

yes the complete code is posted till output

***snippsat*** · (This post was last modified: Jul-31-2022, 06:43 PM by snippsat.)

Maybe you do this more complicated than it need to be.
Here a test and i put 3 first pages into a Pandas DataFrame.
The last tree pages are different,so i would add API later if needed.
If change lst[:3] to lst than all pages will there but API column will be under Date.

import pdfplumber
import pandas as pd

pdf_file = "test.pdf"
with pdfplumber.open(pdf_file) as pdf:
    lst = [p.extract_table() for p in pdf.pages]

flat_list = [item for sublist in lst[:3] for item in sublist]
df = pd.DataFrame(flat_list)
df.columns = df.iloc[0]
df = df[1:]

>>> df
0        BL Date   Vessel Destination CHARTERERS
1     6/Jan/2022    Test1       Test2      Test3
2    10/Jan/2022    Test2       Test3      Test4
3    18/Jan/2022    Test3       Test4      Test5
4    23/Jan/2022    Test4       Test5      Test6
5    28/Jan/2022    Test5       Test6      Test7
..           ...      ...         ...        ...
139   6/May/2022  Test139     Test140    Test141
140   6/May/2022  Test140     Test141    Test142
141  14/May/2022  Test141     Test142    Test143
142  23/May/2022  Test142     Test143    Test144
143  29/May/2022  Test143     Test144    Test145

[143 rows x 4 columns]

>>> df.dtypes
0
BL Date        object # Need to change to Pandas datetime64
Vessel         object
Destination    object
CHARTERERS     object
dtype: object

>>> df['BL Date'].head()
1     6/Jan/2022
2    10/Jan/2022
3    18/Jan/2022
4    23/Jan/2022
5    28/Jan/2022
Name: BL Date, dtype: object

Last tree pages would be.
Change to lst[3:].

>>> df
0      API
1    10.00
2    10.00
3    10.00
4    10.00
5    10.00
..     ...
139  10.00
140  10.00
141  10.00
142  10.00
143  10.00

[143 rows x 1 columns]

sshree43 · (This post was last modified: Jul-31-2022, 08:58 PM by sshree43.)

Hi Expert,

when i added some more tables in pdf then its error out and also headers are missing into it

newpdf file attached with few more tables

newinputdata.pdf

sshree43 · Aug-01-2022, 04:21 AM

suggestion please

***snippsat*** · (This post was last modified: Aug-01-2022, 11:55 AM by snippsat.)

(Jul-31-2022, 08:58 PM)sshree43 Wrote: when i added some more tables in pdf then its error out and also headers are missing into it

If you make the pdf should try to keep all info in the tables then it easier to parse.
Now is text header between tables,then need to use .extract_text() to get header info as .extract_table() will not get that info.
Just running code i posted before with newinputdata.pdf get all tables put not header text between

>>> df
0      BL Date Vessel Destination CHARTERERS    API
1   6-Jan-2022  Test1       Test2      Test3  10.00
2   6-Jan-2022  Test1       Test2      Test3  10.00
3   6-Jan-2022  Test1       Test2      Test3  10.00
4   6-Jan-2022  Test1       Test2      Test3  10.00
5   6-Jan-2022  Test1       Test2      Test3  10.00
..         ...    ...         ...        ...    ...
70  6-Jan-2022  Test1       Test2      Test3  10.00
71  6-Jan-2022  Test1       Test2      Test3  10.00
72  6-Jan-2022  Test1       Test2      Test3  10.00
73  6-Jan-2022  Test1       Test2      Test3  10.00
74  6-Jan-2022  Test1       Test2      Test3  10.00

[74 rows x 5 columns]

If use .extract_text() will get all,but have to clean up that i will not look into now.

import pdfplumber

pdf_file = "newinputdata.pdf"
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages

Test usage of code over.

>>> pages
[<Page:1>, <Page:2>]
>>> pages[0]
<Page:1>
>>> pages[0].extract_text()
('Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '10-Jan-2022 Test2 Test3 Test4 10.00\n'
 '18-Jan-2022 Test3 Test4 Test5 10.00\n'
 '23-Jan-2022 Test4 Test5 Test6 10.00\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '10-Jan-2022 Test2 Test3 Test4 10.00\n'
 'Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00')

sshree43 · (This post was last modified: Aug-01-2022, 08:16 PM by sshree43.)

Hi Expert,
Sorry, but i do not understand your code. Can you please write exact code for header and multiple table available in pdf. that is in newinputdata.pdf that sit

sshree43 · Aug-02-2022, 09:45 AM

Any other suggestion please

arvin · Dec-12-2022, 10:34 AM

Hello.
I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data, all the data comes in one cell only and I want in separate cells.

How could I do that?

For your reference:
"15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
"11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
"11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr

Thanks in advance.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Using SQLAlchemy, prevent SQLite3 table update by multiple program instances	Calab	2	1,714	Aug-09-2023, 05:51 PM Last Post: Calab
	python multiprocessing help -- to extract 10 sql table into csv	mg24	3	2,709	Nov-20-2022, 11:50 PM Last Post: mg24
	Extract parts of multiple log-files and put it in a dataframe	hasiro	4	3,764	Apr-27-2022, 12:44 PM Last Post: hasiro
	Slittping table into Multiple tables by rows	drunkenneo	1	2,940	Oct-06-2021, 03:17 PM Last Post: snippsat
	Display table field on multiple lines, 'wordwrap'	3python	0	2,391	Aug-06-2021, 08:17 PM Last Post: 3python
	Need help on extract dynamic table data	Dr_Strange	0	3,112	Apr-30-2021, 07:03 AM Last Post: Dr_Strange
	Load the data from multiple source files to one table	amy83	2	3,959	Apr-27-2021, 12:33 AM Last Post: Pedroski55
	How to extract multiple text from a string?	chatguy	2	3,345	Feb-28-2021, 07:39 AM Last Post: bowlofred
	Load data from One oracle Table to Multiple tables	amy83	1	2,315	Dec-02-2020, 01:57 AM Last Post: Larz60+
	How to extract digits in table of image using python	SuSeegio	3	3,999	Dec-05-2018, 10:47 AM Last Post: Larz60+

extract table from multiple pages

User Panel Messages

Announcements