Python Forum
extract table from multiple pages
#1
Hi Expert,

I am trying to extract a table that spans multiple PDF pages, but currently I am getting only 2 pages and the page header. The source PDF (test.pdf), the output file (Outputfile.csv) and the code (codetext.txt) are added as attachments.

Expectation: it should read the entire data from the PDF. Currently it is reading only partial data.

Here is my code:

import tabula
import requests
import csv
import pandas as pd

import re
import parse
import pdfplumber
from collections import namedtuple
import datetime
from datetime import date
import os
import glob
import shutil
from os import path

# Using pdfplumber, extract every port name, grade name and reporting month to add to the cleaned data frame.


# ------------------------------------File name
file = "C:\\Users\\xxx\\Downloads\\test.pdf"

lines = []

pnames = []
gnames = []
mreports = []
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        try:
            text = page.extract_text()
        except Exception:  # skip pages whose text cannot be extracted
            text = ''
        if text is not None:
            lines += text.split('\n')

for li in lines:
    if "Port:" in li:
        li = li.replace("Port:", "").strip()
        li_new = li.split("Month Reporting:")[0].strip()
        m_repor = li.split("Month Reporting:")[-1].strip()

        if "Grade Name:" in li_new:
            g_name = li_new.split("Grade Name:")[-1].strip()
            p_name = li_new.split("Grade Name:")[0].strip()
            print(li_new)
        else:
            g_name = li_new.split()[1:]
            g_name = ' '.join(g_name).strip()
            p_name = li_new.split()[0].strip()
        pnames.append(p_name)
        gnames.append(g_name)
        mreports.append(m_repor)
print("PortName: ", len(pnames))
print("GradeName: ", len(gnames))
print("MonthReporting: ", len(mreports))

# Use tabula to extract all the tables from the PDF; each table is cleaned before the final join.
df = tabula.read_pdf(file, pages='all')
final_list = [
    ["PORT NAME", "GRADE NAME", "MONTH REPORTING", "BL DATE", "VESSEL", "DESTINATION", "CHARTERERS", "API"]]
# final_list=[]
print(final_list)
last_df = len(df)
print("Length of tables: ", last_df)

for i in range(0, len(pnames)):
    op_df = df[i]
    op_df = op_df.dropna(how='all')
    op_df_list = op_df.values.tolist()

    for li in op_df_list:
        if str(li[0]) == "nan":
            li = li[1:]
        else:
            print("check this case")
            print(li)
        li.insert(0, pnames[i])
        li.insert(1, gnames[i])
        li.insert(2, mreports[i])
        print(li)
        if "BL Date" in li:
            pass
        else:
            final_list.append(li)
    df_2 = pd.DataFrame(final_list)
    df_2.columns = df_2.iloc[0]
    df_2 = df_2[1:]
    created_date = datetime.datetime.now().strftime('%d-%b-%y')
    # Assigning a scalar fills every row, so no per-row loop is needed.
    df_2['created_by'] = 'created by'
    df_2['created_date'] = created_date

    print(df_2)
    df_2.rename(
        columns={'PORT NAME': 'port_name', 'GRADE NAME': 'crude', 'MONTH REPORTING': 'reporting_month', 'BL DATE': 'bl_date',
                 'VESSEL': 'vessel', 'DESTINATION': 'destination',
                 'CHARTERERS': 'charterer', 'API': 'api'}, inplace=True)

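    # Column names must match the renamed ones exactly ('charterer', lower case),
    # otherwise reindex creates an empty column.
    df_2 = df_2.reindex(
        columns=["port_name", "crude", "reporting_month", "bl_date", "vessel", "destination", "charterer",
                 "api"])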

    # return df_2


df_2.to_csv('Outputfile.csv', index=False)
print("Successfully generated output CSV")
Larz60+ wrote Jul-31-2022, 12:52 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Fixed for you this time. Please use BBCode tags in future posts.

Attached Files

.csv   Outputfile.csv (Size: 5.49 KB / Downloads: 146)
.txt   codetext (1).txt (Size: 3.25 KB / Downloads: 111)
.pdf   test.pdf (Size: 146.37 KB / Downloads: 230)
Reply
#2
Yes, the complete code is posted, up to the output.
Reply
#3
Maybe you are making this more complicated than it needs to be.
Here is a test where I put the first 3 pages into a Pandas DataFrame.
The last three pages are different, so I would add API later if needed.
If you change lst[:3] to lst, all pages will be there, but the API values will end up under the BL Date column.
import pdfplumber
import pandas as pd

pdf_file = "test.pdf"
with pdfplumber.open(pdf_file) as pdf:
    lst = [p.extract_table() for p in pdf.pages]

flat_list = [item for sublist in lst[:3] for item in sublist]
df = pd.DataFrame(flat_list)
df.columns = df.iloc[0]
df = df[1:]
>>> df
0        BL Date   Vessel Destination CHARTERERS
1     6/Jan/2022    Test1       Test2      Test3
2    10/Jan/2022    Test2       Test3      Test4
3    18/Jan/2022    Test3       Test4      Test5
4    23/Jan/2022    Test4       Test5      Test6
5    28/Jan/2022    Test5       Test6      Test7
..           ...      ...         ...        ...
139   6/May/2022  Test139     Test140    Test141
140   6/May/2022  Test140     Test141    Test142
141  14/May/2022  Test141     Test142    Test143
142  23/May/2022  Test142     Test143    Test144
143  29/May/2022  Test143     Test144    Test145

[143 rows x 4 columns]

>>> df.dtypes
0
BL Date        object # Need to change to Pandas datetime64
Vessel         object
Destination    object
CHARTERERS     object
dtype: object

>>> df['BL Date'].head()
1     6/Jan/2022
2    10/Jan/2022
3    18/Jan/2022
4    23/Jan/2022
5    28/Jan/2022
Name: BL Date, dtype: object
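The conversion hinted at in the dtypes comment could look roughly like this. It is a minimal sketch on a few hand-typed values, not on the real test.pdf; the format string "%d/%b/%Y" is an assumption based on the dates shown above (a file that uses 6-Jan-2022 would need "%d-%b-%Y" instead).

```python
import pandas as pd

# Hypothetical stand-in for the 'BL Date' column extracted from the PDF
df = pd.DataFrame({"BL Date": ["6/Jan/2022", "10/Jan/2022", "18/Jan/2022"]})

# Parse day/abbreviated-month/year strings into datetime64 values
df["BL Date"] = pd.to_datetime(df["BL Date"], format="%d/%b/%Y")

print(df.dtypes)
```

With a real datetime column, sorting and filtering by date work as expected.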
For the last three pages, change to lst[3:]:
>>> df
0      API
1    10.00
2    10.00
3    10.00
4    10.00
5    10.00
..     ...
139  10.00
140  10.00
141  10.00
142  10.00
143  10.00

[143 rows x 1 columns]
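Adding API later could be done by joining the two frames side by side. This is only a sketch: the row lists below are made-up stand-ins for the flattened extract_table() output, table_to_frame is a hypothetical helper, and it assumes the API rows come in the same order as the main rows (both have 143 rows above).

```python
import pandas as pd


def table_to_frame(rows):
    """Turn a list of rows (first row is the header) into a DataFrame."""
    return pd.DataFrame(rows[1:], columns=rows[0])


# Hypothetical stand-ins for the flattened lst[:3] and lst[3:] tables
main_rows = [
    ["BL Date", "Vessel", "Destination", "CHARTERERS"],
    ["6/Jan/2022", "Test1", "Test2", "Test3"],
    ["10/Jan/2022", "Test2", "Test3", "Test4"],
]
api_rows = [["API"], ["10.00"], ["10.00"]]

main_df = table_to_frame(main_rows)
api_df = table_to_frame(api_rows)

# Joining by position only makes sense if both parts have the same length
if len(main_df) != len(api_df):
    raise ValueError("row counts differ; cannot align API values by position")

combined = pd.concat([main_df, api_df], axis=1)
combined["API"] = pd.to_numeric(combined["API"])
print(combined)
```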
Reply
#4
Hi Expert,

When I added some more tables to the PDF, the code errors out, and the headers are also missing from the output.

A new PDF file (newinputdata.pdf) with a few more tables is attached.

Attached Files

.pdf   newinputdata.pdf (Size: 33.56 KB / Downloads: 98)
Reply
#5
Any suggestions, please?
Reply
#6
(Jul-31-2022, 08:58 PM)sshree43 Wrote: when i added some more tables in pdf then its error out and also headers are missing into it
If you create the PDF yourself, try to keep all the info inside the tables; then it is easier to parse.
Now there is header text between the tables, so you need to use .extract_text() to get the header info, as .extract_table() will not return it.
Just running the code I posted before on newinputdata.pdf gets all the tables, but not the header text between them:
>>> df
0      BL Date Vessel Destination CHARTERERS    API
1   6-Jan-2022  Test1       Test2      Test3  10.00
2   6-Jan-2022  Test1       Test2      Test3  10.00
3   6-Jan-2022  Test1       Test2      Test3  10.00
4   6-Jan-2022  Test1       Test2      Test3  10.00
5   6-Jan-2022  Test1       Test2      Test3  10.00
..         ...    ...         ...        ...    ...
70  6-Jan-2022  Test1       Test2      Test3  10.00
71  6-Jan-2022  Test1       Test2      Test3  10.00
72  6-Jan-2022  Test1       Test2      Test3  10.00
73  6-Jan-2022  Test1       Test2      Test3  10.00
74  6-Jan-2022  Test1       Test2      Test3  10.00

[74 rows x 5 columns]
If you use .extract_text() you get everything, but the result has to be cleaned up, which I will not look into now.
import pdfplumber

pdf_file = "newinputdata.pdf"
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
Test usage of the code above:
>>> pages
[<Page:1>, <Page:2>]
>>> pages[0]
<Page:1>
>>> pages[0].extract_text()
('Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '10-Jan-2022 Test2 Test3 Test4 10.00\n'
 '18-Jan-2022 Test3 Test4 Test5 10.00\n'
 '23-Jan-2022 Test4 Test5 Test6 10.00\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test 1333 2       Grade Name: Testnew '
 '1                                               Month '
 'Reporting:                      Jan-24\n'
 'BL Date Vessel Destination API CHARTERERS\n'
 '6-Jan-2022 Test1 Test2 10.00 Test3\n'
 'Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '10-Jan-2022 Test2 Test3 Test4 10.00\n'
 'Port:                Test         Grade Name: '
 'Testnew                                                Month '
 'Reporting:                      Jan-22\n'
 'BL Date Vessel Destination CHARTERERS API\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00\n'
 '6-Jan-2022 Test1 Test2 Test3 10.00')
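The cleanup of that text could be approached along these lines. This is a rough sketch, not tested against the real newinputdata.pdf: parse_page_text and HEADER_RE are hypothetical names, the sample string is a shortened copy of the output above, and it assumes that no cell value contains a space and that every table starts with a "BL Date ..." header line.

```python
import re

import pandas as pd

# Matches the "Port: ... Grade Name: ... Month Reporting: ..." line above a table
HEADER_RE = re.compile(
    r"Port:\s*(?P<port>.*?)\s+Grade Name:\s*(?P<grade>.*?)"
    r"\s+Month Reporting:\s*(?P<month>\S+)"
)


def parse_page_text(text):
    """Attach the Port/Grade/Month header info to each table row below it."""
    records, meta, columns = [], {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match:
            meta = {"port_name": match["port"], "crude": match["grade"],
                    "reporting_month": match["month"]}
            continue
        if line.startswith("BL Date"):
            # Column order differs between tables, so read it from each header
            columns = ["BL Date"] + line[len("BL Date"):].split()
            continue
        values = line.split()
        if columns and len(values) == len(columns):
            records.append({**meta, **dict(zip(columns, values))})
    return pd.DataFrame(records)


sample = (
    "Port:                Test         Grade Name: Testnew      Month Reporting:      Jan-22\n"
    "BL Date Vessel Destination CHARTERERS API\n"
    "6-Jan-2022 Test1 Test2 Test3 10.00\n"
    "10-Jan-2022 Test2 Test3 Test4 10.00\n"
    "Port:                Test 1333 2       Grade Name: Testnew 1      Month Reporting:      Jan-24\n"
    "BL Date Vessel Destination API CHARTERERS\n"
    "6-Jan-2022 Test1 Test2 10.00 Test3\n"
)
df = parse_page_text(sample)
print(df)
```

For a whole file, the text of every page (page.extract_text()) could be joined with newlines and passed to the same function.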
Reply
#7
Hi Expert,
Sorry, but I do not understand your code. Can you please write the exact code for the headers and the multiple tables in the PDF, that is, in newinputdata.pdf? That's it.
Reply
#8
Any other suggestions, please?
Reply
#9
Hello.
I have a PDF which has data in tabular format with 6 columns, but the columns are not separated by borders. When I extract the data, everything ends up in one cell, and I want it in separate cells.

How could I do that?

For your reference:
"15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
"11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
"11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr

Thanks in advance.
Reply

