Table extraction from scanned PDF

RupamKundu · Aug-02-2019, 02:54 AM

Hi everyone,

An amateur Python developer here. I am trying to do some text extraction from a scanned PDF. The method I am following is scanned PDF to image to text (using Tesseract). I got reasonably good results when the PDF contained only text.
But when the PDF had tables in it, I did not get any coherent results, i.e., data from different rows and columns overlapped each other.
I'm looking for some help extracting the tables from a scanned PDF - any and all ideas are much appreciated!

**Larz60+** · Aug-03-2019, 02:59 AM

Use Camelot: https://python-camelot.readthedocs.io/en/latest/
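
One caveat: the Camelot documentation says it works on text-based PDFs and not on scanned documents, so a scanned file would need an OCR step first. As a rough sketch of an alternative that stays within the Tesseract workflow from the question, the word positions that Tesseract reports can be used to rebuild rows and cells. This assumes `pytesseract` and Pillow are installed; the helper names `ocr_words`, `words_to_rows` and `row_to_cells`, and the pixel tolerances `row_tol` and `col_gap`, are made up for illustration and would need tuning for a real scan.

```python
def ocr_words(image_path):
    """Run Tesseract on a page image and return one dict per recognized word."""
    # Imported here so the grouping helpers below work without OCR installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return [
        {"text": text, "left": left, "top": top, "width": width}
        for text, left, top, width in zip(
            data["text"], data["left"], data["top"], data["width"]
        )
        if text.strip()  # skip the empty entries Tesseract emits for layout blocks
    ]


def words_to_rows(words, row_tol=10):
    """Group words into rows: a word joins the current row if its top edge
    is within row_tol pixels of the first word in that row."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


def row_to_cells(row, col_gap=40):
    """Merge neighbouring words into one cell unless the horizontal gap
    between them is larger than col_gap pixels (assumed column boundary)."""
    cells = []
    prev_right = None
    for word in sorted(row, key=lambda w: w["left"]):
        if prev_right is not None and word["left"] - prev_right <= col_gap:
            cells[-1] += " " + word["text"]
        else:
            cells.append(word["text"])
        prev_right = word["left"] + word["width"]
    return cells


def extract_table(words, row_tol=10, col_gap=40):
    """Return the table as a list of rows, each a list of cell strings."""
    return [row_to_cells(row, col_gap) for row in words_to_rows(words, row_tol)]
```

With a page image saved as, say, `page1.png` (a placeholder name), `extract_table(ocr_words("page1.png"))` would return a list of rows. This only handles simple grids: skewed scans, merged cells and multi-line cells would need more work, such as deskewing the image or detecting the ruling lines first.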
