Python Forum
PDFminer outputs unreadable text during conversion from PDF to TXT
#1
I need to convert the text of a PDF to TXT. I am trying to do it with pdfminer, and I have tried other libraries too, all without success. How can I solve this problem?
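from pdfminer.high_level import extract_text_to_fp

# Open the PDF in binary mode and write the extracted text as UTF-8.
# 'w' replaces the old output instead of appending to it on every run.
with open('2.pdf', 'rb') as pdf_file, open('2.txt', 'w', encoding='utf-8') as txt_file:
    extract_text_to_fp(pdf_file, txt_file)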
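
Since the TXT is meant to feed further processing, it may help to check automatically whether an extraction came out readable. Below is a minimal sketch of such a check using only the standard library. The function names, the 0.6 threshold, and the sample strings are assumptions made up for illustration; the garbled sample is not taken from the attached a.txt.

```python
def cyrillic_ratio(text):
    """Return the share of alphabetic characters that are Cyrillic (0.0 if none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # U+0400..U+04FF is the basic Cyrillic block
    cyrillic = [ch for ch in letters if '\u0400' <= ch <= '\u04ff']
    return len(cyrillic) / len(letters)


def looks_readable(text, threshold=0.6):
    """Heuristic: treat the text as readable Russian if most letters are Cyrillic."""
    return cyrillic_ratio(text) >= threshold


# Hypothetical samples: one readable line and one garbled line
sample_good = 'ПРОЕКТНАЯ ДЕКЛАРАЦИЯ № 30-000303'
sample_bad = 'ÏÐÎÅÊÒÍÀß ÄÅÊËÀÐÀÖÈß'
print(looks_readable(sample_good))
print(looks_readable(sample_bad))
```

A check like this only flags a bad extraction; it does not repair the text.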

Attached files: a.pdf (91.02 KB), a.txt (15.3 KB)
Reply
#2
This is a tricky one! Please let me know if you figure it out!

I thought maybe the table lines confused the text reader, so I got each row from the table, but that did not help!

I thought ftfy might be the answer, but the text comes out the same:

import pymupdf
from pathlib import Path
import ftfy

path2pdf = '/home/pedro/Downloads/a.pdf'
path2text = '/home/pedro/Downloads/a.pdf.txt'

doc = pymupdf.open(path2pdf)
for page in doc:
    tabs = page.find_tables()

# there is only 1 table
tab = tabs[0]
data = []
for line in tab.extract(encoding=pymupdf.TEXT_ENCODING_CYRILLIC):  # print cell text for each row
    print(f'This line has {len(line)} cells')
    print(line)
    data.append(line)

for d in data:
    for i in range(len(d)):
        # replace empty or missing cells with a placeholder
        if d[i] is None or d[i] == '':
            d[i] = 'empty'

row_strings = [''.join(s) for s in data]
text = ''.join(row_strings)
page_text = ftfy.fix_text(text)
I also tried an online OCR but that came back with gibberish too!

Since the pdf displays correctly, the correct information must be in there!
Reply
#3
(Aug-04-2024, 09:15 AM)Pedroski55 Wrote: This is a tricky one! Please let me know if you figure it out!

For now I am converting to a readable format via online services: PDF -> DOCX -> TXT.
In this case, the text comes out line by line and is readable.
But I can't build such a converter in Python.
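
The second half of that chain (DOCX -> TXT) can be done with the standard library alone, because a .docx file is a ZIP archive whose main text lives in word/document.xml. The sketch below assumes the PDF -> DOCX step is still done by the external service; the function names are made up for illustration, and it only reads the main document body (no headers, footers, or footnotes).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_to_lines(docx_file):
    """Return one string per paragraph of a .docx file (path or file-like object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    lines = []
    for paragraph in root.iter(W_NS + 'p'):
        # A paragraph's text is split across <w:t> runs; join them back together
        lines.append(''.join(node.text or '' for node in paragraph.iter(W_NS + 't')))
    return lines


def docx_to_txt(docx_file, txt_path):
    """Write the paragraphs of a .docx file to a UTF-8 text file, one per line."""
    with open(txt_path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(docx_to_lines(docx_file)))
```

Paragraphs inside table cells are ordinary paragraphs in the XML, so they come out in document order, one per line.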
Reply
#4
I searched a lot, but I can't find an answer.

The PDF has 2 embedded fonts. They are both types of Times Roman, which is a common font. The PDF was made on MacOS.
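
For anyone who wants to repeat the font check, here is a rough sketch. It assumes PyMuPDF's page.get_fonts() returns tuples shaped like (xref, ext, type, basefont, name, encoding); the helper names are made up. A font with an Identity-H or Identity-V encoding stores font-internal character IDs, so text extraction depends on the font's ToUnicode map being present and correct.

```python
def describe_fonts(font_entries):
    """Summarize font tuples shaped like PyMuPDF's page.get_fonts() output.

    Each entry is assumed to be (xref, ext, type, basefont, name, encoding).
    """
    report = []
    for xref, ext, font_type, basefont, name, encoding in font_entries:
        # Identity-H/V means the page stores font-internal character IDs
        needs_map = encoding in ('Identity-H', 'Identity-V')
        report.append({
            'basefont': basefont,
            'type': font_type,
            'encoding': encoding,
            'relies_on_tounicode': needs_map,
        })
    return report


def fonts_in_pdf(path):
    """Collect font entries from every page (requires PyMuPDF to be installed)."""
    import pymupdf  # imported lazily so describe_fonts() works without it
    entries = set()
    with pymupdf.open(path) as doc:
        for page in doc:
            entries.update(page.get_fonts())
    return sorted(entries)
```

Calling describe_fonts(fonts_in_pdf(path2pdf)) would list the fonts; a flagged font is only a hint about where the mapping may be going wrong, not proof.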

Maybe if you run the Python script on an Apple computer, you will get the correct output.

I tried saving as binary, then opening but that did not work:

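import pathlib
import pymupdf

# path2pdf and path2text are the same paths as in my earlier post
with pymupdf.open(path2pdf) as doc:  # open document
    # join the text of all pages, separated by form feeds
    text = chr(12).join([page.get_text() for page in doc])
    # write as a binary file to support non-ASCII characters
    pathlib.Path(path2pdf + ".txt").write_bytes(text.encode())

# read the text back as UTF-8
with open(path2text, encoding="utf-8") as f:
    text = f.read()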
Like I said, the PDF displays correctly, so the information must be in there! How can it be extracted?
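
One way to see why the binary write does not help: writing text.encode() as bytes and writing the same string in text mode with encoding="utf-8" produce identical files, and both read back unchanged. The sketch below demonstrates this with a temporary directory (the sample string and file names are made up). If that holds, the file step is lossless and the garbling is already present in the string returned by get_text().

```python
import tempfile
from pathlib import Path

text = 'ПРОЕКТНАЯ ДЕКЛАРАЦИЯ'

with tempfile.TemporaryDirectory() as tmp:
    as_bytes = Path(tmp) / 'bytes.txt'
    as_text = Path(tmp) / 'text.txt'

    # Variant 1: encode manually and write bytes (str.encode() defaults to UTF-8)
    as_bytes.write_bytes(text.encode())
    # Variant 2: let Python do the encoding while writing text
    as_text.write_text(text, encoding='utf-8')

    # Both files hold the same bytes, and reading them back restores the string
    same_bytes = as_bytes.read_bytes() == as_text.read_bytes()
    round_trip = as_bytes.read_text(encoding='utf-8') == text

print(same_bytes, round_trip)
```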
Reply
#5
(Aug-05-2024, 07:52 AM)Gromila131 Wrote:
(Aug-04-2024, 09:15 AM)Pedroski55 Wrote: This is a tricky one! Please let me know if you figure it out!

I didn't tell you why I'm doing this.

I'm currently creating a Python program that will analyze apartment sales. In my country, developers post a project declaration on the website every month, which indicates the number of apartments sold (it's in PDF format).

I need to take the information from there and put it into an Excel table. Maybe I can take it directly from the PDF file? For now I can only extract it from TXT. Here's my code:

from openpyxl import load_workbook


def find_line(lines, marker, offset):
    """Return the line `offset` positions after the last line containing `marker`."""
    result = None
    for index, line in enumerate(lines):
        if marker in line:
            result = lines[index + offset]
    if result is None:
        raise ValueError(f'marker {marker!r} not found')
    return result


def main():
    pd_txt = 'obj59216-pd30-000303 (2).txt'

    # Read the text file once instead of reopening it for every field
    with open(pd_txt) as text_file:
        lines = text_file.readlines()

    # Name of the residential complex: two lines below marker 10.6.1
    name_of_the_building = find_line(lines, '10.6.1', 2)
    print("Name of the residential complex:")
    print(name_of_the_building)

    # Completion deadline: nine lines below marker 17.1 (5)
    deadline = find_line(lines, '17.1 (5)', 9)
    print("Completion deadline:")
    print(deadline)

    # Total apartments: the line after "Количество жилых помещений:"
    # ("Number of residential premises:")
    total_apartments = find_line(lines, 'Количество жилых помещений:', 1)
    print("Total apartments:")
    print(total_apartments)

    # Apartments sold: two lines below marker 19.7.1.1.1.1
    sold_apartments = find_line(lines, '19.7.1.1.1.1', 2)
    print("Apartments sold:")
    print(sold_apartments)

    # Square metres sold: the value after ": ", without the last three characters
    sold_meters = find_line(lines, '19.7.2.1.1.1', 1).split(": ")[1][:-3]
    print("Square metres sold:")
    print(sold_meters)

    # Money received (₽): two lines below the marker, without the last five characters
    money_received = find_line(lines, '19.7.3.1.1.1', 2)[:-5]
    print("Money received (₽):")
    print(money_received)

    # The line after "ПРОЕКТНАЯ ДЕКЛАРАЦИЯ" ("PROJECT DECLARATION")
    # holds both the declaration number and its upload date
    declaration_line = find_line(lines, 'ПРОЕКТНАЯ ДЕКЛАРАЦИЯ', 1)
    project_declaration_date = declaration_line[-8:]
    print("Project declaration upload date:")
    print(project_declaration_date)
    project_declaration_num = declaration_line[2:11]
    print("Project declaration number:")
    print(project_declaration_num)

    # Existing Excel file
    existing_file = 'excel1.xlsx'
    # New row to append
    new_row = [project_declaration_num, project_declaration_date, name_of_the_building, deadline,
               total_apartments, sold_apartments, sold_meters, money_received]
    # Load the existing workbook
    wb = load_workbook(existing_file)
    # Select the sheet named after the declaration number
    ws = wb[str(project_declaration_num)]
    # Append the new row and save the workbook
    ws.append(new_row)
    wb.save(existing_file)


main()
Maybe someone will find it useful
Reply
#6
I noticed that some Chinese PDFs internally use strange combinations of Chinese characters to represent a letter or letters. Something inside the PDF maps those combinations of characters to other characters.

I don't know enough about the internal structure of PDFs to figure that out!
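
As a rough illustration of that idea: a PDF page stores character codes, the font supplies the glyph shape for each code, and a separate code-to-Unicode map (the ToUnicode map) tells extractors what each code means. A viewer only needs the glyphs, so the page can display correctly while extraction goes wrong. The toy model below is a sketch with made-up codes and maps, not a description of what is actually inside a.pdf.

```python
# The page content: a sequence of character codes (made up for this example)
page_codes = [1, 2, 3, 4, 5, 6, 3]

# What the viewer draws: each code selects a glyph shape in the font
glyph_shapes = {1: 'П', 2: 'р', 3: 'о', 4: 'д', 5: 'а', 6: 'н'}


def extract(codes, to_unicode):
    """Translate codes to text with the given map; unknown codes become U+FFFD."""
    return ''.join(to_unicode.get(code, '\ufffd') for code in codes)


# A correct code-to-Unicode map agrees with the glyphs
good_map = dict(glyph_shapes)
# A wrong map: the extractor treats the codes as offsets into Latin letters
bad_map = {code: chr(0x40 + code) for code in glyph_shapes}

print(extract(page_codes, good_map))  # matches what the viewer shows
print(extract(page_codes, bad_map))   # same page, unreadable text
```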

Have you managed to successfully get the Russian text from the PDF using Python?
Reply
#7
You can get the data in PDF tables to Pandas like this, but I still have that weird stuff:

import pandas as pd
import fitz # aka pymupdf

path2pdf = '/home/pedro/Downloads/a.pdf'

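# find the tables on each page; after the loop, tabs holds the tables of the last page
doc = fitz.open(path2pdf)
for page in doc:
    tabs = page.find_tables()
# build a DataFrame from the first table found
df = pd.DataFrame(tabs[0].extract())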
You can save the df as Excel easily.
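
Note that the loop above keeps only the tables of the last page, which is fine for a one-page PDF. For a multi-page declaration, a sketch like the following could gather the rows of every table first. It assumes each page offers find_tables() and that the returned object exposes the found tables as a .tables list, as in PyMuPDF; the function name is made up.

```python
def collect_table_rows(pages):
    """Return the rows of every table on every page as one list of lists."""
    rows = []
    for page in pages:
        # find_tables() is called per page; .tables holds the tables it found
        for table in page.find_tables().tables:
            rows.extend(table.extract())
    return rows
```

With PyMuPDF and pandas installed, pd.DataFrame(collect_table_rows(fitz.open(path2pdf))).to_excel('a.xlsx', index=False) should then write everything to one sheet. The output name 'a.xlsx' is only an example, and to_excel needs an Excel engine such as openpyxl.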

Please let me know if you find out how to get the proper text!
Reply

