PDFminer outputs unreadable text during conversion from PDF to TXT


Pedroski55

Aug-05-2024, 08:47 AM

I searched a lot, but I can't find an answer.

The PDF has two embedded fonts. Both are variants of Times Roman, which is a common font. The PDF was made on macOS.

Maybe if you run the Python script on an Apple computer, you will get the correct output.

I tried saving the extracted text as binary and then reading it back, but that did not work:

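import pathlib
import pymupdf  # PyMuPDF

# path2pdf is assumed to be defined earlier as the path to the PDF (a string)
path2text = path2pdf + ".txt"

with pymupdf.open(path2pdf) as doc:  # open document
    # join the page texts, using a form feed (chr(12)) as page separator
    text = chr(12).join([page.get_text() for page in doc])
    # write as a binary file to support non-ASCII characters
    pathlib.Path(path2text).write_bytes(text.encode("utf-8"))

with open(path2text, encoding="utf-8") as f:  # read the same file back
    text = f.read()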
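
A likely reason the binary save changed nothing: encoding a Python string to UTF-8 bytes and decoding it again is lossless, so whatever get_text() returned comes back unchanged, garbled or not. Here is a minimal sketch of that round trip, using a made-up sample string (not text from the actual PDF):

```python
# Hypothetical sample standing in for extracted text: non-ASCII
# characters plus a form feed (chr(12)) as page separator.
sample = "café, naïve, ﬁne" + chr(12) + "Привет"

encoded = sample.encode("utf-8")    # what write_bytes() stores on disk
decoded = encoded.decode("utf-8")   # what open(..., encoding="utf-8") reads back

# The round trip reproduces the string exactly.
round_trip_ok = decoded == sample
print(round_trip_ok)
```

So if the text is already unreadable right after extraction, the problem lies in the extraction step, not in how the file is written or read.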

As I said, the PDF displays correctly, so the information must be in there! How can it be extracted?
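
One thing to keep in mind: a viewer only needs the glyph shapes to draw a page, while text extraction depends on the font's mapping from character codes to Unicode, so a PDF can display correctly and still extract as junk. As a rough way to check extracted text automatically, here is a sketch of a heuristic that counts characters that are rarely legitimate in normal text. The function name suspicious_ratio and the idea of a cutoff are my own assumptions, not part of PyMuPDF or PDFminer, and it will not catch text that was mapped to valid but wrong letters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Cc = control, Co = private use, Cn = unassigned;
        # U+FFFD is the Unicode replacement character.
        if category in ("Cc", "Co", "Cn") or c == "\ufffd":
            bad += 1
    return bad / len(chars)


print(suspicious_ratio("Plain readable text"))
print(suspicious_ratio("\ue000\ue001ab"))
```

If the ratio is high for a page, falling back to OCR on a rendered image of that page may be worth trying; where to put the cutoff is a judgment call.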

