Python Forum
pdf to mp3
#1
I'm mostly a MicroPython guy on ESP32 devices, but I recently ran across a lot of Python programming videos on YouTube. One caught my eye and I have been trying to get it working in PyCharm. It is a program that converts a PDF file to text and then converts that text to an mp3 file, an audio file speaking the text that was in the PDF. Here is a link to the video: https://www.youtube.com/watch?v=LXsdt6RMNfY She is using oneAPI, which also requires Visual Studio. I have installed them but they are a bit intimidating.
It is mostly working in PyCharm. It does print out the text, but instead of speaking the words it says each letter in the file.

This is my version of her program. Hers had some deprecation issues, though it was a fairly recent video.
Not sure how I did it, but at one point it displayed the output text file with only one letter per line, which would kind of explain why it is saying each letter instead of the words.
I tried writing the text to a file but got an error: file must have a 'write' attribute
I've included test_pdf.pdf
import pyttsx3,PyPDF2
import pickle
from PyPDF2 import PdfReader

reader = PdfReader('test_pdf.pdf')
speaker = pyttsx3.init()
text = reader.pages[0]
file = text.extract_text()
clean_text = file.strip().replace('\n', ' ')
#with open('resume.txt', 'wb') as pickle_file:
 #   pickle.dump(file, 'pickle_file')
print(file)

#pickle.dump(clean_text, 'resume.txt')
#name mp3 file whatever you would like
speaker.say(file)   
#speaker.save_to_file(file, 'story.mp3')
speaker.runAndWait()

speaker.stop()
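On the "file must have a 'write' attribute" error: `pickle.dump` expects an open file object as its second argument, so passing a filename string such as 'resume.txt' raises that error. For plain extracted text, pickle is not needed at all. A minimal sketch, assuming the goal is just a readable text file (the helper name `save_text` is made up for illustration):

```python
def save_text(text, path):
    """Write `text` to `path` as UTF-8 and return the number of characters written."""
    # open() returns a file object, which is what has the 'write' method
    # that pickle.dump was complaining about.
    with open(path, "w", encoding="utf-8") as handle:
        return handle.write(text)
```

It could be called as `save_text(clean_text, 'resume.txt')`. If pickle is really wanted, the same idea applies: open the file in 'wb' mode and pass the file object, not its name, to `pickle.dump`.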
Yoriz write Dec-30-2022, 04:05 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.

Attached Files

.pdf   test_pdf.pdf (Size: 8.48 KB / Downloads: 119)
#2
Ok. So, this is not an issue with your program. This is the result of the text formatting on your PDF.

PDF files are kinda weird when it comes to text, and these funky results are very common when extracting text.

If you look closely at the output of extract_text() you will see that there is one extra space " " between each character. One way to get rid of it is this:

import pyttsx3
from PyPDF2 import PdfReader

reader = PdfReader('test_pdf.pdf')
speaker = pyttsx3.init()
page = reader.pages[0]
raw_text = page.extract_text()

# Split on single spaces. The padding between letters disappears and each
# real gap between words leaves empty strings in the list.
clean_text = raw_text.split(" ")

# Turn every empty string back into a space so word boundaries survive.
clean_text = [c if c != "" else " " for c in clean_text]

# Glue the characters back together into words.
clean_text = "".join(clean_text)

# Collapse any leftover runs of whitespace into single spaces.
clean_text = " ".join(clean_text.split())

# Be sure to pass `clean_text` to the speaker.
speaker.say(clean_text)
speaker.runAndWait()
speaker.stop()
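The same cleanup can be wrapped in a small function so it is easy to try on a sample string before handing anything to the speaker. This is only a sketch: the name `collapse_spaced_letters` is made up, and it assumes every character in the extracted text is padded with one space, so that two or more whitespace characters in a row mark a real word gap. On normally spaced text it would glue all the words together.

```python
import re


def collapse_spaced_letters(text):
    """Undo 'one extra space between every character' extraction output."""
    # Runs of 2+ whitespace characters are real word gaps; protect them
    # with a placeholder that should not appear in extracted text.
    marked = re.sub(r"\s{2,}", "\x00", text.strip())
    # The single spaces that remain are just padding between letters.
    unpadded = marked.replace(" ", "")
    # Restore the real word gaps as single spaces.
    return unpadded.replace("\x00", " ")
```

Printing `repr()` of the raw `extract_text()` result is a quick way to check whether the padding assumption holds for a given PDF.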
I am absolutely sure at some point a Python Magician will come with a much better and optimised solution, tho. Be sure to come back later to check for new answers.
#3
Thanks very much for your insight and solution!
#4
You have to clean the text as carecavoador does, which is OK.
Another solution is to use pdfplumber, which is good at extracting text from PDFs.
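import pdfplumber
import pyttsx3

speaker = pyttsx3.init()
pdf_file = "test_pdf.pdf"
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    # Queue the text of every page that has extractable text.
    for pg in pages:
        text = pg.extract_text()
        if text:
            speaker.say(text)
    page_count = len(pages)

# Announce the page count once, after all pages have been queued.
noun = "page" if page_count == 1 else "pages"
speaker.say(f'Read {page_count} {noun} from {pdf_file}')
speaker.runAndWait()
speaker.stop()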
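Since the thread title asks for an mp3, here is a rough sketch of saving the speech to a file instead of playing it, using the `save_to_file` call that was commented out in the first post. The helper name `narrate` and its parameters are made up for illustration. It takes the engine as an argument so it can be tried without a sound device. One caveat stated as an assumption: the audio format pyttsx3 writes depends on the platform's speech driver, so a file named story.mp3 may not contain real MP3 data.

```python
def narrate(engine, pages, out_path=None):
    """Speak page texts on a pyttsx3-style engine, or save them to out_path.

    `pages` is a list of strings (None or empty entries are skipped).
    Returns the number of pages that actually contained text.
    """
    texts = [t.strip() for t in pages if t and t.strip()]
    full_text = " ".join(texts)
    if full_text:
        if out_path is None:
            engine.say(full_text)  # play through the speakers
        else:
            engine.save_to_file(full_text, out_path)  # queue a file write
        engine.runAndWait()  # nothing is spoken or saved until this runs
    return len(texts)
```

With pdfplumber it could be called as `narrate(pyttsx3.init(), [pg.extract_text() for pg in pdf.pages], out_path='story.mp3')` inside the `with` block.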
#5
As the old saying goes, there is more than one way to skin a cat. I'm still pretty new to Python and amazed at all the prewritten utilities that are available. Glad I rediscovered this site; I signed up quite some time ago but had forgotten about it.