Python Forum

Full Version: pdf to mp3
I'm mostly a MicroPython guy on ESP32 devices, but I recently ran across a lot of Python programming videos on YouTube. One caught my eye and I have been trying to get it working in PyCharm. It is a program that extracts the text from a PDF file and then converts that text to an mp3 file, which is an audio file speaking the text that was in the PDF. Here is a link to the video: https://www.youtube.com/watch?v=LXsdt6RMNfY She is using oneAPI, which also requires Visual Studio. I have installed them, but they are a bit intimidating.
It is mostly working in PyCharm... it does print out the text, but instead of speaking the words it is saying each letter in the file.

This is my version of her program... hers had some deprecation issues, though it was a fairly recent video.
Not sure how I did it, but at one point it displayed the output text file and it had only one letter per line, which would kind of explain why it is saying each letter instead of the words.
I tried writing the text to a file but got an error... file must have a 'write' attribute
I've included test_pdf.pdf
import pyttsx3,PyPDF2
import pickle
from PyPDF2 import PdfReader

reader = PdfReader('test_pdf.pdf')
speaker = pyttsx3.init()
text = reader.pages[0]
file = text.extract_text()
clean_text = file.strip().replace('\n', ' ')
#with open('resume.txt', 'wb') as pickle_file:
 #   pickle.dump(file, 'pickle_file')
print(file)

#pickle.dump(clean_text, 'resume.txt')
#name mp3 file whatever you would like
speaker.say(file)   
#speaker.save_to_file(file, 'story.mp3')
speaker.runAndWait()

speaker.stop()
Ok. So, this is not an issue with your program. This is the result of the text formatting on your PDF.

PDF files are kinda weird when it comes to text, and these funky results are very common when extracting text.

If you look closely at the output of extract_text() you will see that there is one extra space " " between each character. One way to get rid of it is this:

import pyttsx3
from PyPDF2 import PdfReader

reader = PdfReader('test_pdf.pdf')
speaker = pyttsx3.init()
text = reader.pages[0]
file = text.extract_text()

# Split on single spaces. Each letter becomes its own item, and the
# wider gaps between words leave empty strings in the list.
clean_text = file.split(" ")

# Turn every empty string (a real gap between words) back into a space.
clean_text = [c if c != "" else " " for c in clean_text]

# Glue the letters back together into words.
clean_text = "".join(clean_text)

# Collapse any remaining runs of whitespace into single spaces.
clean_text = " ".join(clean_text.split())

# Be sure to pass `clean_text` to the speaker.
speaker.say(clean_text)
speaker.runAndWait()
speaker.stop()
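The same cleanup can be wrapped in a small function, so it can be tried on a plain string without a PDF or a speech engine. This is only a sketch: the name collapse_spaced_letters and the sample string are made up for illustration, and it assumes the extracted text really has a single space between letters and a wider gap between words.

```python
def collapse_spaced_letters(text: str) -> str:
    """Undo 'H e l l o   w o r l d' style output from a PDF text extractor."""
    # Split on single spaces: a run of several spaces leaves empty strings.
    parts = text.split(" ")
    # An empty string marks a real gap between words, so make it a space again.
    parts = [p if p != "" else " " for p in parts]
    # Glue the letters together, then reduce every whitespace run to one space.
    return " ".join("".join(parts).split())


# Hypothetical sample: one space between letters, three between words.
sample = "H e l l o   w o r l d"
print(collapse_spaced_letters(sample))
```

Note that text with normal spacing comes out with its words glued together ("Hello world" becomes "Helloworld"), so this only makes sense when the extraction output shows the letter-by-letter pattern.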
I am absolutely sure at some point a Python Magician will come with a much better and optimised solution, tho. Be sure to come back later to check for new answers.
Thanks very much for your insight and solution.
You have to clean the text as carecavoador does, which is OK.
Another solution is to use pdfplumber, which is good at extracting text from PDFs.
import pdfplumber
import pyttsx3

speaker = pyttsx3.init()
pdf_file = "test_pdf.pdf"
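with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for pg in pages:
        # Fall back to an empty string if a page has no extractable text.
        text = pg.extract_text() or ""
        speaker.say(text)
    # Announce the page count once, after all pages have been queued.
    speaker.say(f'There were {len(pages)} pages in {pdf_file}')
    speaker.runAndWait()
    speaker.stop()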
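Going back to the "file must have a 'write' attribute" error from the first post: pickle.dump() expects an open file object as its second argument, and the commented-out calls pass a string ('pickle_file' or 'resume.txt') instead. For plain text, pickle isn't needed at all. A minimal sketch, where save_text and save_pickle are hypothetical helper names and the output goes to a temporary folder rather than the project directory:

```python
import pickle
import tempfile
from pathlib import Path


def save_text(text: str, path: Path) -> None:
    # Plain text needs no pickling: open in text mode and write the string.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)


def save_pickle(obj, path: Path) -> None:
    # pickle.dump wants the open file object (binary mode), not a file name.
    with open(path, "wb") as pickle_file:
        pickle.dump(obj, pickle_file)


# Temporary folder so the example does not touch any real files.
out_dir = Path(tempfile.mkdtemp())
save_text("Hello world", out_dir / "resume.txt")
print((out_dir / "resume.txt").read_text(encoding="utf-8"))
```

In the original script, the text-mode version would be enough, for example by writing clean_text to 'resume.txt' once the extraction has been cleaned up.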
As the old saying goes, there is more than one way to skin a cat... still pretty new to Python... amazed at all the pre-written utilities that are available... glad I rediscovered this site... I signed up quite some time ago but had forgotten about it.