Python Forum

Full Version: Superfluous whitespace found?
hey! I wrote the following innocent little code:

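import glob, os
import PyPDF2

# open the PDF in binary mode; the with block closes the file afterwards
with open("test.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)
    page = reader.getPage(0)       # first page
    extr = page.extractText()      # text of that page as a string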
In the end I want Python to open all PDF files in a folder, find part of a string inside each one and rename the file with it. So I wrote the beginning, and every time I try to run it I get this:

Error:
PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'2' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'3' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'40' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'43' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'46' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'49' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'52' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'55' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'58' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'61' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'64' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'67' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'70' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'73' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'76' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'79' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'82' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'85' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'88' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'91' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'94' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'97' b'0' [pdf.py:1666] 
PdfReadWarning: Superfluous whitespace found in object header b'100' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'103' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'106' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'109' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'112' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'115' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'118' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'121' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'124' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'127' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'130' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'133' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'136' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'139' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'142' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'145' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'148' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'151' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'154' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'157' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'160' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'163' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header 
b'166' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'169' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'172' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'175' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'178' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'181' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'184' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'187' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'190' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'193' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'196' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'199' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'202' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'205' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'208' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'211' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'214' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'217' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'220' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'223' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'226' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'229' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'232' b'0' [pdf.py:1666] PdfReadWarning: Superfluous 
whitespace found in object header b'235' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'238' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'241' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'38' b'0' [pdf.py:1666]
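For context, the lines above are PdfReadWarning messages rather than errors: PyPDF2 prints them and keeps reading the file. As far as I understand it, the "object header" they mention belongs to an object inside the PDF file itself, not to a page header, so the extracted string is not the cause. A minimal sketch of hiding just this message with the standard warnings module follows; the function name is made up for illustration. PdfFileReader in PyPDF2 1.x also takes a strict argument, and passing strict=False should, to my knowledge, skip this check.

```python
import warnings


def silence_superfluous_whitespace():
    """Hide only the 'Superfluous whitespace' warnings; other warnings still show."""
    # `message` is a regular expression matched against the start of the warning text
    warnings.filterwarnings(
        "ignore",
        message="Superfluous whitespace found in object header",
    )


# call this once, before opening any PDF
silence_superfluous_whitespace()
```

The filter is process-wide, so it only needs to be installed once at the top of the script.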
I tried returning it as a string, adding .strip() to the end, and removing whitespace in a few similar ways, but nothing seems to work. I opened the file in Word and it had nothing in the headers. What is wrong here?
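For the end goal described in this post (open every PDF in a folder, find part of its text, rename the file after it), a rough sketch could look like the following. It assumes the legacy PyPDF2 1.x API used in the snippet; the function names, the regular expression argument and the file-name cleanup rule are all illustrative assumptions, not something from the original code.

```python
import glob
import os
import re


def extract_first_page_text(pdf_path):
    """Return the text of page 0 using the legacy PyPDF2 1.x API."""
    import PyPDF2  # imported lazily so the renaming logic can be tried without it

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return reader.getPage(0).extractText()


def rename_pdfs(folder, pattern, extract_text=extract_first_page_text):
    """Rename each PDF in `folder` after the first regex match found in its text.

    Returns a list of (old_path, new_path) pairs for the files that were renamed.
    """
    renamed = []
    # sorted() builds the full list first, so renamed files are not visited again
    for old_path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        match = re.search(pattern, extract_text(old_path))
        if match is None:
            continue  # nothing usable in this file, leave it alone
        # keep only characters that are safe in a file name
        safe_name = re.sub(r"[^A-Za-z0-9_.-]+", "_", match.group(0)).strip("_")
        new_path = os.path.join(folder, safe_name + ".pdf")
        # never overwrite an existing file
        if safe_name and not os.path.exists(new_path):
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Passing the text extractor in as a parameter keeps the renaming loop independent of the PDF library, so a different extractor can be swapped in if PyPDF2 returns empty text for some files.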
Hello,
I am facing the same issue with certain PDFs.
I tried converting the PDFs to PDF/A using this converter:
https://www.pdftron.com/pdf-tools/pdfa-converter/
But when I tried to extract the text, it was completely blank :(

Is there any alternative?

I am trying to extract certain words from the PDF and put them in a list.
If there is a better way of doing this, kindly let me know. Here is my code.

import PyPDF2

def read_data_from_pdf_using_pypdf(key_phrases_list, file_name):
    print("File Name:", file_name)
    final_data_list_unchecked = []
    # open the pdf file
    reader = PyPDF2.PdfFileReader(file_name)

    # get number of pages
    num_pages = reader.getNumPages()
    print("Number of pages", num_pages)
    for page_number in range(num_pages):
        if page_number == 1:
            print("Please wait...")
        # get page object
        page_obj = reader.getPage(page_number)
        # extract text from page object
        text = page_obj.extractText()
        # split the text on whitespace into a list of words
        page_text_list = text.split()
        print("Length of list is", len(page_text_list))

        # for each key phrase, collect the words on this page that start with it
        for key_phrase in key_phrases_list:
            for word in page_text_list:
                if word.startswith(key_phrase):
                    print(word)
                    final_data_list_unchecked.append(word)
    # remove duplicates from the list while keeping first-seen order
    final_data_list_checked = list(dict.fromkeys(final_data_list_unchecked))
    return final_data_list_checked
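On the "better way" question, the word-filtering part of the function above does not depend on the PDF library, so it can be pulled out and tried on plain strings. A possible sketch, with a hypothetical helper name, relying on the fact that str.startswith accepts a tuple of prefixes:

```python
def words_starting_with(text, key_phrases):
    """Return the unique whitespace-separated words in `text` that start with any key phrase."""
    prefixes = tuple(key_phrases)  # str.startswith accepts a tuple of prefixes
    matches = [word for word in text.split() if word.startswith(prefixes)]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(matches))
```

One difference to be aware of: results come out in the order the words appear in the text, rather than grouped by key phrase as in the nested loops.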
Update:
Hello, I temporarily found a way around this:
import textract and use its pdftotext method.
Note that this is also a bit buggy, so you may have to install the Poppler binaries and run pdftotext from the command line.

https://stackoverflow.com/questions/1838...9#53960829
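If running pdftotext from the command line works, the same call can be made from Python with subprocess. This is only a sketch under the assumption that Poppler's pdftotext is installed and on PATH; the two function names are invented for illustration, and the -layout option is optional.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for Poppler's pdftotext; "-" sends the text to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical page layout
    command += [pdf_path, "-"]
    return command


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext (must be installed and on PATH) and return the extracted text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,  # needs Python 3.7 or newer
        check=True,           # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be split and searched the same way as the output of extractText().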