Python Forum
Superfluous whitespace found?
#1
hey! I wrote the following innocent little code:

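import glob
import os
import PyPDF2

# open in binary mode; the with-block closes the file afterwards
# (the name "object" is avoided because it shadows a Python builtin)
with open("test.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)
    page = reader.getPage(0)
    extr = page.extractText()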
In the end I want Python to open all PDF files in a folder, find part of a string inside each one, and rename the file with it. So I wrote the beginning, and every time I try to run it I get this:

Error:
PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'2' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'3' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'40' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'43' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'46' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'49' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'52' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'55' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'58' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'61' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'64' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'67' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'70' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'73' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'76' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'79' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'82' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'85' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'88' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'91' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'94' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'97' b'0' [pdf.py:1666] 
PdfReadWarning: Superfluous whitespace found in object header b'100' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'103' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'106' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'109' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'112' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'115' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'118' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'121' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'124' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'127' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'130' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'133' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'136' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'139' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'142' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'145' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'148' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'151' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'154' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'157' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'160' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'163' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header 
b'166' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'169' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'172' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'175' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'178' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'181' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'184' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'187' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'190' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'193' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'196' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'199' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'202' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'205' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'208' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'211' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'214' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'217' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'220' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'223' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'226' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'229' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'232' b'0' [pdf.py:1666] PdfReadWarning: Superfluous 
whitespace found in object header b'235' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'238' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'241' b'0' [pdf.py:1666] PdfReadWarning: Superfluous whitespace found in object header b'38' b'0' [pdf.py:1666]
I tried returning it as a string, adding .strip() to the end of it, and removing the whitespace in a few similar ways, but nothing seems to work. I opened the file in Word and it had nothing in the headers. What is wrong here?
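For context, the lines in the output are labelled PdfReadWarning, so they go through Python's warnings machinery rather than being exceptions, and calling .strip() on the extracted text cannot affect them. A minimal sketch of filtering such a message with the standard warnings module follows; DemoReadWarning and noisy_read are hypothetical stand-ins so the example runs without PyPDF2, and whether silencing the message is appropriate for a given PDF is an assumption.

```python
import warnings


class DemoReadWarning(UserWarning):
    """Hypothetical stand-in for PyPDF2's PdfReadWarning (keeps the sketch PyPDF2-free)."""


def noisy_read():
    # Simulates a reader that warns about an object header but still returns text.
    warnings.warn(
        "Superfluous whitespace found in object header b'1' b'0'",
        DemoReadWarning,
    )
    return "page text"


def quiet_read():
    # Ignore only warnings whose message starts with this text;
    # the filter is undone when the with-block exits.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message="Superfluous whitespace")
        return noisy_read()


# Record warnings instead of printing them, to compare the two calls.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud = noisy_read()    # emits one warning
    quiet = quiet_read()   # the same warning is filtered out

print(loud, quiet, len(caught))
```

Both calls return the same text; only the first one leaves a recorded warning, which is the point: the message is a side channel, not part of the extracted string.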
#2
Hello,
I am facing the same issue with certain PDFs.
I tried converting the PDFs to PDF/A with this tool:
https://www.pdftron.com/pdf-tools/pdfa-converter/
But when I tried to extract the text, it was completely blank.

Is there any other alternative?

I am trying to extract certain words from the PDF and put them in a list.
If there is a better way of doing this, kindly let me know. Here is my code.

import PyPDF2


def read_data_from_pdf_using_pypdf(key_phrases_list, file_name):
    print("File Name:", file_name)
    final_data_list_unchecked = []
    # open the pdf file ("object" renamed so it no longer shadows the builtin)
    reader = PyPDF2.PdfFileReader(file_name)

    # get number of pages
    num_pages = reader.getNumPages()
    print("Number of pages", num_pages)
    for page_number in range(num_pages):
        # compare numbers with ==, not "is"
        if page_number == 1:
            print("Please wait...")
        # get page object
        page_obj = reader.getPage(page_number)
        # extract text from page object
        text = page_obj.extractText()
        # split the text into whitespace-separated words
        page_text_list = text.split()
        print("Length of list is", len(page_text_list))

        # loop through all elements in the key phrase list
        for key_phrase in key_phrases_list:
            # collect every word on the page that starts with the key phrase
            for word in page_text_list:
                if word.startswith(key_phrase):
                    print(word)
                    final_data_list_unchecked.append(word)
    # remove duplicates from the list (dict keeps insertion order on Python 3.7+)
    final_data_list_checked = list(dict.fromkeys(final_data_list_unchecked))
    return final_data_list_checked
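
The matching step in the function above does not depend on PyPDF2, so it can be pulled out and tried on plain text. The following is a sketch only: find_words_with_prefixes and the sample string are hypothetical, and unlike the nested loops above it returns matches in document order rather than grouped by key phrase.

```python
def find_words_with_prefixes(text, key_phrases):
    """Return the unique words in text that start with any of key_phrases."""
    words = text.split()
    # str.startswith accepts a tuple and checks every prefix in one call
    prefixes = tuple(key_phrases)
    matches = [word for word in words if word.startswith(prefixes)]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(matches))


# Hypothetical sample text standing in for one page of extracted PDF text
sample = "SIG_A1 voltage SIG_B2 SIG_A1 current TMP_7"
print(find_words_with_prefixes(sample, ["SIG_", "TMP_"]))
# → ['SIG_A1', 'SIG_B2', 'TMP_7']
```

Testing this part on a known string makes it easier to tell whether an empty result comes from the matching logic or from the PDF yielding no text at all.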
#3
Update:
Hello, I temporarily found a way around this:
import textract and use pdftotext.
Note that this is also a bit buggy, so you may have to install the poppler binaries and run pdftotext from the command line.

https://stackoverflow.com/questions/1838...9#53960829
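
For the command-line route, a rough sketch of calling poppler's pdftotext from Python through subprocess is shown below. It assumes the pdftotext binary is installed and on PATH; the function names are hypothetical, and this calls the binary directly rather than going through the library mentioned above.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    # "-" as the output file tells pdftotext to write the text to stdout;
    # -layout asks it to keep the physical layout of the page.
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text_with_pdftotext(pdf_path):
    # Fail early with a readable message if poppler is not installed.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install the poppler utilities")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With poppler installed, extract_text_with_pdftotext("test.pdf") would return the text of the whole file as one string, which can then be split into words as in the earlier post.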