Python Forum
Superfluous whitespace found?
#2
Hello,
I am facing the same issue with certain PDFs.
I tried converting the PDFs to PDF/A with this tool:
https://www.pdftron.com/pdf-tools/pdfa-converter/
But when I tried to extract the text from the converted file, the result was completely blank.

Is there any other alternative?

I am trying to extract certain words from the PDF and collect them in a list.
If there is a better way of doing this, kindly let me know. Here is my code.

import PyPDF2


def read_data_from_pdf_using_pypdf(key_phrases_list, file_name):
    print "File Name:", file_name
    final_data_list_unchecked = []
    # open the pdf file
    # (named pdf_reader so the built-in `object` is not shadowed)
    pdf_reader = PyPDF2.PdfFileReader(file_name)

    # get number of pages
    num_pages = pdf_reader.getNumPages()
    print "Number of pages ", num_pages
    for page_number_counter in range(num_pages):
        # compare by value; `is` tests identity and should not be used for ints
        if page_number_counter == 1:
            print "Please wait..."
        # print "Page Number:", page_number_counter
        # get page object
        page_obj = pdf_reader.getPage(page_number_counter)
        # extract text from page object
        text = page_obj.extractText()
        # split texts and add them to a list
        page_text_list = text.split()

        # print page_text_list
        print "Length of list is", len(page_text_list)

        # loop through all key phrases
        for key_phrase in key_phrases_list:
            # print "Key word search:", key_phrase
            # loop through the words on the page and search for the key phrase
            for loop_text in page_text_list:
                if loop_text.encode('utf-8').startswith(key_phrase):
                    print loop_text
                    # append the matching word to the list
                    final_data_list_unchecked.append(loop_text)
    # print "Total number of signals with duplicates:", len(final_data_list_unchecked)
    # print final_data_list_unchecked
    # remove duplicates from list
    final_data_list_checked = list(dict.fromkeys(final_data_list_unchecked))
    # print "Total number of signals:", len(final_data_list_checked)
    # print final_data_list_checked
    return final_data_list_checked
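
As a rough sketch only, the matching and duplicate-removal steps above could be pulled out into a small helper that works on plain strings, so it can be tried without a PDF. The function name `filter_words_by_prefix` and the sample words (`SIG_A1`, `CAN_X9`, ...) are made up for illustration and are not part of PyPDF2. Note that this version returns matches in the order they appear in the text, while the loops above group them by key phrase.

```python
def filter_words_by_prefix(words, key_phrases):
    """Return the words that start with any key phrase.

    Duplicates are dropped and first-seen order is kept.
    """
    # str.startswith accepts a tuple of prefixes, so no inner loop is needed
    prefixes = tuple(key_phrases)
    seen = set()
    matches = []
    for word in words:
        if word.startswith(prefixes) and word not in seen:
            seen.add(word)
            matches.append(word)
    return matches


# hypothetical sample text standing in for the output of extractText()
page_text = "SIG_A1 value SIG_B2 other SIG_A1 CAN_X9"
result = filter_words_by_prefix(page_text.split(), ["SIG_", "CAN_"])
```

Here `result` holds the three distinct matching words; the repeated `SIG_A1` is kept only once. The helper does nothing about a PDF whose extracted text is blank; it only covers the filtering step.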
Messages In This Thread
Superfluous whitespace found? - by CaptainCsaba - Feb-13-2019, 01:54 PM
RE: Superfluous whitespace found? - by ak52 - Mar-19-2020, 07:42 AM
RE: Superfluous whitespace found? - by ak52 - Mar-19-2020, 09:04 AM
