Mar-19-2020, 07:42 AM
Hello,
I am facing the same issue with certain PDFs.
I tried converting the PDFs to PDF/A using this tool:
https://www.pdftron.com/pdf-tools/pdfa-converter/
But when I tried to extract the text, the result was completely blank.
Is there any alternative?
I am trying to extract certain words from the PDF and put them in a list.
If there is a better way of doing this, kindly let me know. Here is my code.
import PyPDF2


def read_data_from_pdf_using_pypdf(key_phrases_list, file_name):
    print "File Name:", file_name
    final_data_list_unchecked = []
    # open the pdf file (renamed from "object" to avoid shadowing the builtin)
    pdf_reader = PyPDF2.PdfFileReader(file_name)
    # get number of pages
    num_pages = pdf_reader.getNumPages()
    print "Number of pages ", num_pages
    for page_number_counter in range(num_pages):
        # compare with ==, not "is": identity checks on ints are unreliable
        if page_number_counter == 1:
            print "Please wait..."
        # get page object
        page_obj = pdf_reader.getPage(page_number_counter)
        # extract text from page object
        text = page_obj.extractText()
        # split the text into words
        page_text_list = text.split()
        print "Length of list is", len(page_text_list)
        # loop through all elements in the key phrase list
        for key_phrase in key_phrases_list:
            # search the page words for the current key phrase
            for loop_text in page_text_list:
                if loop_text.encode('utf-8').startswith(key_phrase):
                    print loop_text
                    # append the matching word to the list
                    final_data_list_unchecked.append(loop_text)
    # remove duplicates from list
    final_data_list_checked = list(dict.fromkeys(final_data_list_unchecked))
    return final_data_list_checked
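To check the keyword search separately from the PDF reading, the matching step could be pulled into its own function. This is only a rough sketch, assuming Python 3 (my code above is Python 2); the name find_prefixed_words and the sample words are made up for illustration and are not from my real data.

```python
def find_prefixed_words(words, key_phrases):
    """Return the words that start with any key phrase.

    Duplicates are dropped and the first-seen order is kept.
    (Hypothetical helper, not part of PyPDF2.)
    """
    # str.startswith accepts a tuple of prefixes, so no inner loop is needed
    prefixes = tuple(key_phrases)
    seen = set()
    matches = []
    for word in words:
        if word.startswith(prefixes) and word not in seen:
            seen.add(word)
            matches.append(word)
    return matches


# Made-up sample text standing in for one page of extracted text
page_text = "SIG_A1 foo SIG_B2 bar SIG_A1 CLK_1"
result = find_prefixed_words(page_text.split(), ["SIG_", "CLK_"])
print(result)
```

If this works on a plain string but the PDF still gives nothing, the problem is in the text extraction and not in the search. Note that this version returns matches in the order they appear in the text, while my original loop groups them by key phrase.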