Thread: Search text in PDF and output its page number. (Python Forum, https://python-forum.io, General Coding Help, /thread-35995.html)

Search text in PDF and output its page number. - atomxkai - Jan-07-2022

Hello, I'm trying to use this script to search for a word or text in a PDF and output the text and its page number. Please note: I'm using Windows.

    import PyPDF2
    import re

    pdfFileObj=open(r'C:\python\document.pdf',mode='rb')
    pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
    number_of_pages=pdfReader.numPages

    pages_text=[]
    words_start_pos={}
    words={}

    searchwords=['Earth']

    with open('Results.csv', 'w') as f:
        f.write('{0},{1}\n'.format("Page Number", "Search"))
        for word in searchwords:
            for page in range(number_of_pages):
                print(page)
                pages_text.append(pdfReader.getPage(page).extractText())
                words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
                words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    print(page, words[page][i])

Problem: the results are not showing. https://imgur.com/a/yAD97mN

RE: Search text in PDF and output its page number. - cubangt - Jan-07-2022

When you run the logic, are your variables being populated? For example, you have print(page); does that print out a number? What if you add a print(word) right below the for statement and see whether it contains anything? Since you are looping through those searchwords and page numbers, if those are blank you won't get any results.

RE: Search text in PDF and output its page number. - BashBedlam - Jan-07-2022

Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words, so only OCR would be able to retrieve those.

RE: Search text in PDF and output its page number. - atomxkai - Jan-08-2022

(Jan-07-2022, 11:21 PM) BashBedlam Wrote: Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.

I learned that it mostly works on Mac rather than on Windows? But I will try a different PDF. Thanks.

RE: Search text in PDF and output its page number. - atomxkai - Jan-08-2022

(Jan-08-2022, 12:26 AM) atomxkai Wrote: (Jan-07-2022, 11:21 PM) BashBedlam Wrote: Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.

RE: Search text in PDF and output its page number. - atomxkai - Jan-08-2022

(Jan-07-2022, 07:19 PM) cubangt Wrote: When you run the logic, are your variables being populated?

I tried adding print(word), but there is no output, only the header. It seems I'll have to try a different PDF.

RE: Search text in PDF and output its page number. - snippsat - Jan-08-2022

pdfplumber may be a better tool for this and is more up to date; it's been 5-6 years since PyPDF2 was updated, and it has stuff that is non-Pythonic, like CamelCase🐫 usage everywhere. So I can write a quick test for this task, using this sample pdf.

    import pdfplumber

    pdf_file = "sample.pdf"
    search_word = 'end'
    with pdfplumber.open(pdf_file) as pdf:
        pages = pdf.pages
        for page_nr, pg in enumerate(pages, 1):
            content = pg.extract_text()
            if search_word in content:
                print(f'<{search_word}> found at page number <{page_nr}> '\
                      f'at index <{content.index(search_word)}>')
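
A note on the snippet above: `search_word in content` is case-sensitive, and it would raise a TypeError if `extract_text()` ever returned None for a page. Whether it does so for image-only pages is an assumption here and may depend on the pdfplumber version. A minimal sketch that guards against both follows; `search_pages` and `search_pdf` are hypothetical helper names, not part of pdfplumber.

```python
def search_pages(page_texts, search_word, ignore_case=True):
    """Yield (page_number, index) for the first hit on each page.

    page_texts: iterable of strings (or None) in page order.
    """
    needle = search_word.lower() if ignore_case else search_word
    for page_nr, content in enumerate(page_texts, 1):
        content = content or ""  # treat "no text on this page" as empty
        haystack = content.lower() if ignore_case else content
        index = haystack.find(needle)
        if index != -1:
            yield page_nr, index


def search_pdf(pdf_file, search_word):
    """Run search_pages over a PDF, using the pdfplumber calls from the post."""
    import pdfplumber  # imported lazily so search_pages works without it

    with pdfplumber.open(pdf_file) as pdf:
        texts = [pg.extract_text() for pg in pdf.pages]
    return list(search_pages(texts, search_word))
```

Calling `search_pdf("sample.pdf", "end")` would return a list of (page number, index) pairs, with pages counted from 1 as in the post.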

RE: Search text in PDF and output its page number. - BashBedlam - Jan-08-2022

@snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?

RE: Search text in PDF and output its page number. - snippsat - Jan-08-2022

(Jan-08-2022, 03:07 PM) BashBedlam Wrote: @snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?

You can split up the content and then loop over that list. Example:

    import pdfplumber

    pdf_file = "sample.pdf"
    search_word = 'more'
    with pdfplumber.open(pdf_file) as pdf:
        pages = pdf.pages
        for page_nr, pg in enumerate(pages, 1):
            content = pg.extract_text().split()
            for word in content:
                if search_word in word:
                    print(f'<{search_word}> found on page {page_nr}')

E.g. collect the hits in a list and count the words found.

    import pdfplumber

    pdf_file = "sample.pdf"
    search_word = 'more'
    lst = []
    with pdfplumber.open(pdf_file) as pdf:
        pages = pdf.pages
        for page_nr, pg in enumerate(pages):
            content = pg.extract_text().split()
            for word in content:
                if search_word in word:
                    lst.append(search_word)
    print(f'Search word <{search_word}> found {len(lst)} times in {pdf_file}')
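
The split-and-loop approach above reports each hit but loses the character index that the first example printed. If the positions of the second and third occurrences matter, one possible alternative is `re.finditer`, which the original PyPDF2 script already used. This is only a sketch: `find_occurrences` and its `whole_word` option are hypothetical names, and the input is assumed to be a list of page strings such as those returned by `pg.extract_text()`.

```python
import re


def find_occurrences(page_texts, search_word, whole_word=False):
    """Return {page_number: [start_index, ...]} for every match on every page."""
    pattern = re.escape(search_word)  # treat the search word literally
    if whole_word:
        pattern = rf"\b{pattern}\b"   # skip hits inside longer words
    regex = re.compile(pattern, re.IGNORECASE)

    hits = {}
    for page_nr, content in enumerate(page_texts, 1):
        starts = [m.start() for m in regex.finditer(content or "")]
        if starts:
            hits[page_nr] = starts
    return hits


def count_occurrences(hits):
    """Total number of matches across all pages."""
    return sum(len(starts) for starts in hits.values())
```

With `whole_word=True`, a word such as "Furthermore" no longer counts as a hit for "more", which the substring test `search_word in word` would count.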

RE: Search text in PDF and output its page number. - atomxkai - Jan-10-2022

(Jan-08-2022, 10:09 AM) snippsat Wrote: pdfplumber may be a better tool for this and more updated,

THIS actually works!!! Awesome, thank you so much. Genius. Can I use this? I'm still hoping I can fix the original code, though.
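
On fixing the original code: one likely cause of the empty results is that the script searches for 'Earth' (capital E) in `pages_text[page].lower()`, so the pattern can never match lowercased text. This assumes the PDF contains extractable text at all, which BashBedlam's OCR remark suggests checking first. A minimal sketch of the matching and CSV step with that mismatch removed follows; `find_matches` and `write_results` are hypothetical helper names, and the page texts are assumed to have been extracted already.

```python
import csv
import io
import re


def find_matches(pages_text, searchwords):
    """Return (page_number, matched_text) rows, matching case-insensitively."""
    rows = []
    for word in searchwords:
        # Match against the unmodified text so the original casing is kept.
        regex = re.compile(re.escape(word), re.IGNORECASE)
        for page_index, text in enumerate(pages_text):
            for match in regex.finditer(text or ""):
                rows.append((page_index + 1, match.group()))
    return rows


def write_results(rows, f):
    """Write the header and result rows to an open text file object."""
    writer = csv.writer(f)
    writer.writerow(["Page Number", "Search"])
    writer.writerows(rows)
```

With the names from the original post, this could be used as `pages_text = [pdfReader.getPage(p).extractText() for p in range(number_of_pages)]`, followed by `write_results(find_matches(pages_text, searchwords), f)` inside the `with open('Results.csv', 'w', newline='') as f:` block. Passing `newline=''` follows the csv module's recommendation and avoids blank lines between rows on Windows.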