Jun-02-2021, 09:06 AM
I am trying to scrape information from PDFs. I need to collect a lot of information about each paragraph. Right now, the only way I have found to extract everything I want is to convert the PDF into a Word document and then go over each paragraph with python-docx (other methods do not provide as much information).
Everything seems to work as intended. However, an annoying part of the PDF-to-docx conversion is that text too close to the edge of the page sometimes gets converted into headers and footers in Word. You can't read these parts while iterating over the paragraphs of the body. python-docx has a header/footer reader, but it does not work on all documents, including the kind I am reading, because of this issue: https://github.com/python-openxml/python...issues/868. So I can't just iterate over the header objects in each section object.
The other method to solve this would be to somehow convert the headers/footers into parts of the body text. There is probably a way in Word to do this, which I would be able to call with win32com, but I can't find it.
So, is there a way to convert a Word header/footer into part of the body text, with or without Python?
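For context, here is a rough sketch of a fallback that only reads the header/footer text and does not move it into the body. It assumes the text lives in the usual `word/header*.xml` and `word/footer*.xml` parts of the .docx archive and uses only the standard library, so it does not go through python-docx's section objects. The function name `read_header_footer_paragraphs` is made up for illustration, and I have not checked it against converter output with text boxes or other nested content.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:p (paragraph) and w:t (text run) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Assumption: header/footer parts follow the usual word/headerN.xml naming.
PART_RE = re.compile(r"word/(header|footer)\d*\.xml$")


def read_header_footer_paragraphs(docx_path):
    """Return {part_name: [paragraph_text, ...]} for each header/footer part."""
    result = {}
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            if not PART_RE.match(name):
                continue
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for paragraph in root.iter(W_NS + "p"):
                # Join all text runs that belong to this paragraph.
                text = "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))
                if text.strip():
                    paragraphs.append(text)
            result[name] = paragraphs
    return result
```

Calling `read_header_footer_paragraphs("converted.docx")` (a placeholder file name) would give one list of paragraph strings per header or footer part. It does not tell me which section or page each part belongs to, which is why I would still prefer to get the text into the body.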