Python Forum
Docx Convert Word Header to Body
I am trying to scrape information from PDFs, and I need to collect a lot of information about each paragraph. So far, the only way I have found to extract everything I want is to convert the PDF into a Word document and then go over each paragraph with python-docx (other methods do not provide as much information).

Everything seems to work as intended, except for one annoying part of the PDF-to-DOCX conversion: text that sits too close to the edge of the page sometimes gets converted, seemingly at random, into Word headers and footers. These parts cannot be read while iterating over the paragraphs of the body. python-docx does have a header/footer reader, but it does not work on every document, including the kind I am reading, because of this issue: https://github.com/python-openxml/python...issues/868. So I can't just iterate over the header objects in each section object.
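
One possible way to at least read that text without going through the header objects is sketched below. It relies on the fact that a .docx file is a ZIP archive, and it assumes the header and footer parts follow Word's usual naming (word/header1.xml, word/footer1.xml, and so on), which a converter is not strictly required to do. The function name read_header_footer_paragraphs is hypothetical, and the sketch returns plain text only, with no formatting information.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by the w:p and w:t elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumption: header/footer parts use Word's usual names,
# e.g. word/header1.xml or word/footer2.xml.
PART_PATTERN = re.compile(r"^word/(header|footer)\d*\.xml$")


def read_header_footer_paragraphs(docx_path):
    """Return {part_name: [paragraph_text, ...]} for header/footer parts.

    docx_path may be a file path or a binary file-like object.
    Hypothetical helper, not part of python-docx.
    """
    result = {}
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            if not PART_PATTERN.match(name):
                continue
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            # Each w:p is a paragraph; its text is spread over w:t runs.
            for p in root.iter(f"{{{W_NS}}}p"):
                text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
                if text.strip():  # skip empty paragraphs
                    paragraphs.append(text)
            result[name] = paragraphs
    return result
```

Calling read_header_footer_paragraphs("converted.docx") (the file name is just an example) would give a dictionary keyed by part name. This does not say which section or page each header belongs to, so it only helps if the text itself is what matters.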

The other option would be to somehow convert the headers/footers into part of the body text. There is probably a way to do this in Word, which I could then call with win32com, but I can't find it.

So, is there a way to convert a Word header/footer into part of the body text, with or without Python?


Messages In This Thread
Docx Convert Word Header to Body - by CaptainCsaba - Jun-02-2021, 09:06 AM
RE: Docx Convert Word Header to Body - by Larz60+ - Jun-02-2021, 10:21 AM
RE: Docx Convert Word Header to Body - by Larz60+ - Jun-02-2021, 01:25 PM

