Python Forum
Docx Convert Word Header to Body
PDF is by nature a strange file format, and a file can take many forms.
Some text may actually be just that, text, while other text may be an image. There may or may not be headers.
The same goes for data: it might be stored in tables (this is what we want), or it might be an image of someone's handwriting.

In my opinion, it's one of the worst methods available for storing data, yet you run across it again and again and must be able, as best you can, to extract the data into a useful format.

There are many commercial and free packages available for achieving the desired conversion.
Each does better on some types of data than on others.
pdfminer.six (many other packages are built as wrappers around it), PyPDF2, and PyPDF4 are among the more commonly used of the 279 (as of today) PDF packages available here.

I've done a lot of work with U.S. Government, national, state, and local jurisdiction data, and have had to find a different solution for many of those sources. Some files can only be processed manually; these are usually the handwritten ones that cannot be interpreted by OCR methods.

Start with the common ones (like pdfminer.six and PyPDF2 and/or PyPDF4). There are tutorials available for these. If they don't work, start looking for alternatives.
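
A rough sketch of that "try one, then the next" approach might look like the following. It assumes pdfminer.six and a PyPDF2 release of 2.0 or later (the one that provides PdfReader) are installed; the helper names extract_with_pdfminer, extract_with_pypdf2 and first_usable_text are made up for illustration and are not part of either package.

```python
from typing import Callable, Iterable, Optional


def extract_with_pdfminer(path: str) -> str:
    """Extract text with pdfminer.six (assumed installed)."""
    # Imported lazily so a missing package only affects this extractor.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_with_pypdf2(path: str) -> str:
    """Extract text with PyPDF2 (assumes version 2.0+, which has PdfReader)."""
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    # Image-only pages may yield no text, so fall back to an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def first_usable_text(
    path: str, extractors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Return the first non-blank result from the extractors, or None.

    Hypothetical helper: each extractor is tried in order; one that raises
    (missing package, unreadable file) or returns only whitespace is skipped.
    """
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any failure just means "try the next one"
            print(f"{extractor.__name__} failed: {exc}")
            continue
        if text and text.strip():
            return text
    return None
```

Calling first_usable_text("your_file.pdf", [extract_with_pdfminer, extract_with_pypdf2]) returns the first non-blank text it gets. If it returns None, the file probably has no text layer (scanned pages or handwriting), and that is the point where OCR or manual entry comes in.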

Good luck. PDFs are easy only when the format cooperates.


Messages In This Thread
Docx Convert Word Header to Body - by CaptainCsaba - Jun-02-2021, 09:06 AM
RE: Docx Convert Word Header to Body - by Larz60+ - Jun-02-2021, 10:21 AM
RE: Docx Convert Word Header to Body - by Larz60+ - Jun-02-2021, 01:25 PM

