Python Forum
Docx Convert Word Header to Body
#1
I am trying to scrape information from PDFs, and I need to collect a lot of information about each paragraph. Right now it seems the only way to extract everything I want is to convert the PDF into a Word document and then go over each paragraph with python-docx (other methods do not provide as much information).

Everything seems to work as intended. However, an annoying part of the PDF-to-docx conversion is that some parts of the text get converted to headers and footers in Word, seemingly at random, if they are too close to the edge of the page. You can't read these parts while iterating over the paragraphs of the body. python-docx has a header/footer reader, but it does not work on all documents, including the kind I read, because of this issue: https://github.com/python-openxml/python...issues/868. So I can't just iterate over the header objects in each section object.

The other method to solve this would be to somehow convert the headers/footers into parts of the body text. There is probably a way in Word to do this, which I would be able to call with win32com, but I can't find it.

So, is there a way to convert a Word header/footer into part of the body text, with or without Python?
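
One possible route that does not need Word at all, sketched here under assumptions: a .docx file is a zip archive, and headers and footers are normally stored as separate XML parts next to the body. The snippet below reads those parts directly with the standard library. The function name `read_header_footer_paragraphs` is made up for illustration, and the part names (`word/header1.xml`, `word/footer1.xml`, ...) are the conventional layout rather than something guaranteed for every file.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by <w:p>, <w:r>, <w:t> elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_header_footer_paragraphs(docx_path):
    """Return {part_name: [paragraph_text, ...]} for each header/footer part."""
    result = {}
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            base = name.rsplit("/", 1)[-1]
            # Assumption: header/footer parts live under word/ and are named
            # header<N>.xml or footer<N>.xml (the conventional layout).
            if not name.startswith("word/") or not base.endswith(".xml"):
                continue
            if not (base.startswith("header") or base.startswith("footer")):
                continue

            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for p in root.iter(f"{{{W_NS}}}p"):
                # Join the text of every <w:t> inside the paragraph.
                text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
                if text.strip():
                    paragraphs.append(text)
            result[name] = paragraphs
    return result
```

This only recovers the text, not which page or section each header belongs to, and text inside nested text boxes may show up twice (once for the outer paragraph, once for the inner one). Run-level formatting sits in the same XML under each `<w:r>`, so the same loop could be extended if the text alone is not enough.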
#2
PDF is by nature a strange file structure, and a file can take many forms.
Some text may actually be just that, text; other text may be an image. There may or may not be headers.
The same goes for data: it might be stored in tables (this is what we want), or it might be an image of someone's handwriting.

In my opinion, it's one of the worst methods available for storing data, yet you run across it again and again and must be able, as best as possible, to extract the data into a useful format.

There are many commercial and free packages available for achieving the desired conversion.
Each does better on some types of data than on others.
pdfminer.six (many other packages are built as wrappers around it), PyPDF2, and PyPDF4 are among the more commonly used of the 279 (as of today) PDF packages available here.

I've done a lot of work with U.S. Government, national, state, and local jurisdiction data, and have had to find a different solution for many of them. Some instances can only be done manually; these are usually the handwritten ones that cannot be interpreted by OCR methods.

Start with the common ones (like pdfminer.six, PyPDF2, and/or PyPDF4). There are tutorials available for these. If they don't work, start looking for alternatives.

Good luck. PDFs are easy only when the format cooperates.
#3
I am fairly used to pdfminer and have also worked with PyPDF2 and pdfrw.

I have to extract each paragraph in the PDF with as much data as possible: font, size, color, bold, italic, and so on. This is so that later I can filter out the ones I want. Let's say I want to select all the text in this PDF that is size 16 in the Times New Roman font, has an RGB color of (255,255,255), and is bold. I have already created this part; I just have to include the headers somehow.
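
To make the kind of filter described above concrete, here is a minimal sketch, not the actual code from this project. It assumes run objects shaped like python-docx runs (`run.text`, `run.bold`, `run.font.name`, `run.font.size.pt`, `run.font.color.rgb`); the helper names `run_matches` and `matching_text` are hypothetical.

```python
def run_matches(run, font_name=None, size_pt=None, rgb=None, bold=None):
    """Return True if a python-docx-style run meets every given criterion.

    Criteria left as None are ignored.
    """
    font = run.font
    if font_name is not None and font.name != font_name:
        return False
    if size_pt is not None:
        # Sizes are Length-like objects exposing .pt; None means the size
        # is inherited from a style and is treated here as "no match".
        if font.size is None or font.size.pt != size_pt:
            return False
    if rgb is not None:
        color = font.color.rgb  # tuple-like RGB value, or None if unset
        if color is None or tuple(color) != tuple(rgb):
            return False
    # run.bold is tri-state (True / False / None); None counts as not bold.
    if bold is not None and bool(run.bold) != bold:
        return False
    return True


def matching_text(paragraphs, **criteria):
    """Collect the text of every run in `paragraphs` that matches."""
    return [
        run.text
        for paragraph in paragraphs
        for run in paragraph.runs
        if run_matches(run, **criteria)
    ]
```

With python-docx this would be called along the lines of `matching_text(Document(path).paragraphs, font_name="Times New Roman", size_pt=16, rgb=(255, 255, 255), bold=True)`. Note that values inherited from a paragraph or character style come back as None on the run, so this sketch only matches formatting applied directly to the run.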

pdfminer is almost good enough, but you can't extract the color and many other properties. I would love for it to work, but there is just not enough data that you can extract from the LTTextLine and LTChar objects. The same is true of PyPDF2 and pdfrw.

Converting to XML or HTML causes data loss, so those are not good alternatives either.

This is why docx is what I am looking for: it is easy to navigate and contains the most data in an easy-to-get format.

I have tried 3 methods of conversion:

1. The pdf2docx pip package is slow and unreliable.
2. Calling Word through win32com is good but really slow.
3. Calling Adobe through win32com is good and also much faster than the others. This is the one I am using.

The problem is that all of them randomly create headers from the top of each page. This is what I am trying to solve. Is there no way to include the headers as part of the body text? Even a way to do it in Word would help, since it could then be called and automated. Does nobody have any idea?

EDIT: I am only using non-scanned PDFs, so these things should be extractable.
#4
I'm not familiar with any of the MS Windows software, as I'm on Linux 99% of the time.
Someone else will probably know more.

