Jan-10-2023, 04:28 PM
Hi,
This time I came across a massive batch of legacy .doc files (not .docx).
Unlike .xml files, which you can read easily with a choice of tools,
.doc files prove to be difficult.
Textract is reputed to do the job, but it needs all sorts of extra software to work.
I tried this, and it works in principle, but I cannot find a way to close the Word document.
So it opens hundreds simultaneously.
import win32com.client

word = win32com.client.DispatchEx("Word.Application")
word.visible = False  # does not seem to work, because word shows
wb = word.Documents.Open(docpath)
doc = word.ActiveDocument
text = doc.Range().Text

Anybody know what and how to close: word ? doc ? wb ?
All I need is the text, never mind any font or formatting, just the text.
thx,
Paul
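
For what it's worth, here is a minimal sketch of one way to close things, assuming pywin32 and a local Word installation. Documents.Open already returns the document object, so that object can be closed with Close once its text has been read, and the Word application can be shut down with Quit after the last file. The helper names read_doc_text and extract_texts are made up for illustration, and the win32com import sits inside the function only so the file still loads on a machine without pywin32. I have not tested this against hundreds of files, so treat it as a starting point rather than a finished solution.

```python
import os


def read_doc_text(word, path):
    """Open one document in a running Word instance, return its text, close it."""
    # Open(FileName, ConfirmConversions, ReadOnly): read-only, no conversion prompt.
    # An absolute path is passed because Word may not share Python's working directory.
    doc = word.Documents.Open(os.path.abspath(path), False, True)
    try:
        return doc.Range().Text
    finally:
        # 0 is wdDoNotSaveChanges: close this document without saving.
        doc.Close(0)


def extract_texts(paths):
    """Return {path: text} for all paths, reusing a single Word instance."""
    # Imported here (an illustrative choice) so the module loads without pywin32.
    import win32com.client

    word = win32com.client.DispatchEx("Word.Application")
    word.Visible = False
    try:
        return {p: read_doc_text(word, p) for p in paths}
    finally:
        # Quit the Word process itself once all documents are done.
        word.Quit()
```

Usage would be something like texts = extract_texts(list_of_doc_paths). Because each document is closed in a finally block, only one document should be open at any time, even if reading one of them raises an error.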