Hi,
This time I came across a massive collection of legacy .doc files (not .docx).
Unlike .xml files, which you can read easily with a choice of tools,
.doc files prove to be difficult.
Textract is reputed to do the job, but you need all sorts of extra software to make it work.
I tried the code below, and it works in principle, but I cannot find a way to close the Word document,
so it opens hundreds of them simultaneously.
import win32com.client

word = win32com.client.DispatchEx("Word.Application")
word.visible = False  # does not seem to work, because word shows
wb = word.Documents.Open(docpath)
doc = word.ActiveDocument
text = doc.Range().Text

Anybody know what to close, and how: word? doc? wb?
All I need is the text, never mind any font or formatting, just the text.
thx,
Paul
It is more important to do the right thing than to do the thing right. (P. Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
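
For the closing question in the post, here is a minimal sketch of one possible approach, assuming pywin32 and Word are installed on Windows. The object returned by Documents.Open is the document itself, so `wb` and `doc` in the snippet above refer to the same thing: closing that one object and then quitting the application at the end should be enough. The function names `read_doc_text` and `read_all_docs` are hypothetical, and this has not been tried on the actual files.

```python
import os


def read_doc_text(word, docpath):
    """Open one .doc in a running Word instance, return its text, then close it."""
    # Positional arguments: FileName, ConfirmConversions=False, ReadOnly=True.
    doc = word.Documents.Open(os.path.abspath(docpath), False, True)
    try:
        # Word marks paragraph ends with "\r"; convert them to newlines.
        return doc.Range().Text.replace("\r", "\n")
    finally:
        doc.Close(False)  # False = do not save changes


def read_all_docs(docpaths):
    """Hypothetical helper: read many files with a single Word instance."""
    # pywin32 is Windows only; importing here keeps the module loadable elsewhere.
    import win32com.client

    word = win32com.client.DispatchEx("Word.Application")
    word.Visible = False
    try:
        return {path: read_doc_text(word, path) for path in docpaths}
    finally:
        word.Quit()  # shut Word down even if a file fails to open
```

Reusing one Word instance for the whole batch and closing each document in a `finally` block means at most one document is open at a time, even when a file raises an error.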