extract data inside a table from a .doc file

aster · (This post was last modified: Feb-26-2018, 11:28 PM by aster.)

I have more than 4000 hateful Microsoft Office Word .doc files from which I need to extract some data (both numbers and words, though in most cases the fields are empty) and later convert it to a single .csv file where every row corresponds to a single .doc file.

Here is a screenshot of one of those files; underlined in blue are some examples of what I need to extract:
[Image: Immagine.jpg]

I uploaded the file here in case someone wants to test something:
https://ufile.io/vt2zq

Since my experience with Python is quite limited, I thought it would be useful to come here before starting, to gather some ideas and hints.
From Google I know that my options for working with this file format are few:
1) textract
2) convert the .doc to .docx with antiword and then use docx2txt

My idea was to:
1) open the folder and read the first .doc file
2) extract the data and handle the many empty values with a try/except
3) go to the next file

Right now I don't have any idea how to get to any of those points. What would you do in my situation? How would you open the files? How would you proceed?
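
The loop described above could look roughly like the following. This is only a minimal sketch: `extract_fields` is a hypothetical placeholder for whatever ends up reading a single file, and `build_csv` and its parameter names are made up for illustration. It uses only the standard library and writes an empty cell wherever a file has no value for a column.

```python
import csv
from pathlib import Path


def extract_fields(path):
    """Hypothetical placeholder: return the values pulled from one file."""
    return {"file": path.name}


def build_csv(folder, out_path, extractor=extract_fields, pattern="*.doc"):
    """Write one CSV row per file in `folder` that matches `pattern`."""
    rows = [extractor(p) for p in sorted(Path(folder).glob(pattern))]

    # Collect every column name seen, in first-seen order,
    # so files with missing values still fit the same table.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        # restval="" fills the columns a given file did not provide.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `build_csv("some_folder", "result.csv")` would then produce one row per file; the real work is in replacing `extract_fields` with something that can actually read the documents.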

**Larz60+** · Feb-27-2018, 12:23 AM

So what have you tried so far?

If you're looking for someone to do it for you, it should be posted under jobs.
The thread can be moved if requested.

aster · Feb-28-2018, 12:58 PM

No, I am not asking to have the job done; I will do it myself.

I was asking for suggestions because I am unable to even open a single file correctly.

So far I have tried to:
1) convert it using antiword -> didn't work
2) open it with textract; I discovered that it uses antiword, so -> didn't work
3) convert the files with soffice --convert-to odt *.doc; better than before, but -> didn't work
4) another 3 or 4 methods found on Google, but none of them worked

But now I think I have found the problem: I need to take some data from the header,
and Word treats it as being "outside the margin", so none of these methods "see" it.

if someone wants to try something: test file

**Larz60+** · (This post was last modified: Feb-28-2018, 01:29 PM by Larz60+.)

see if you can use this: https://github.com/python-openxml/python-docx
Apparently you can only write, not read, with this package.

This is an evolving post, sorry for that.
this looks like the best bet: https://pypi.python.org/pypi/pywin32/223

**buran** · Feb-28-2018, 01:41 PM

If they were docx files you could use python-docx. I have used it (in this case Larz60+ is not correct that it is just for writing).
If they are doc files, one option is to use pywin32, as suggested by Larz60+.

aster · (This post was last modified: Feb-28-2018, 01:51 PM by aster.)

Larz60+, you don't have to say sorry; instead, thank you both for answering!

I tried python-docx but I wasn't able to read anything; as Larz said, I think it is mainly for writing docx files. The only example I found in the documentation is this one, but maybe I just wasn't able to get it working:

import os, io
from docx import Document

folderPatch = os.getcwd()
filePatch = folderPatch+"/test.docx"

with open(filePatch, 'rb') as f:
    source_stream = io.BytesIO(f.read())
    print(source_stream)
document = Document(source_stream)
source_stream.close()

About pywin32: I am having some trouble installing it.

**buran** · Feb-28-2018, 01:58 PM

EDIT: Now I see you need info from the header. This is not implemented yet in the python-docx package.
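
Since a .docx file is a zip archive of XML parts, header text can in principle be read with the standard library alone. The following is a minimal sketch, assuming the headers are stored in parts named `word/header1.xml`, `word/header2.xml` and so on (the usual naming, though not guaranteed); `read_header_text` is a name made up for this illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_header_text(docx_path):
    """Return the non-empty header paragraphs of a .docx as a list of strings."""
    paragraphs = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            # Assumption: header parts follow the word/header*.xml naming.
            if not (name.startswith("word/header") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for para in root.iter(W_NS + "p"):
                # A paragraph's text is split across runs; join the pieces.
                text = "".join(t.text or "" for t in para.iter(W_NS + "t"))
                if text.strip():
                    paragraphs.append(text)
    return paragraphs
```

Paragraphs inside header tables are picked up as well, because `iter` walks all descendants. Text boxes or other nested content may need extra handling.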

aster · (This post was last modified: Feb-28-2018, 02:06 PM by aster.)

Solved.

I need to convert .doc to .docx with soffice --convert-to docx *.doc
then

import os
import docx2txt

folderPatch = os.getcwd()
filePatch = folderPatch+"/test.docx"

text = docx2txt.process(filePatch)
print (text)
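
For a whole folder, the conversion step could also be driven from Python. This is a rough sketch, assuming `soffice` is on the PATH and that its `--headless`, `--convert-to` and `--outdir` options behave as on a standard LibreOffice install; the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def conversion_command(doc_paths, out_dir):
    """Build the soffice call that converts the given .doc files to .docx."""
    return ["soffice", "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir)] + [str(p) for p in doc_paths]


def convert_folder(folder, out_dir):
    """Convert every .doc in `folder`; assumes `soffice` is on the PATH."""
    docs = sorted(Path(folder).glob("*.doc"))
    if docs:  # skip the call entirely when there is nothing to convert
        subprocess.run(conversion_command(docs, out_dir), check=True)
    return len(docs)
```

With several thousand files the command line may become too long on some systems, so passing the files in smaller batches is probably safer.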

**buran** · (This post was last modified: Feb-28-2018, 02:08 PM by buran.)

Here is an example that reads the test file once it has been converted to docx:

from docx import Document

document = Document('test.docx')
tbl = document.tables[0]
for rw in tbl.rows:
    if rw.cells[0].text.startswith('CONCLUSIONI:'):
        print(rw.cells[0].text)

Output:
CONCLUSIONI: 16. Quadro microcircolatorio con segni aspecifici ma evidenti compatibili con connettivopatia
22. Quadro microcircolatorio compatibile con Raynaud
42. Si consiglia visita reumatologica ed eseguire i seguenti esami di laboratorio:
 esame emocromocitometrico completo con formula leucocitaria, VES, PCR, Reuma Test, C3, C4, ANA, ANCA,ASMA, ENA Profile, Sideremia, ferritinemia.

aster · (This post was last modified: Mar-04-2018, 05:46 PM by aster.)

I am moving to the second part of my project: extracting the values I need from the file!

To do that I search the newly created string with str.find(), then I try to work out where my data starts and ends.
I made an example of what I am doing, but I am sure there is a better way to handle this and that my coding style is not very Pythonic.
Meanwhile I am learning about the re library!

import re
#https://docs.python.org/3/library/re.html
#r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline
#check special characters and put a "\" before them!

text = "Lorem ipsum dolor sit amet, consectetur adipisci elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

#print (text)
debug = True
#re.sub(pattern_to_find, replace_with, text_input, count=0, flags=0)
text = re.sub(r'(\. |, )', '.\n', text)  # raw string, so "\." is passed to re unchanged

print (text)

#data that i need to find
name = "incidunt"
surname = "consectetur"
birth = "veniam" 

#index of them
ix_name = text.find(name)+len(name)
ix_surname = text.find(surname)+len(surname)
ix_birth = text.find(birth)+len(birth)

if debug:
    print("start of data:")
    print("name position:", str(ix_name))
    
#my data
data_name = text[ix_name:ix_name+20]
data_surname = text[ix_surname:ix_surname+20]
if debug:
    print(data_name)

As always, any suggestions for a better approach, libraries, or examples are very welcome :D
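
One possible alternative to the str.find() arithmetic is a single regular expression that captures whatever follows a label on the same line. This is only a sketch: the labels in the usage note are hypothetical, since the real ones depend on the converted text, and `field_after` is a name made up for illustration.

```python
import re


def field_after(label, text):
    """Return the rest of the line after `label`, or '' if missing or empty."""
    # re.escape protects labels containing special characters such as "." or "(".
    pattern = re.escape(label) + r"[ \t]*:?[ \t]*([^\n]*)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else ""
```

For example, with `text = "Nome: Mario\nCognome:\n"`, `field_after("Nome", text)` gives `"Mario"` and `field_after("Cognome", text)` gives an empty string, so empty fields need no try/except.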
