Python Forum
extract data inside a table from a .doc file - Printable Version


RE: extract data inside a table from a .doc file - buran - Mar-04-2018

Another option just came to mind: if you convert your .doc file to .docx, you can open it (it's just a zip archive), extract the header1.xml file and parse that. I think it will be easier to parse. That covers the header; for the table in the body, I have already shown you my code.
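A minimal sketch of that idea, using only the standard library. The function name `header_texts` and the in-memory sample archive are made up for illustration, and the part name `word/header1.xml` is an assumption: a real document may store its header under a different name (for example `word/header2.xml`), so check the archive contents first.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace behind the "w:" prefix in .docx parts
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def header_texts(docx_file, part="word/header1.xml"):
    """Return the text of every <w:t> element in one part of a .docx archive."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read(part))
    return [node.text or "" for node in root.iter("{%s}t" % W_NS)]


# Tiny in-memory stand-in for a real .docx, so the sketch runs without a file.
sample_xml = (
    '<w:hdr xmlns:w="%s"><w:p><w:r><w:t>Name</w:t></w:r>'
    '<w:r><w:t>Lorem</w:t></w:r></w:p></w:hdr>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/header1.xml", sample_xml)

print(header_texts(buffer))  # ['Name', 'Lorem']
```

With a real file you would pass its path instead of the buffer, e.g. `header_texts("converted.docx")` (hypothetical file name).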

This is what the XML file looks like.


RE: extract data inside a table from a .doc file - aster - Mar-04-2018

Hmm, this is the first time I have looked inside an .xml file, and it doesn't seem too complex: my data is inside the <w:t></w:t> tags. But I don't understand how this would be easier for me to work with. I would rather work on the .txt file, which is simpler to read.

By the way, I think I managed to extract all my data! During this week I will try it on a real file.
This is my approach: I make the whole string lower case so I don't have to worry about case sensitivity. Then, for each value of interest, I build a list with its start and end index, using str.find() to locate the label and then the next "\n" after it (I am quite sure this could lead to errors).

debug = True

text = """Lorem ipsum dolor sit amet, 
consectetur adipisci elit, 
sed eiusmod tempor incidunt ut
labore et dolore magna aliqua.""".lower()
 
# placeholder labels whose values I need to find
name = "lorem"
surname = "consectetur"
birth = "labore" 


# [start, end] index of the value that follows each label
ix_name = [text.find(name)+len(name), text.find('\n', text.find(name))]
ix_surname = [text.find(surname)+len(surname), text.find('\n', text.find(surname))]
ix_birth = [text.find(birth)+len(birth), text.find('\n', text.find(birth))]

if debug:
    print("\n\nINDEX\n")
    print("name position:", str(ix_name))
    print("surname position:", str(ix_surname))
    print("birth position:", str(ix_birth))

# extract the data by slicing the text with those indexes
data_name = text[ix_name[0]:ix_name[1]]
data_surname = text[ix_surname[0]:ix_surname[1]]
data_birth = text[ix_birth[0]:ix_birth[1]]

if debug:
    print("\n\nDATA\n")
    print("name:",data_name)
    print("surname:",data_surname)
    print("birth:",data_birth)

From this:
Quote:Lorem ipsum dolor sit amet,
consectetur adipisci elit,
sed eiusmod tempor incidunt ut
labore et dolore magna aliqua.

I got this:
Quote:INDEX

name position: [5, 28]
surname position: [40, 56]
birth position: [94, -1]


DATA

name: ipsum dolor sit amet,
surname: adipisci elit,
birth: et dolore magna aliqua
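
One thing the output above shows: str.find() returns -1 when there is no "\n" after the label, so the slice for the last line stops one character early ("aliqua" has lost its final "."). It also returns -1 when the label itself is missing, which would give a meaningless start index. A rough sketch of a helper that handles both cases; the name `value_after` is made up for illustration, and stripping the surrounding whitespace is my own assumption about what is wanted:

```python
def value_after(text, label):
    """Return the rest of the line that follows `label`, or None if absent."""
    start = text.find(label)
    if start == -1:
        return None              # label not present at all
    start += len(label)
    end = text.find("\n", start)
    if end == -1:                # label is on the last line: no trailing newline
        end = len(text)
    return text[start:end].strip()


text = """Lorem ipsum dolor sit amet,
consectetur adipisci elit,
sed eiusmod tempor incidunt ut
labore et dolore magna aliqua.""".lower()

for label in ("lorem", "consectetur", "labore"):
    print(label, "->", value_after(text, label))
```

With this version the last value keeps its final character, and a missing label gives None instead of a wrong slice.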

Would you have done this in a different way?


RE: extract data inside a table from a .doc file - Larz60+ - Mar-04-2018

I'm not sure the following will work until I try it,
but it seems like you should be able to use lxml to parse the docx.