Python Forum
extract data inside a table from a .doc file
#11
Another option just came to mind: if you convert your .doc file to .docx, you can open it (it's just a zip archive), extract the header1.xml file, and parse it. I think that will be easier to parse. That covers the header; for the table in the body, I have already shown you my code.

This is what the XML file looks like.
Reply
#12
Hmm, this is the first time I have looked inside an .xml file, and it doesn't seem too complex: my data is inside the <w:t></w:t> elements. But I don't understand how this would be easier for me to work with. I would rather work with the .txt file, which is simpler to read.

By the way, I think I managed to extract all my data! This week I will try it on a real file.
This is my approach: I lower-cased the whole string so I don't have to worry about case sensitivity. Then, for each value of interest, I build a list with its start and end index: the start is just past the label located with str.find(), and the end is the next "\n" after it (I am quite sure this could lead to errors).

debug = True

text = """Lorem ipsum dolor sit amet, 
consectetur adipisci elit, 
sed eiusmod tempor incidunt ut
labore et dolore magna aliqua.""".lower()
 
# fake labels that I need to find
name = "lorem"
surname = "consectetur"
birth = "labore"


# [start, end] index of each value: start is just past the label,
# end is the next newline (str.find returns -1 if there is none)
ix_name = [text.find(name)+len(name), text.find('\n', text.find(name))]
ix_surname = [text.find(surname)+len(surname), text.find('\n', text.find(surname))]
ix_birth = [text.find(birth)+len(birth), text.find('\n', text.find(birth))]

if debug:
    print("\n\nINDEX\n")
    print("name position:", str(ix_name))
    print("surname position:", str(ix_surname))
    print("birth position:", str(ix_birth))

# extracting my data
data_name = text[ix_name[0]:ix_name[1]]
data_surname = text[ix_surname[0]:ix_surname[1]]
data_birth = text[ix_birth[0]:ix_birth[1]]

if debug:
    print("\n\nDATA\n")
    print("name:",data_name)
    print("surname:",data_surname)
    print("birth:",data_birth)
From this:
Quote:Lorem ipsum dolor sit amet,
consectetur adipisci elit,
sed eiusmod tempor incidunt ut
labore et dolore magna aliqua.

I got this:
Quote:INDEX

name position: [5, 28]
surname position: [40, 56]
birth position: [94, -1]


DATA

name: ipsum dolor sit amet,
surname: adipisci elit,
birth: et dolore magna aliqua

Would you have done this differently?
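One possible variation, as a minimal sketch: in the output above the birth end index is -1 because no newline follows the last line, so the slice drops the final character (the closing "." is missing). The helper below, whose name value_after is made up for illustration, treats a missing newline as "end of text", returns None when the label is not found at all, and strips the surrounding whitespace.

```python
def value_after(text, label):
    """Return the rest of the line that follows `label`, or None if absent."""
    start = text.find(label)
    if start == -1:
        return None  # label not present in the text
    start += len(label)
    end = text.find("\n", start)
    if end == -1:
        end = len(text)  # last line: no trailing newline to stop at
    return text[start:end].strip()


text = """Lorem ipsum dolor sit amet,
consectetur adipisci elit,
sed eiusmod tempor incidunt ut
labore et dolore magna aliqua.""".lower()

for label in ("lorem", "consectetur", "labore"):
    print(label, "->", value_after(text, label))
```

Putting the lookup in one function also avoids repeating the same find() expression for every field.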
Reply
#13
I haven't tried it, so I'm not sure it will work,
but it seems like you should be able to use lxml to parse the docx.
Reply

