Data extraction from (multiple) MS Word file(s) in python

Den0st · (This post was last modified: Sep-20-2019, 03:07 AM by buran.)

Hello everybody,

As part of a project, I need to build a corpus that will later be used to train an NLP model for named entity recognition, etc.
I'm a bit stuck right now because of my basic knowledge of Python, or let's say of programming in general.

What I have available: a folder with a bunch of Word documents (.docx).
What I did: a lot of research on working with documents in Python, "walking" through the files in a folder, (nested) for loops, lists, dictionaries, how to access attributes of the documents, what JSON is, etc. Of course, I've also tried to write some working code already:

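# In PyCharm:
import os
import docx

path = '/path/to/FolderWithDocuments'
for filename in os.listdir(path):
    if filename.endswith('.docx'):
        # Use a separate variable for the full file path; overwriting
        # `path` here would break the join for the next file.
        file_path = os.path.join(path, filename)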

This code "walks" over the Word documents in the folder. From these Word documents I have to extract the raw text as well as the title (note that the title is not the first line in the document; it is the file name of the document itself). I've been told I should figure out how to do this with python-docx, but other tips and ideas are also welcome. Ideally I'd end up with a JSON file that can later be used for NLP.
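
To make the "title is the file name" part concrete, here is a rough sketch of what I have in mind. The helper names `title_from_path` and `read_paragraphs` are made up for illustration, and I'm assuming the title is simply the file name without its folder and `.docx` extension:

```python
from pathlib import Path


def title_from_path(path):
    """Return the file name without folder or extension (assumed to be the title)."""
    return Path(path).stem


def read_paragraphs(path):
    """Return the text of every paragraph in a .docx file as a list of strings."""
    # python-docx is imported here so the title helper also works without it.
    import docx
    return [p.text for p in docx.Document(str(path)).paragraphs]


print(title_from_path('/path/to/FolderWithDocuments/NameOfDocument.docx'))
# prints: NameOfDocument
```

Note that `paragraphs` only covers body paragraphs, so text inside tables would need separate handling.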

If I open Python in a Linux terminal and type:

import docx
d=docx.Document('NameOfDocument.docx')
dir(d)
texts=[p.text for p in d.paragraphs]
texts

Then my output looks like this:

['','','some tekst in the document', '', 'some more text', ........ , '\t\t\tStW some more text', '','','','']
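
The empty strings and leading tabs in that list could probably be cleaned up with something like the sketch below. `clean_paragraphs` is a hypothetical helper name, and the sample list is only modelled on the output above, not real data:

```python
def clean_paragraphs(texts):
    """Strip surrounding whitespace (tabs included) and drop empty paragraphs."""
    stripped = (t.strip() for t in texts)
    return [t for t in stripped if t]


# Sample input modelled on the paragraph list shown above.
sample = ['', '', 'some text in the document', '', 'some more text',
          '\t\t\tStW some more text', '', '']
print(clean_paragraphs(sample))
# prints: ['some text in the document', 'some more text', 'StW some more text']
```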

At this point I really have no clue which steps I need to take to get from this to a nice corpus for NLP. I have the feeling that I need to make a list of lists, then build a dictionary from these lists, and finally convert this dictionary into a JSON file somehow, but I'm really stuck here. Maybe there is an easier way, because I'm getting confused about building a list of lists and how to do that while looping over this batch of documents...
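
Perhaps a list of dictionaries, one per document, would be simpler than a list of lists. The following is only a sketch of that idea: `build_corpus`, the `"title"` and `"text"` keys, and joining paragraphs with newlines are all my own assumptions, not something a particular NLP tool requires:

```python
import json


def build_corpus(documents):
    """Turn (title, paragraphs) pairs into a list of {"title", "text"} dicts."""
    corpus = []
    for title, paragraphs in documents:
        # Drop empty paragraphs and join the rest into one block of raw text.
        text = "\n".join(p.strip() for p in paragraphs if p.strip())
        corpus.append({"title": title, "text": text})
    return corpus


# Hypothetical input: in practice each pair would come from one .docx file.
documents = [
    ("NameOfDocument", ["", "some text", "\t\tsome more text", ""]),
    ("OtherDocument", ["only one paragraph"]),
]
corpus = build_corpus(documents)
print(json.dumps(corpus, ensure_ascii=False, indent=2))
```

Writing the result to disk would then be a matter of calling `json.dump(corpus, f, ensure_ascii=False, indent=2)` on a file opened with `encoding="utf-8"`.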

I hope I've given enough information and that my explanation is clear.
Thank you very much in advance for taking the time :-)
