Python Forum
Data extraction from (multiple) MS Word file(s) in python
Hello everybody,

As part of a project I need to build a corpus that will later be used to train an NLP model for named entity recognition etc.
I'm a bit stuck right now because I only have basic knowledge of Python, or of programming in general.

What I have available: A folder with a bunch of Word documents (.docx)
What I did: a lot of research on working with documents in Python, "walking" through the files in a folder, (nested) for loops, lists, dictionaries, how to access attributes of the documents, what JSON is, etc. Of course I've also tried to write some working code already:

# In PyCharm:
import os
import docx

path = '/path/to/FolderWithDocuments'
for filename in os.listdir(path):
    if filename.endswith('.docx'):
        # Keep the folder in `path`; reassigning it here would break the join for the next file
        doc_path = os.path.join(path, filename)
This code "walks" over the Word documents in the folder. From these Word documents I have to extract the raw text as well as the title (by the way, the title is not the first line in the document; it is the file name of the document itself). I've been told I should figure out how to do this with python-docx, but other tips and ideas are also welcome. Ideally I'd end up with a JSON file that can later be used for NLP.
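Since the title is just the file name, it can be taken from the directory listing without opening the document at all. A minimal sketch, assuming the helper name `docx_titles` (hypothetical, not part of python-docx) and assuming Word's temporary lock files (names starting with `~$`) should be skipped:

```python
import os

def docx_titles(folder):
    """Map each document title (file name without extension) to its full path."""
    titles = {}
    for filename in sorted(os.listdir(folder)):
        # Skip anything that is not a .docx, and Word's "~$..." lock files.
        if not filename.lower().endswith('.docx') or filename.startswith('~$'):
            continue
        # splitext separates 'Report.docx' into ('Report', '.docx').
        title, _ext = os.path.splitext(filename)
        titles[title] = os.path.join(folder, filename)
    return titles
```

Calling `docx_titles('/path/to/FolderWithDocuments')` would give a dictionary whose keys are the titles and whose values are the paths to open with `docx.Document`.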

If I open Python in a terminal on Linux and write:
import docx
d=docx.Document('NameOfDocument.docx')
dir(d)
texts=[p.text for p in d.paragraphs]
texts
Then my output is written like this:
['','','some tekst in the document', '', 'some more text', ........ , '\t\t\tStW some more text', '','','','']
At this point I really have no clue which steps I need to take to get from this to a nice corpus for NLP. I have the feeling that I need to make a list of lists, then make a dictionary from these lists, and finally convert this dictionary into a JSON file somehow, but I'm really stuck here. Maybe there is an easier way, because I'm getting confused about building a list of lists and about how to do that while looping over this batch of documents...
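One possible shape for this, offered as a sketch rather than the only way: one dictionary per document (title plus cleaned text), collected in a single list, which the standard `json` module can serialize directly. The helper names `clean_paragraphs` and `make_record` and the record keys `title` and `text` are assumptions chosen for illustration, and the paragraph list is made up to resemble the output shown above.

```python
import json

def clean_paragraphs(texts):
    """Strip surrounding whitespace (tabs included) and drop empty paragraphs."""
    return [t.strip() for t in texts if t.strip()]

def make_record(title, texts):
    """Build one corpus entry: a dict holding the title and the joined text."""
    return {'title': title, 'text': '\n'.join(clean_paragraphs(texts))}

# Hypothetical paragraph list, shaped like the python-docx output above.
texts = ['', '', 'some text in the document', '', 'some more text',
         '\t\t\tStW some more text', '', '']

# The corpus is a plain list of dicts: append one record per document.
corpus = [make_record('NameOfDocument', texts)]

# Serialize the whole corpus to a JSON string.
json_text = json.dumps(corpus, ensure_ascii=False, indent=2)
print(json_text)
```

Inside the loop over the folder, `texts` would come from `[p.text for p in docx.Document(...).paragraphs]` as in the terminal session, and `json.dump(corpus, f)` with an open file `f` would write the result to disk instead of building a string.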

I hope I've given enough information and that everything in my explanation is clear.
Thank you very much in advance for taking the time to look at this :-)
Messages In This Thread
Text-extraction from multiple doc's using python-docx for corpus NLP - by Den0st - Sep-19-2019, 10:07 PM
