Python Forum
Data extraction from (multiple) MS Word file(s) in python
#1
Hello everyone!

I'm learning Python as part of my thesis project, so I'm still inexperienced; I've only just started. (I do have some experience with other programming languages such as C, Java, and C#.)

The first things I need to do in this project are the following:
I have a bunch of MS Word files (.docx), and I need to extract all the text from them so I can feed it to spaCy for annotation with named entity recognition.

First I tried to figure out how to extract text from a single Word document. python-docx looked like the easiest option, so I did it like this:
import docx

def getText(filename):
    # Open the .docx file with python-docx
    doc = docx.Document(filename)
    fullText = []
    # Collect the text of every paragraph in the document body
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
(which I found in this source: link)

But I wanted to test whether this actually works, so I added this to my program:

def main():
    txt = getText('worddocument.docx')
    print(txt)
As a result, this doesn't print anything or give me any output, and I have no idea what I'm actually doing wrong or whether this is the right way to extract text from a Word document.
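
A likely cause, assuming the script contains nothing beyond the two snippets above: main() is defined but never called, so Python reaches the end of the file without running getText or print. The sketch below adds the call at the bottom of the file. The file-existence check, the default path argument, and moving the docx import inside getText are illustrative additions, not part of the original code.

```python
from pathlib import Path


def getText(filename):
    # python-docx is imported inside the function (an illustrative choice) so
    # the rest of the script still loads if the package is not installed.
    import docx

    doc = docx.Document(filename)
    # One line of output per paragraph in the document body.
    return '\n'.join(para.text for para in doc.paragraphs)


def main(path='worddocument.docx'):
    # Relative paths are resolved against the current working directory.
    if not Path(path).exists():
        print(f'{path} not found in {Path.cwd()}')
        return
    print(getText(path))


# Defining main() does not run it; this call does.
if __name__ == '__main__':
    main()
```

If the output is still missing some text after this change, it may be because doc.paragraphs only covers paragraphs in the document body; text inside tables is exposed separately through doc.tables.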

I also need to figure out the best way to extract text from multiple documents. I've already googled the topic, but I don't have a plan yet. I'll probably need a for loop that calls this getText function on every document in a given folder to extract all the text, but that's for later. First things first.
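
For the multi-document case, here is a minimal sketch of the for loop described above, assuming all the .docx files sit directly in one folder. The names extract_folder and reader are hypothetical and introduced only for illustration; reader defaults to getText and exists so the loop can be tried out without real Word files.

```python
from pathlib import Path


def getText(filename):
    # python-docx is imported inside the function (an illustrative choice) so
    # the loop below can be exercised without the package installed.
    import docx

    doc = docx.Document(filename)
    return '\n'.join(para.text for para in doc.paragraphs)


def extract_folder(folder, reader=getText):
    """Return a dict mapping each .docx file name in `folder` to its text."""
    texts = {}
    # sorted() makes the processing order predictable.
    for path in sorted(Path(folder).glob('*.docx')):
        # Word creates temporary "~$name.docx" lock files while a document
        # is open; they are not real documents, so skip them.
        if path.name.startswith('~$'):
            continue
        # Pass a plain string path to the reader.
        texts[path.name] = reader(str(path))
    return texts
```

Each value in the returned dictionary is a plain string, which is the kind of input a spaCy pipeline takes. Files in subfolders are not picked up by this sketch.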

Does anyone have any ideas or recommendations?

Messages In This Thread
Data extraction from (multiple) MS Word file(s) in python - by Den0st - Sep-15-2019, 11:47 AM
