Extracting parts of paragraphs from Word documents using the python-docx library & lists

Den0st · Nov-06-2019, 12:07 AM

The documents I'm working with have the following general layout: [Image: WWk87zV]

Background of my problem
As part of a text-classification project I need to extract text from Word documents, because these documents have to be annotated at paragraph level. The end result of the code I'm working on now will therefore be a jsonlines file filled with JSON objects for each document.

What I already did
In one part of my code I extract all the text under each heading separately and save it in a list. That works fine, but the text under heading 1 (see picture) actually consists of 2 paragraphs. I need to split these paragraphs as well, so that each one can be written to its own JSON string and annotated separately later on. I have the feeling that I need a "list of lists" here, together with the .splitlines() method, but I'm making a mess of it because I'm inexperienced with Python and nested lists are still a bit confusing to me.
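One possible way to build the "list of lists" mentioned above is sketched here. It assumes each heading's text block is a single string in which the paragraph breaks were marked with "\n". The name `split_blocks` and the sample data are hypothetical and only serve as illustration; nothing here comes from python-docx.

```python
def split_blocks(blocks):
    """Turn a list of text blocks into a list of lists of paragraphs."""
    result = []
    for block in blocks:
        # str.splitlines() breaks the block at line boundaries such as "\n".
        # strip() removes the padding spaces, and the `if` drops the empty
        # strings that leading, trailing, or repeated "\n" would produce.
        parts = [part.strip() for part in block.splitlines() if part.strip()]
        result.append(parts)
    return result


# Hypothetical sample: the first block holds two paragraphs, the second one.
blocks = ["\n First paragraph.\n Second paragraph.\n", " Only paragraph.\n"]
for heading_nr, parts in enumerate(split_blocks(blocks), start=1):
    for part in parts:
        print(heading_nr, part)
```

The outer list keeps one entry per heading and each inner list holds that heading's paragraphs, so the link between a paragraph and its heading is not lost.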

I'll start by showing the function I wrote to extract just the "big piece" of text under each heading of a document. My code doesn't write to a jsonlines file yet, but I don't have any issues with that part. For now I use print() to test my output until it is right.

import re
from docx import Document

def writeParas(filename):
    # `path` (the directory prefix of the documents) is assumed to be defined elsewhere
    heading = ""                    # Heading text collected while iterating over the document
    paragraphs = []                 # One text block per heading
    part_Paragraphs = []            # Not used in this part of the code, this is where my problems are...
    paraNr = 0                      # Number of the python-docx paragraph reached so far
    p_text = ""                     # Text block collected under the current heading
    startNum = re.compile(r'\A\d+') # A heading is detected by its leading number
    d = Document(path + filename)

    for para in d.paragraphs:
        paraNr += 1

        if paraNr == 1:
            title = para.text       # The first line is the document title

        elif startNum.match(para.text):
            # New heading: store the text block collected under the previous one
            if p_text != "":
                paragraphs.append(p_text)
            p_text = ""
            heading = ""

            for run in para.runs:
                if run.bold and run.underline:
                    heading += " " + run.text
                else:
                    heading = "This is not a heading"

        else:
            if para.text == "":
                p_text += "\n"
            elif len(para.text) > 6:   # Skips short filler lines such as "none"
                p_text += " " + para.text

    # No heading follows the last text block, so store it after the loop
    if p_text != "":
        paragraphs.append(p_text)
    return paragraphs

Description of the Python code:
With python-docx you can read each line of a document separately. python-docx calls these lines paragraphs. When I write "paragraphs" myself, for example for the list I declared, I actually mean the whole text block under a heading. I hope this is not too confusing. Anyway, in each document the first line is the document title, so I save that one first. Second, I check whether a heading is detected. (The document styles were edited manually, so I can't use

if para.style.name == "Heading 1":

because the documents don't have these style attributes). The headings are also stored, but that's not the important part. After that I concatenate the text under each heading if the line has more than 6 characters (to filter out lines that only say "none"). The very last text block is stored in a different way: normally a block is stored when the next heading is detected, but after the last block no heading follows. And yes, if a line is empty I append a "\n" to the paragraph text. My reason for doing this is to make the text block easier to split later on, because the first text block actually consists of 2 smaller text blocks which I need to save separately.
The problem is that this "\n" concatenation also adds a "\n" wherever there is an empty line that I don't actually need, for example between a heading and the first text block, or between a text block and the next heading.
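A minimal sketch of one way around the stray "\n" characters: instead of appending a marker for every empty line, treat empty lines and headings purely as separators and only store a paragraph when there is text to store. The sketch assumes `lines` is a plain list of strings, for example `[para.text for para in d.paragraphs]` without the title line. `group_paragraphs` and `HEADING_START` are hypothetical names, and the "more than 6 characters" filter is left out for brevity.

```python
import re

# Same heading test as in the function above: the line starts with a number.
HEADING_START = re.compile(r"\A\d+")


def group_paragraphs(lines):
    """Group lines into paragraphs; blank lines and headings act as separators."""
    groups = []    # finished paragraphs, in document order
    current = []   # lines of the paragraph being collected
    for line in lines:
        text = line.strip()
        if text == "" or HEADING_START.match(text):
            # A separator ends the current paragraph, if there is one.
            # Consecutive separators add nothing, so no empty entries appear.
            if current:
                groups.append(" ".join(current))
                current = []
        else:
            current.append(text)
    if current:  # store the last paragraph; no heading follows it
        groups.append(" ".join(current))
    return groups


# Hypothetical document content, title line already removed.
lines = ["", "1. Heading one", "", "Text A.", "Text A continued.", "",
         "Text B.", "", "2. Heading two", "Text C."]
print(group_paragraphs(lines))
```

As in the original function, any text line that happens to start with a digit would be mistaken for a heading, so the heading test may need to be stricter for real documents.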

From the example Word document layout that is given in the picture at the beginning of my post, the desired output would be a jsonlines textfile which looks like this:

{"text": "Text that belongs to this heading, Text that belongs to this heading. ...... of a newline character", "metadata": "doc_id"}
{"text": "Other text that belongs to the same heading...... different paragraph.", "metadata": "doc_id"}
{"text": "More text...", "metadata": "doc_id"}
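Producing lines in this shape could look roughly like the sketch below, which uses only the standard-library `json` module. The function name `to_jsonlines` is hypothetical, and `paragraphs` is assumed to be a flat list of already separated paragraph strings.

```python
import json


def to_jsonlines(paragraphs, doc_id):
    """Return one JSON object per paragraph, one object per line."""
    return "\n".join(
        # ensure_ascii=False keeps accented characters readable in the file
        json.dumps({"text": text, "metadata": doc_id}, ensure_ascii=False)
        for text in paragraphs
    )


print(to_jsonlines(["More text..."], "doc_id"))
```

To store the result, the returned string can be written to a file opened with `open(..., "w", encoding="utf-8")`.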

The code I've written to extract the full text block under a heading probably looks a bit messy, but I've tried many different approaches and this one was (finally) the one that works pretty well. Feel free to give feedback on that code as well, but my main question is how to separate the paragraphs within a text block and how to get rid of the extra "\n" characters. Could anyone give me a little help with that?

Thank you very much :)
