Python Forum
Extracting parts of paragraphs from word documents using python-docx library & lists
About the type of documents I'm working with: the general layout looks like this: [Image: WWk87zV]

Background of my problem
As part of a text-classification project I need to extract text data from Word documents, because I'll have to annotate these documents at paragraph level. The end result of the code I'm working on now will be a jsonlines file with one JSON object per paragraph of each document.

What I already did.
In one part of my code I extract all the text located under each heading separately and save this text in a list. That works fine, but the problem is that the text under heading 1 (see picture) actually consists of 2 paragraphs. I need to split these paragraphs as well, so that each one can be written as a separate JSON string and later be annotated separately. I have the feeling that I need to work with a "list of lists" here, as well as with the .splitlines() method, but I'm making a mess of it because I'm inexperienced with Python and working with lists of lists is still a bit confusing for me.
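
A minimal sketch of the "list of lists" idea mentioned above, assuming each text block is already one string in which the paragraphs are separated by newline characters. The name split_block and the sample strings are made up for illustration; they are not part of the code in this post.

```python
def split_block(block):
    """Split one text block into its paragraphs, dropping empty pieces."""
    # splitlines() cuts at every line break; strip() removes the leading
    # spaces, and the filter discards pieces that are empty after stripping.
    return [part.strip() for part in block.splitlines() if part.strip()]


# Hypothetical text blocks, one string per heading.
blocks = [
    "\n First paragraph under heading 1.\n Second paragraph under heading 1.\n",
    " Only paragraph under heading 2.",
]

# One inner list per heading, one string per paragraph.
nested = [split_block(block) for block in blocks]
```

Because empty pieces are filtered out, extra newline characters at the start or end of a block do not produce empty paragraphs.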

I'll start by showing the function that I wrote to extract just the "big piece" of text under each heading of a document. My code doesn't write to a jsonlines file yet, but that part is not a problem for me. I'm using print() calls to test my output until it is right...

import re
from docx import Document

def writeParas(filename):
     heading = ""                    # Heading text is collected here while iterating over the document
     paragraphs = []                 # One entry per text block found under a heading
     part_Paragraphs = []            # Not used in this part of the code; this is where my problems are...
     paraNr = 0                      # Counts which line (docx paragraph) of the file has been reached
     p_text = ""                     # Text block that is currently being built
     startNum = re.compile(r'\A\d+') # A heading is detected by its leading number
     d = Document(path + filename)   # path is the folder prefix, defined elsewhere in my code
     amount_of_para = len(d.paragraphs)

     for para in d.paragraphs:
          paraNr += 1

          if paraNr == 1:
               title = para.text     # The first line is the document title

          elif startNum.match(para.text):
               heading = ""
               if p_text != "":
                    paragraphs.append(p_text)   # Store the text block of the previous heading
               p_text = ""

               for run in para.runs:
                    if run.bold and run.underline:
                         heading += " " + run.text
                    else:
                         heading = "This is not a heading"

          else:
               if para.text == "":
                    p_text += "\n"
               elif len(para.text) > 6:
                    p_text += " " + para.text
                    if paraNr == amount_of_para - 1:
                         paragraphs.append(p_text)   # Last text block: no heading follows it

     return paragraphs
Description of the Python code:
With python-docx you can read each line in a document separately. These lines are called paragraphs in python-docx. But when I write "paragraphs" myself, for example for the declared list, I actually mean the whole text block under a heading. I hope this is not too confusing. Anyway, in each document the first line is the document title, so I save that one first. Second, I check whether a heading is detected. (The document styles were edited manually, so I can't use
if para.style.name == "Heading 1":
because the documents contain no such style names). Headings are also stored, but that's not the important part. After that I concatenate the text under each heading if the line has more than 6 characters (to filter out the "none"). The very last text block is appended in a different way (normally that happens when the next heading is detected), because after the last block no further heading is detected. And yes, if the line is empty I concatenate a "\n" to the paragraph text. My reason for doing this is to make it easier to split the text block later on, because the first text block actually consists of 2 smaller text blocks which I'll need to save separately.
The problem is that this also adds a "\n" wherever there is an empty line that I don't actually need, for example between a heading and its first text block, or after a text block and before a new heading.
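
One possible way around the stray "\n" characters is to never concatenate them at all and instead treat an empty line only as "the current paragraph ends here". The sketch below assumes the lines have already been read into a plain list of strings (for example the para.text values) and that the first entry is the document title. The name group_paragraphs is hypothetical, the bold/underline check on the runs is left out, and so is the "more than 6 characters" filter, so a body line that starts with a digit would be treated as a heading here.

```python
import re

HEADING_RE = re.compile(r"\A\d+")  # same "starts with a number" test as in the post


def group_paragraphs(lines):
    """Return a list of lists: one inner list of paragraphs per heading."""
    sections = []    # sections[i] holds the paragraphs under the i-th heading
    current = None   # paragraphs of the section being read; None before the first heading
    buffer = []      # lines of the paragraph that is being built

    def flush():
        # Close the paragraph in progress, if there is one. Text that appears
        # before the first heading is dropped.
        if buffer and current is not None:
            current.append(" ".join(buffer))
        buffer.clear()

    for text in lines[1:]:          # lines[0] is assumed to be the document title
        text = text.strip()
        if HEADING_RE.match(text):
            flush()
            current = []            # a new heading starts a new section
            sections.append(current)
        elif text == "":
            flush()                 # an empty line only ends the current paragraph
        else:
            buffer.append(text)
    flush()                         # the last paragraph has no heading after it
    return sections
```

With this structure an empty line between a heading and its first text block, or before the next heading, has no effect, because there is no paragraph in progress to close at that point. The final flush() also takes care of the last text block without comparing line counters.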

For the example Word document layout shown in the picture at the beginning of my post, the desired output would be a jsonlines text file which looks like this:

{"text": "Text that belongs to this heading, Text that belongs to this heading. ...... of a newline character", "metadata": "doc_id"}
{"text": "Other text that belongs to the same heading...... different paragraph.", "metadata": "doc_id"}
{"text": "More text...", "metadata": "doc_id"}
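
A minimal sketch of producing this output format, assuming the paragraphs are already available as a list of lists (one inner list per heading) and that one doc_id string is used for the whole document. The function name to_jsonl is made up for illustration; json.dumps comes from the standard library.

```python
import json


def to_jsonl(sections, doc_id):
    """Return jsonlines text: one JSON object per paragraph, one object per line."""
    rows = []
    for paragraphs in sections:
        for text in paragraphs:
            # ensure_ascii=False keeps accented characters readable in the file.
            rows.append(json.dumps({"text": text, "metadata": doc_id}, ensure_ascii=False))
    return "".join(row + "\n" for row in rows)


# Hypothetical input: two paragraphs under the first heading, one under the second.
example = to_jsonl([["First paragraph.", "Second paragraph."], ["More text..."]], "doc_id")
```

The returned string can be written to a file opened with encoding="utf-8". Letting json.dumps build each line takes care of escaping quotes and backslashes inside the paragraph text.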

The code that I've written to extract the full text block under a heading probably looks a bit messy, but I've tried many different approaches and this one was (finally) the one that works pretty well. Feel free to give some feedback on the code above as well, but the main issue is how to 'print' the separate paragraphs from a text block and also get rid of the extra \n characters. Could anyone maybe give me a little help with that?

Thank you very much :)