Python Forum
Error while parsing tables from docx file
#1
Hi all,
I am using Python 3.6 and the python-docx package to parse .docx files.
The code:

import re
from docx import Document

my_doc = Document(doc)

def extract(my_doc, w1, w2):
    tabdata = []
    for table in my_doc.tables:  # loop through all tables in the .docx file
        if re.search("My String", table.cell(0, 1).text, re.IGNORECASE):  # table of interest
            for row in table.rows:  # loop through all rows in the matching table
                for cell in row.cells:
                    tabdata.append(cell.text)  # collect the text of each cell

The same code gives errors for some document files but not for others. Each document contains a combination of text and tables, and I am trying to parse just the tables.
Some files that contain both text and tables parse without problems.
For certain other files, the error below shows up.
All the files are similar and contain the keywords and tables I am searching for with re.search(). The tables in the different files all have the same number of rows and columns.
The error does not show up if the document file contains only the tables and no other text or paragraphs.
I am unsure whether the issue is a corrupted docx file, characters in the docx file that my script cannot parse, or something missing from my script.

The error I am facing:

Traceback (most recent call last):
  File "my_python_script.py", line 579, in main_1
    extract(mld,w1,w2)
  File "my_python_script.py", line 128, in extract
    if re.search("My String", table.cell(0,1).text, re.IGNORECASE):
  File "$PYTHONPATH/python3.6/site-packages/docx/table.py", line 81, in cell
    return self._cells[cell_idx]
IndexError: list index out of range

Any help on this issue would be much appreciated!
#2
I see, does this make it better?
The input file:

My Document = some_doc.docx

import os
import re
from docx import Document

f = open("sometext.txt", "w")
def get_inputs():
    global my_doc, my_doc_file

    input_file = open('input.txt','r')                     #reading inputs text file
    names = input_file.readlines()
    input_file.close()
    for i in range(0,len(names)):
        if re.search(r'My document',names[i],re.IGNORECASE):
            pos = str(names[i]).rfind("=")
            my_doc = str(names[i][pos+1:]).strip()
    if os.stat(my_doc).st_size == 0:                       #checking the size of my_doc, if zero then message displays
        print("Empty Document! Please check and retry!")    
    my_doc_file = Document(my_doc)                         #reading the .docx file, throws error if it does not exist
    print(my_doc, my_doc_file)


def extract(my_doc):
    tlist = []
    tab_list = []
    #global my_doc, my_doc_file
    for table in my_doc.tables:                                        #looping through all tables in the .docx file
        if re.search("mystring", table.cell(0,1).text, re.IGNORECASE):            
            for row in table.rows:                                  #looping through all rows in the table under consideration
                for cell in row.cells:                                  #looping through all cells(grid cells are considered, not actual) in the row under consideration
                    tabdata = cell.text
                    tabdata = re.sub(r'\s+',"",tabdata)                     #cell.text is the text in each cell
                    tlist.append(tabdata)                               #appending to tlist(tlist is a list/array) a list of all the text in a row
                tab_list.append(tlist)                                  #list of all rows
                tlist = []
            f.write(str(tab_list) + "\n")                      #write() needs a string, not a list
            tab_list=[]

def main():
    global my_doc, my_doc_file
    get_inputs()
    extract(my_doc_file)
    print("DONE!")

main()
and the error is:

Traceback (most recent call last):
  File "gen_scr.py", line 44, in <module>
    main()
  File "gen_scr.py", line 41, in main
    extract(my_doc_file)
  File "gen_scr.py", line 27, in extract
    if re.search("mystring", table.cell(0,1).text, re.IGNORECASE):
  File "$PYTHONPATH/python3.6/site-packages/docx/table.py", line 81, in cell
    return self._cells[cell_idx]
IndexError: list index out of range
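One possible explanation, which the posts above do not confirm: the traceback ends in self._cells[cell_idx], so table.cell(0,1) is asking for a cell that at least one table does not have. That could happen if the failing files contain an extra small table (for example a single-cell table used for layout around the text) that is visited before the tables of interest. The sketch below checks the first row before reading cell (0, 1) and skips tables that are too small. The helper names first_row_cell_text and extract_matching_tables are made up for illustration, and the stand-in objects only imitate the rows / cells / text attributes of python-docx tables so the example runs without a .docx file.

```python
import re
from types import SimpleNamespace


def first_row_cell_text(table, col_idx):
    """Return the text of cell (0, col_idx), or None if the table has no such cell."""
    rows = table.rows
    if len(rows) == 0:
        return None
    cells = rows[0].cells
    if col_idx >= len(cells):
        return None
    return cells[col_idx].text


def extract_matching_tables(tables, pattern):
    """Collect rows of cell text from every table whose cell (0, 1) matches pattern."""
    matched = []
    for table in tables:
        header = first_row_cell_text(table, 1)
        if header is None or not re.search(pattern, header, re.IGNORECASE):
            continue  # table is too small or is not one of the tables of interest
        matched.append(
            [[re.sub(r"\s+", "", cell.text) for cell in row.cells] for row in table.rows]
        )
    return matched


def _fake_table(rows):
    """Hypothetical stand-in that imitates the rows/cells/text attributes of a table."""
    return SimpleNamespace(
        rows=[SimpleNamespace(cells=[SimpleNamespace(text=t) for t in r]) for r in rows]
    )


tables = [
    _fake_table([["only cell"]]),                      # 1x1 table: no cell (0, 1)
    _fake_table([["ID", "My String"], ["1", "a b"]]),  # table of interest
]
result = extract_matching_tables(tables, "My String")
print(result)
```

With a real document, the same function would be called as extract_matching_tables(my_doc_file.tables, "mystring"). Printing len(table.rows) and len(table.rows[0].cells) for each table in a failing file is a quick way to see whether such an undersized table is actually present.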