Error while parsing tables from docx file

aditi · (This post was last modified: Jul-11-2020, 02:30 AM by Larz60+.)

Hi all,
I am currently using Python 3.6 with the python-docx package to parse document files.
The code:

import re
from docx import Document

my_doc = Document(doc)  # doc is the path to the .docx file

def extract(my_doc, w1, w2):
    tabdata = []
    for table in my_doc.tables:  # loop through all tables in the .docx file
        if re.search("My String", table.cell(0, 1).text, re.IGNORECASE):  # table of interest
            for row in table.rows:  # loop through all rows in that table
                for cell in row.cells:
                    tabdata.append(cell.text)  # collect the text of each cell
    return tabdata

Across multiple document files, the same code behaves differently. Each document contains a combination of text and tables, and I am trying to parse just the tables.
For some files that contain both text and tables, parsing works.
For other files, the error below shows up.
All the files are similar and contain the keywords and tables I am searching for with re.search(). All the tables in the different files have an equal number of rows and columns.
The error doesn’t show up if the document file contains only tables and no other text or paragraphs.
I am unsure whether the issue is a corrupted docx file, characters in the docx file that my script does not parse, or something missing from my script.

The error I am facing:

Traceback (most recent call last):
  File "my_python_script.py", line 579, in main_1
    extract(mld,w1,w2)
  File "my_python_script.py", line 128, in extract
    if re.search("My String", table.cell(0,1).text, re.IGNORECASE):
  File "$PYTHONPATH/python3.6/site-packages/docx/table.py", line 81, in cell
    return self._cells[cell_idx]
IndexError: list index out of range

Any help on this issue would be much appreciated!
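
One way to narrow this down is to look at the shape of every table before indexing into it. The sketch below is only an illustration: describe_tables is a hypothetical helper, not part of python-docx, and it relies only on the tables, rows, columns and row.cells attributes already used in the code above. A table reported with fewer than two columns, or a row reported with fewer cells than the column count, would be a candidate for the failing table.cell(0,1) call.

```python
def describe_tables(document):
    """Return one (table_index, n_rows, n_cols, cells_per_row) tuple per table.

    `document` is expected to look like a python-docx Document: it exposes
    `.tables`, and each table exposes `.rows` and `.columns`.
    """
    report = []
    for index, table in enumerate(document.tables):
        cells_per_row = []
        for row in table.rows:
            try:
                cells_per_row.append(len(row.cells))
            except IndexError:
                # Assumption: an irregular row may fail the same way table.cell() does.
                cells_per_row.append(None)
        report.append((index, len(table.rows), len(table.columns), cells_per_row))
    return report
```

With a real file this could be called as describe_tables(Document("some_doc.docx")) and each tuple printed, so the failing files can be compared with the working ones.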

aditi · (This post was last modified: Jul-14-2020, 09:26 PM by aditi.)

I see, does this make it better?
The input file:

My Document = some_doc.docx

import os
import re
from docx import Document

f = open("sometext.txt", "w")
def get_inputs():
    global my_doc, my_doc_file

    input_file = open('input.txt','r')                     #reading inputs text file
    names = input_file.readlines()
    input_file.close()
    for i in range(0,len(names)):
        if re.search(r'My document',names[i],re.IGNORECASE):
            pos = str(names[i]).rfind("=")
            my_doc = str(names[i][pos+1:]).strip()
    if os.stat(my_doc).st_size == 0:                       #checking the size of my_doc, if zero then message displays
        print("Empty Document! Please check and retry!")    
    my_doc_file = Document(my_doc)                         #reading the .docx file, throws error if it does not exist
    print(my_doc, my_doc_file)


def extract(my_doc):
    tlist = []
    tab_list = []
    #global my_doc, my_doc_file
    for table in my_doc.tables:                                        #looping through all tables in the .docx file
        if re.search("mystring", table.cell(0,1).text, re.IGNORECASE):            
            for row in table.rows:                                  #looping through all rows in the table under consideration
                for cell in row.cells:                                  #looping through all cells(grid cells are considered, not actual) in the row under consideration
                    tabdata = cell.text
                    tabdata = re.sub(r'\s+',"",tabdata)                     #cell.text is the text in each cell
                    tlist.append(tabdata)                               #appending to tlist(tlist is a list/array) a list of all the text in a row
                tab_list.append(tlist)                                  #list of all rows
                tlist = []
            f.write(str(tab_list) + "\n")                      #write() needs a string, not a list
            tab_list=[]

def main():
    global my_doc, my_doc_file
    get_inputs()
    extract(my_doc_file)
    print("DONE!")

main()

and the error is:

Traceback (most recent call last):
  File "gen_scr.py", line 44, in <module>
    main()
  File "gen_scr.py", line 41, in main
    extract(my_doc_file)
  File "gen_scr.py", line 27, in extract
    if re.search("mystring", table.cell(0,1).text, re.IGNORECASE):            
  File "$PYTHONPATH/python3.6/site-packages/docx/table.py", line 81, in cell
    return self._cells[cell_idx]
IndexError: list index out of range
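
The traceback shows table.cell(0,1) failing inside python-docx, which suggests that at least one table in the failing files has no cell at row 0, column 1 as python-docx sees it. That could be a single-column table, or possibly a table with merged or omitted cells; the thread does not confirm which. Below is a minimal sketch of a guard, assuming it is acceptable to skip such tables. header_matches is a hypothetical helper name, and only the existing Table.cell(row_idx, col_idx) call is used.

```python
import re

def header_matches(table, pattern, row_idx=0, col_idx=1):
    """Return True if the cell at (row_idx, col_idx) exists and matches pattern.

    Tables that have no such cell are treated as non-matching instead of
    raising IndexError.
    """
    try:
        text = table.cell(row_idx, col_idx).text
    except IndexError:
        # No cell at that position in this table, so it cannot be the one we want.
        return False
    return re.search(pattern, text, re.IGNORECASE) is not None
```

In extract, the condition would then read: if header_matches(table, "mystring"):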
