docx file to pandas dataframe/excel

iitip92 · Jun-26-2024, 07:40 PM

Hi everyone,
Following the python-docx documentation and the Dropbox API documentation, I wanted to download doc and docx files from a Dropbox directory. I used this function, which I adapted to return a pandas DataFrame:

def download_file(dbx, path):
##    path = '/%s/%s/%s' % (folder, subfolder.replace(os.path.sep, '/'), name)
##    dest = '/%s/%s' % (folder, subfolder.replace(os.path.sep, '/'))
    while '//' in path:
        path = path.replace('//', '/')
##    while '//' in dest:
##        dest = dest.replace('//', '/')
    with stopwatch('download'):
        try:
            md,res = dbx.files_download(path)
        except dropbox.exceptions.HttpError as err:
            #print('*** HTTP error', err)
            return None
        except dropbox.exceptions.ApiError:
            #print('** API error', dropbox.exceptions.ApiError)
            return None
##    with open(res.content, encoding='CP1252') as data_handle:
##    data = res.content
##        s = str(data_handle)
##        s_data = StringIO(s)
    #decode unreadable 0x90 byte
    #with codecs.open(s_data, encoding='iso-8859-1') as s_data_handle:
    try:
        s_data = base64.b64decode(res.content)
    except:
        s_data = ""
    try:
        json_data = json.loads(s_data)
        json_str = json_data.decode('CP1252')
        dataframe = pd.DataFrame(json_str)
        print(dataframe)
    except:
        dataframe = pd.DataFrame()
    #print(len(data))
    return dataframe

You can see that I tried several decoding options. The files contain a table, and I am having difficulty parsing them, saving the doc files locally, and loading them into a DataFrame.
In my main() function I call download_file like this, inside a for loop that reads the directories:

                    res = download_file(dbx, fname)
                    doc = Document()
                    try:
                        t = doc.add_table(res.shape[0] + 1, res.shape[1])
                        #add header row
                        for j in range(res.shape[-1]):
                            t.cell(0, j).text = res.columns[j]
                        #populate rest of table
                        for i in range(res.shape[0]):
                            for j in range(res.shape[-1]):
                                t.cell(i + 1, j).text = str(res.values[i, j])
                                
                    except ValueError:
                        cleaned_str = ''.join(c for c in res.decode('CP1252', errors='ignore') if valid_xml_char_ordinal(c))
                        doc.add_paragraph(cleaned_str)
                        break

At this point I have downloaded the files, but MS Word cannot read them. Do I need to save the files locally and then read them again using the docx tables method, like this:

        doc1 = Document('./' + str(fname).replace('\\','/'))
        table = doc1.tables[0]
        data1 = []
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            #...

Is there a simpler solution? My objective at this point is to read doc/docx files from Dropbox and save the contents of their tables to Excel locally.
Thanks for your time

Pedroski55 · (This post was last modified: Jun-27-2024, 05:43 AM by Pedroski55.)

Got an example docx file to work on?

If the files are unreadable, nothing will help!
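One rough way to check whether the downloaded bytes are a readable Word file at all is to look at the container format: a .docx is a zip archive that normally contains word/document.xml, while a legacy .doc is an OLE compound file, which python-docx cannot open. A minimal sketch using only the standard library; the name sniff_word_format is made up for illustration:

```python
import io
import zipfile

# Signature at the start of every OLE compound file.
# Legacy .doc files use this container (so do old .xls and .ppt files).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(data: bytes) -> str:
    """Return 'docx', 'doc' or 'unknown' for the raw bytes of a file."""
    if data.startswith(OLE_MAGIC):
        # OLE container: probably a legacy .doc, not readable by python-docx.
        return "doc"
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            # Usual location of the main document part in a .docx archive.
            if "word/document.xml" in archive.namelist():
                return "docx"
    return "unknown"
```

With the download function from the first post, sniff_word_format(res.content) should return 'docx' for a file that Word can open. Note that res.content is already the raw file, so base64-decoding or JSON-parsing it first will not produce a usable document.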

This gets the tables to pandas:

from docx import Document
import pandas as pd

mydocxfile = '/home/pedro/myPython/docxFiles/example_table2.docx'
for table in Document(mydocxfile).tables:
    data = [[cell.text for cell in row.cells] for row in table.rows]
    print(pd.DataFrame(data[1:], columns=data[0]), "\n")

Gives:

Output:
           Name Age     Occupation
0  King Charles  75           King
1         Pedro  55       Layabout
2          Baby  32  Import-Export
