Python Forum
docx file to pandas dataframe/excel
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
docx file to pandas dataframe/excel
#1
Hi everyone,
following the docx documentation and dropbox API documentation I wanted to download doc and docx files from a dropbox directory. I used this function which I adapted to return a pandas dataframe:
def download_file(dbx, path):
##    path = '/%s/%s/%s' % (folder, subfolder.replace(os.path.sep, '/'), name)
##    dest = '/%s/%s' % (folder, subfolder.replace(os.path.sep, '/'))
    while '//' in path:
        path = path.replace('//', '/')
##    while '//' in dest:
##        dest = dest.replace('//', '/')
    with stopwatch('download'):
        try:
            md,res = dbx.files_download(path)
        except dropbox.exceptions.HttpError as err:
            #print('*** HTTP error', err)
            return None
        except dropbox.exceptions.ApiError:
            #print('** API error', dropbox.exceptions.ApiError)
            return None
##    with open(res.content, encoding='CP1252') as data_handle:
##    data = res.content
##        s = str(data_handle)
##        s_data = StringIO(s)
    #decode unreadable 0x90 byte
    #with codecs.open(s_data, encoding='iso-8859-1') as s_data_handle:
    try:
        s_data = base64.b64decode(res.content)
    except:
        s_data = ""
    try:
        json_data = json.loads(s_data)
        json_str = json_data.decode('CP1252')
        dataframe = pd.DataFrame(json_str)
        print(dataframe)
    except:
        dataframe = pd.DataFrame()
    #print(len(data))
    return dataframe
You can see that I tried several decoding options because the files have a table and I am having difficulty parsing them and saving the doc files locally and passing them to a dataframe.
In my main() function I call the download_file as such, from a for loop reading directories:

                    res = download_file(dbx, fname)
                    doc = Document()
                    try:
                        t = doc.add_table(res.shape[0] + 1, res.shape[1])
                        #add header row
                        for j in range(res.shape[-1]):
                            t.cell(0, j).text = res.columns[j]
                        #populate rest of table
                        for i in range(res.shape[0]):
                            for j in range(res.shape[-1]):
                                t.cell(i + 1, j).text = str(res.values[i, j])
                                
                    except ValueError:
                        cleaned_str = ''.join(c for c in res.decode('CP1252', errors='ignore') if valid_xml_char_ordinal(c))
                        doc.add_paragraph(cleaned_str)
                        break
At this point I have downloaded the docs files, however they are unreadable by ms word. Is there a need to download the files locally and then reading using again docx tables method:

        doc1 = Document('./' + str(fname).replace('\\','/'))
        table = doc1.tables[0]
        data1 = []
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            #...
Is there a simpler solution? At this point my objective is to read doc/docx files from dropbox and saving the content of their tables in excel locally.
Thanks for your time
Reply
#2
Got an example docx file to work on?

If the files are unreadable, nothing will help!

This gets the tables to pandas:

from docx import Document
import pandas as pd

mydocxfile = '/home/pedro/myPython/docxFiles/example_table2.docx'
for table in Document(mydocxfile).tables:
    data = [[cell.text for cell in row.cells] for row in table.rows]
    print(pd.DataFrame(data[1:], columns=data[0]), "\n")
Gives:

Output:
Name Age Occupation 0 King Charles 75 King 1 Pedro 55 Layabout 2 Baby 32 Import-Export
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Most efficient way to roll through a pandas dataframe? sawtooth500 1 172 Jun-07-2024, 03:37 AM
Last Post: sawtooth500
  Python openyxl not updating Excel file MrBean12 1 541 Mar-03-2024, 12:16 AM
Last Post: MrBean12
  Copy Paste excel files based on the first letters of the file name Viento 2 627 Feb-07-2024, 12:24 PM
Last Post: Viento
  no module named 'docx' when importing docx MaartenRo 1 1,354 Dec-31-2023, 11:21 AM
Last Post: deanhystad
  Python Alteryx QS-Passing pandas dataframe column inside SQL query where condition sanky1990 0 865 Dec-04-2023, 09:48 PM
Last Post: sanky1990
  Search Excel File with a list of values huzzug 4 1,456 Nov-03-2023, 05:35 PM
Last Post: huzzug
  Updating sharepoint excel file odd results cubangt 1 1,056 Nov-03-2023, 05:13 PM
Last Post: noisefloor
  Python and pandas: Aggregate lines form Excel sheet Glyxbringer 12 2,305 Oct-31-2023, 10:21 AM
Last Post: Pedroski55
  Replace a text/word in docx file using Python Devan 4 4,585 Oct-17-2023, 06:03 PM
Last Post: Devan
  Question on pandas.dataframe merging two colums shomikc 4 993 Jun-29-2023, 11:30 AM
Last Post: snippsat

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020