Python Forum

Hello
I have xlsx file with 6 columns with 100 rows ~ .
the first thing I want to do is :
Read column A see the data , insert him to a list if it's not appear already
so I will have a list of all the names in this column

which package it will be best to use ?

Thanks,

openpyxl will do this
PyPi page: https://pypi.org/project/openpyxl/
Docs: https://openpyxl.readthedocs.io/en/stable/:
install with pip install openpyxl from command line.

OK
I have start to work with this
now I have anoter question

it's seem that it take too much time
if I understand correct it load the all file that start to run on it
so I took another file with 1000000 lines
and get Error:

you have  1048574  lines of data
MemoryError

During handling of the above exception, another exception occurred:

is there any wayto fix this ?

this is what I have done:

def ReadTxt():
    PID_Size = 0
    Error = False
    xlsx_file = Path('SimData', 'C:\\Users\\Computer\\Desktop\\Data.xlsx')
    wb_obj = openpyxl.load_workbook(xlsx_file)
    sheet = wb_obj.active
    print('you have ', str(sheet.max_row), ' lines of data')
    for row in sheet.iter_rows(min_row=None):
        PID_Name = row[0].value # this is where the PID located - first place
        try:
            PID_Size = len(PID_Name)
        except TypeError as e:
            print(PID_Name)
          #  print(e)
            PID_name = str(PID_Name)
            PID_Size = len(str(PID_Name))
            Error = True
        if not Error:
            if PID_Name not in str(PID_LIST) and PID_Size > 4:
                data = ""
                for cell in row:
                    data = (data + ":" + str(cell.value))
                data = data[1:]
                index = data.find(':')
                # print (index)
                data = data[0:index+1] + "    " + data[index+5:]
                PID_LIST.append(data)

    for pid in PID_LIST:
        print(pid)

and this is the resualt I'm getting (which is what I wanted)

18FC08F4:    D0:2:3D:28:61:26:18:A9
18F3C999:    0:0:0:0:0:FF:5A:0
18FC07F4:    B0:13:B0:13:77:1:48:36
18FC1621:    3:C7:0:0:D1:2:FF:E0
0CFE5A2F:    11:0:0:0:D3:F5:FF:F0
18FC1721:    F8:2A:20:4E:FF:FF:FE:EE
18FE592F:    D2:7C:83:7C:C9:7C:EA:7C
0C000027:    23:FF:FF:F3:FF:FF:FF:F3
.
.
.
.

I have used this successfully for quite a while, but never with so much data.
I can only suggest that you try some of the other packages available.
Here's what available: https://pypi.org/search/?q=xslx&o=-zscore
This shows all packages having anything to do with xslx, so a lot of it won't be appropriate for your application.
I wish I could help more, but haven't tried any of these.

Perhaps others will have a suggestion for you.

You could try to convert the xlsx file to csv through external command. Example with libreoffice:

Output:
soffice -env:UserInstallation=file:///tmp/test --headless --convert-to csv --outdir ~/tmp foo.xlsx

This creates file ~/tmp/foo.csv

ok
and CVS is much easier to work with ? (I'm not familiar with all excel option at all)
it will take less memory ?

Thanks ,

Well, the csv reader in the standard csv module processes the file line by line for example. You don't need to keep more than one line at a time in memory.

Ok
What I will need to change in the code ?

Thanks,

korenron Wrote:What I will need to change in the code ?

You wouldn't use openpyxl any more. Instead you would iterate on the rows of an object returned by a call to csv.reader() as shown in the documenrtation of the csv module.

I will try it

Thank you for the help!

korenron

Larz60+

korenron

Larz60+

Gribouillis

korenron

Gribouillis

korenron

Gribouillis

korenron