Python Forum
Sorting data with pandas - Printable Version



Sorting data with pandas - TheZaind - Nov-15-2021

I am working on inventory software that uses Optical Character Recognition (EasyOCR) to scan a document. Now I have the problem of cleaning/sorting the data (using pandas). I want a csv file like this:

Quote:
Article Number:  Name:
56451748468434   Shoe
24564165165145   Boat
...


The data I got looks like this:
Quote:
00605555
Retail Apple I2W USB Power Adapler
00605558
Lightning t0 USB Cable (2 m)
00605613
Apple Lightn t0
Smm Headphone
00805614
Apple iPhone Lightning Dock Black
0C605615
Apple EarPods wilh Lighining Con;
00605774
Huawei GigaCube LTE CPE Fenstera
00605806
Google Home
00605834
Belkin BOOSTUP Wireless Charger
00605872
Google Hame Mini

I tried:
data = pd.DataFrame(data['1'].values.reshape(-1, 2), columns=['Artikel', 'AN'])
But then my output looks like this:
Output:
   Artikel                             AN
0  Retail Apple I2W USB Power Adapler  00605558
1  Lightning t0 USB Cable (2 m)        00605613
2  Apple Lightn t0                     Smm Headphone
3  00805614                            Apple iPhone Lightning Dock Black
4  0C605615                            Apple EarPods wilh Lighining Con;
5


Any idea how I can clean/sort the data properly (only the numbers on one side and the name on the other side)?

Thank you! :)


RE: Sorting data with pandas - jefsummers - Nov-15-2021

Looking at the data, the first 6 lines alternate between number and description. Then you get 2 consecutive lines of description, which is going to be problematic. If you try to filter on "digits only means it is the code", one of the codes has a "C" in it, which again is a problem.
You need to be able to describe how the program is to distinguish the columns; then you can try to program it.
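
One way such a rule could be written down, as a rough sketch: treat a line as an article number when it is exactly eight characters long and consists only of digits once likely OCR confusions are mapped back. The eight-character length and the confusion table (`C` and `O` read as `0`, `I` and `l` read as `1`) are assumptions guessed from the sample data, not something EasyOCR provides, and `normalize_code` and `is_article_number` are hypothetical helper names.

```python
# Assumed OCR confusions, guessed from the sample data (e.g. "0C605615").
OCR_FIXES = str.maketrans({"C": "0", "O": "0", "I": "1", "l": "1"})


def normalize_code(text):
    """Map letters that OCR may confuse with digits back to digits."""
    return text.strip().translate(OCR_FIXES)


def is_article_number(text, length=8):
    """True if the line looks like an article number (assumed: 8 digits)."""
    candidate = normalize_code(text)
    return len(candidate) == length and candidate.isdigit()


print(is_article_number("0C605615"))     # True
print(is_article_number("Google Home"))  # False
```

Only apply `normalize_code` to lines that pass the check; running it over description lines would corrupt the names.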


RE: Sorting data with pandas - TheZaind - Nov-15-2021

(Nov-15-2021, 07:25 PM)jefsummers Wrote: Looking at the data, first 6 lines alternate between number and description. Then you get 2 lines of description which is going to be problematic. If you try to filter by if numbers only it is the code, then one of the codes has a "C" in it which again is a problem.
You need to be able to describe how the program is to distinguish the columns, then you can try to program it.
Yeah, thank you, I fixed the problem with the 'C'. But how can I program it so that the numbers end up on one side and, on the other side, all the text below each number until a new number comes?


RE: Sorting data with pandas - jefsummers - Nov-17-2021

Not sure what the data looks like exactly when you import, but I suggest using a try...except block, where you use int() on the string. If it passes, it's a number; if it throws an exception, it is not.
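
A minimal sketch of that idea, assuming the OCR result is available as a plain list of strings: `is_number` is the try...except test, and `group_lines` (a hypothetical helper) collects every text line under the most recent number until the next number appears.

```python
def is_number(text):
    """Return True if int() accepts the string, False otherwise."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def group_lines(lines):
    """Pair each number with all text lines that follow it, up to the next number."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if is_number(line):
            rows.append([line, ""])  # a number starts a new article
        elif rows:
            # text belongs to the current article; join multiple lines with a space
            rows[-1][1] = (rows[-1][1] + " " + line).strip()
        # text that appears before the first number is dropped
    return rows


sample = [
    "00605613",
    "Apple Lightn t0",
    "Smm Headphone",
    "00805614",
    "Apple iPhone Lightning Dock Black",
]
for number, name in group_lines(sample):
    print(number, "|", name)
```

The resulting list of rows can be handed to `pd.DataFrame(rows, columns=['AN', 'Artikel'])`. Note that a description line consisting only of digits would be mistaken for a new article number, and a misread code such as 0C605615 has to be corrected first or it will be treated as text.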


RE: Sorting data with pandas - aserian - Nov-22-2021

You should show how you are importing the data to help answer the question.

I would recommend parsing this data into an actual csv format before reading into pandas:

import csv


headers = ['AN', 'Artikel']
with open('data.txt', 'r') as in_data, open('out.csv', 'w', newline='') as out_csv:
    csv_writer = csv.writer(out_csv)
    csv_writer.writerow(headers)

    row = []

    # assumes the lines strictly alternate: article number, then name
    for line in in_data:
        line = line.strip()
        if not line:
            continue  # skip blank lines so they don't shift the pairing

        row.append(line)
        if len(row) == 2:
            csv_writer.writerow(row)
            row = []

    # write a leftover unpaired line, if there is one
    if row:
        csv_writer.writerow(row)
This converts this:
12345S
Item 1
4152DS
Item 2
15190A
Item 3


To this:

AN,Artikel
12345S,Item 1
4152DS,Item 2
15190A,Item 3


Which pandas will read more cleanly.

Edit: To import and sort:

import pandas as pd
df = pd.read_csv('out.csv')
df.sort_values(["AN"], inplace=True)
print(df.head())
Output:
       AN Artikel
0  12345S  Item 1
2  15190A  Item 3
1  4152DS  Item 2
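
One caveat when applying this to the original data: article numbers such as 00605555 consist only of digits, so `read_csv` infers an integer column by default and the leading zeros are lost. A small sketch, assuming the same column names as above, that keeps them as text by passing `dtype`; the in-memory `io.StringIO` file merely stands in for out.csv.

```python
import io

import pandas as pd

# Stand-in for out.csv, using two entries from the original post
csv_text = (
    "AN,Artikel\n"
    "00605806,Google Home\n"
    "00605555,Retail Apple I2W USB Power Adapler\n"
)

# dtype=str for the AN column keeps the leading zeros intact
df = pd.read_csv(io.StringIO(csv_text), dtype={"AN": str})
df = df.sort_values("AN").reset_index(drop=True)
print(df)
```

Because the column is text, the sort is alphabetical rather than numeric, which gives the same order as long as all article numbers have the same length.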