Python Forum
Sorting data with pandas
#1
I am working on inventory software that uses Optical Character Recognition (EasyOCR) to scan a document. Now I have the problem of cleaning and sorting the data (using pandas). I want a CSV file like this:

Quote:
Article Number: Name:
56451748468434 Shoe
24564165165145 Boat
...


The data I have looks like this:
Quote:
00605555
Retail Apple I2W USB Power Adapler
00605558
Lightning t0 USB Cable (2 m)
00605613
Apple Lightn t0
Smm Headphone
00805614
Apple iPhone Lightning Dock Black
0C605615
Apple EarPods wilh Lighining Con;
00605774
Huawei GigaCube LTE CPE Fenstera
00605806
Google Home
00605834
Belkin BOOSTUP Wireless Charger
00605872
Google Hame Mini

I tried:
data = pd.DataFrame(data['1'].values.reshape(-1, 2), columns=['Artikel', 'AN'])
But then my output looks like this:
Output:
   Artikel                             AN
0  Retail Apple I2W USB Power Adapler  00605558
1  Lightning t0 USB Cable (2 m)        00605613
2  Apple Lightn t0                     Smm Headphone
3  00805614                            Apple iPhone Lightning Dock Black
4  0C605615                            Apple EarPods wilh Lighining Con;
5


Any idea how I can clean/sort the data properly (only the numbers on one side and the name on the other side)?

Thank you! :)
Reply
#2
Looking at the data, the first 6 lines alternate between number and description. Then you get 2 lines of description in a row, which is going to be problematic. If you try to filter by treating digits-only lines as the code, one of the codes has a "C" in it, which again is a problem.
You need to be able to describe how the program is to distinguish the columns, then you can try to program it.
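
As an illustration of what such a rule could look like, here is a minimal sketch. It assumes (this is a guess based only on the sample data) that every article number is exactly 8 characters with no spaces, and it tolerates one OCR misread such as the "C" in 0C605615. The function name and the max_misreads parameter are made up for this example.

```python
import re

# Hypothetical rule: an article number is 8 letters/digits with no spaces.
CODE_PATTERN = re.compile(r"[0-9A-Za-z]{8}")


def looks_like_article_number(line, max_misreads=1):
    """Return True if the line has the shape of an article number.

    Up to max_misreads non-digit characters are tolerated, on the
    assumption that they are OCR errors (for example "C" read for "0").
    """
    text = line.strip()
    if not CODE_PATTERN.fullmatch(text):
        return False
    misreads = sum(1 for ch in text if not ch.isdigit())
    return misreads <= max_misreads


for sample in ["00605555", "0C605615", "Google Home", "Smm Headphone"]:
    print(sample, looks_like_article_number(sample))
```

If the real article numbers have a different length or format, the pattern has to be adjusted accordingly.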
Reply
#3
(Nov-15-2021, 07:25 PM)jefsummers Wrote: Looking at the data, the first 6 lines alternate between number and description. Then you get 2 lines of description in a row, which is going to be problematic. If you try to filter by treating digits-only lines as the code, one of the codes has a "C" in it, which again is a problem.
You need to be able to describe how the program is to distinguish the columns, then you can try to program it.
Yeah, thank you, I fixed the problem with the 'C'. But how can I program it to put the numbers on one side and, on the other side, all the text below each number until the next number comes?
Reply
#4
Not sure what the data looks like exactly when you import, but I suggest using a try...except block, where you use int() on the string. If it passes, it's a number; if it raises an exception, it is not.
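
A rough sketch of that idea, assuming the OCR lines are already available as a list of strings and that the stray letters in the numbers (like the 'C') have been fixed beforehand. The helper names is_number and group_lines are made up for this example. Every line that int() accepts starts a new row, and every other line is appended to the description of the row before it.

```python
def is_number(text):
    """Return True if int() accepts the string, False otherwise."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def group_lines(lines):
    """Collect (article number, description) pairs from alternating lines.

    Description lines are joined with a space until the next number appears.
    Text that comes before the first number is ignored.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if is_number(line):
            rows.append([line, []])
        elif rows:
            rows[-1][1].append(line)
    return [(number, " ".join(parts)) for number, parts in rows]


lines = [
    "00605613",
    "Apple Lightn t0",
    "Smm Headphone",
    "00605806",
    "Google Home",
]
rows = group_lines(lines)
for row in rows:
    print(row)
```

The resulting list of pairs can then be handed to pandas, for example with pd.DataFrame(rows, columns=['AN', 'Artikel']).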
Reply
#5
You should show how you are importing the data to help answer the question.

I would recommend parsing this data into an actual csv format before reading into pandas:

import csv


headers = ['AN', 'Artikel']
with open('data.txt', 'r') as in_data, open('out.csv', 'w', newline='') as out_csv:
    csv_writer = csv.writer(out_csv)
    csv_writer.writerow(headers)

    row = []

    # Assumes the lines strictly alternate: article number, then description.
    for line in in_data:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row.append(line)
        if len(row) == 2:
            csv_writer.writerow(row)
            row = []

    # Write a trailing unpaired line, if there is one.
    if row:
        csv_writer.writerow(row)
This converts this:
12345S
Item 1
4152DS
Item 2
15190A
Item 3


To this:

AN,Artikel
12345S,Item 1
4152DS,Item 2
15190A,Item 3


pandas can read this format much more cleanly.

Edit: To import and sort:

import pandas as pd
df = pd.read_csv('out.csv')
df.sort_values(["AN"], inplace=True)
print(df.head())
Output:
       AN Artikel
0  12345S  Item 1
2  15190A  Item 3
1  4152DS  Item 2
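
One thing to watch with the real data: the article numbers in the first post are all digits with leading zeros, and read_csv would normally infer an integer column and drop those zeros. A small sketch, assuming a CSV with the same AN and Artikel headers (built here from a string so it runs on its own), that reads the column as text instead:

```python
import io

import pandas as pd

# Stand-in for out.csv, using two entries from the original post.
csv_text = (
    "AN,Artikel\n"
    "00605806,Google Home\n"
    "00605555,Retail Apple I2W USB Power Adapler\n"
)

# dtype keeps the article numbers as strings, so leading zeros survive.
df = pd.read_csv(io.StringIO(csv_text), dtype={"AN": str})
df = df.sort_values("AN").reset_index(drop=True)
print(df)
```

Sorting strings is alphabetical, which matches numeric order only as long as all article numbers have the same length.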
Reply

