Dealing with duplicated data in a CSV file - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Dealing with duplicated data in a CSV file (/thread-34829.html)
Dealing with duplicated data in a CSV file - bts001 - Sep-04-2021

I am new to the programming world and I am looking for advice on the scenario below. I have a CSV file with four columns of data, and I want to extract specific cells within the file through these steps:

1) Analyze the first column for duplicates.
2) Using the first duplicate row, extract the values in the second and third columns.
3) Store the extracted data in a new column or a separate CSV file.
4) Repeat for all duplicates.

Note: I am not trying to remove duplicates; in fact, I am trying to target them and keep only the first duplicate row of each. What would be the easiest way to approach this? Thanks!

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-04-2021

There are a couple of details that should be clarified. First, can you assume the data is sorted by the first column? Can you rely on all duplicates being together? Second, is it possible for there to be more than one duplicate? If so, what do you need to do for more than one duplicate?

RE: Dealing with duplicated data in a CSV file - bts001 - Sep-04-2021

(Sep-04-2021, 10:37 PM)SamHobbs Wrote: There are a couple of details that should be clarified.

Yes, the data is sorted by the first column, all duplicates will be together, and at most there will only be one duplicate. For clarification, the first column will look something like 1,2,3,4,5,5,6,7,8,8,9,9,10, etc. What I need is the data in the corresponding row of the first time one of the duplicates is mentioned, so 1,2,3,4,*5*,5,6,7,*8*,8,*9*,9,10 (the starred rows are all I need).

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-05-2021

I am new to Python, but I have been programming for half a century. The following uses techniques I might have used in COBOL, but using Python of course. My data is a bit different from what you specify, but I hope it is close enough.
When you ask questions, providing a small amount of sample data as in the following helps those who want to help you. I am using:

```
key,f1,f2
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10
5,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18
8,data19,data20
9,data21,data22
9,data23,data24
10,data25,data26
```

The following:

```python
import sys
import csv

with open('WithDuplicates.csv', newline='') as csvfile:
    dupsreader = csv.reader(csvfile, delimiter=',')
    datalist = list(dupsreader)
n = len(datalist)
print(f"{n} records")
if n < 3:
    print("Not enough data")
    sys.exit()
x = 2
while x < n:
    if datalist[x][0] == datalist[x-1][0]:  # duplicate?
        print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]},{datalist[x][1]},{datalist[x][2]}")
        x = x + 2  # skip the duplicate
    else:
        print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]}")
        x = x + 1
if x == n:  # the last row was not consumed as a duplicate, so print it too
    print(f"{datalist[n-1][0]},{datalist[n-1][1]},{datalist[n-1][2]}")
```

Produces:

```
14 records
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18,data19,data20
9,data21,data22,data23,data24
10,data25,data26
```

That does not write the data out as a CSV, but I hope that is close enough; I am not sure exactly what you need as output. Note that the data has a header. Also note that this (I believe) reads the entire file into memory. That should be okay (computers have ample resources compared to half a century ago) unless it is a really big file.

RE: Dealing with duplicated data in a CSV file - Pedroski55 - Sep-05-2021

I'm sure the experts here can do this much more elegantly, but this works. After doing PHP all morning, Python is a pleasant relief!!
(I can't handle PHP)

```python
import csv

path2csv = '/home/pedro/myPython/csv/randomdups.csv'
infile = open(path2csv)  # open the data file
answers = csv.reader(infile)
# csv.reader is annoying: it is exhausted after one pass,
# so, at least while you are experimenting, read it into data first
# data will be a list of lists you can use more than one time
data = []
for row in answers:
    data.append(row)
infile.close()

# for info
for d in data:
    print(d)
"""
['col1', 'col2', 'col3']
['eggs', '25', '28']
['bananas', '3', '46']
['diamonds', '54', '63']
['apples', '15', '12']
['pears', '55', '11']
['pumpkins', '2', '22']
['eggs', '9', '8']
['bananas', '99', '101']
['apples', '14', '33']
['pears', '61', '17']
['pumpkins', '87', '45']
['rust', '13', '87']
['eggs', '88', '46']
['bananas', '89', '47']
['apples', '90', '48']
['pears', '91', '49']
['pumpkins', '92', '50']
"""

# get rid of the column headers
del data[0]

# there can be no duplicates in a set, so collect the unique items
unique_items = set()
for d in data:
    unique_items.add(d[0])

# just for info
for u in unique_items:
    print(u)

# make a dictionary where all values are 0
item_num_dict = {}
for item in unique_items:
    item_num_dict[item] = 0

# now count the number of occurrences of each item
for item in unique_items:
    for d in data:
        if d[0] == item:
            item_num_dict[item] += 1

savepath = '/home/pedro/myPython/csv/'

# the first example of each duplicate key goes in this dictionary
first_example_data = {}

# a function to get the first example of a duplicate key
def getFirstExample(key):
    for d in data:
        if key == d[0] and item_num_dict[key] > 1:
            # store the data as a tuple
            first_example_data[key] = (d[1], d[2])
            # bail out after the first example
            return

# get the first example of each duplicated item
for key in item_num_dict.keys():
    getFirstExample(key)

# save each first example to its own CSV file
for key in first_example_data.keys():
    info_list = [key, first_example_data[key][0], first_example_data[key][1]]
    savename = savepath + key + '_first_example.csv'
    with open(savename, mode='w') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        fieldnames = ['item', 'data1', 'data2']
        f_writer.writerow(fieldnames)
        f_writer.writerow(info_list)
    print('First duplicate values saved to', savename)

print('Makes a change from that complicated PHP shit ... ')
```

And if you're offering me diamonds and rust, I've already paid.

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-05-2021

(Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works.

I assume in the future you would prefer that I leave your questions for the experts. Note that I began by using csvreader.next, but when I searched for a way to detect the end of the file I could not find an "elegant" solution. Most other languages, including COBOL, have an easy way to detect the end.

RE: Dealing with duplicated data in a CSV file - Gribouillis - Sep-05-2021

If the file is already sorted on the first column, you can invoke itertools.groupby():

```python
# untested code, but you get the idea
import csv
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield next(group)

def main():
    with open('input.csv') as ifh, open('output.csv', 'w') as ofh:
        rd = csv.reader(ifh)
        wt = csv.writer(ofh)
        wt.writerows(unique_rows(rd))

if __name__ == '__main__':
    main()
```

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-05-2021

(Sep-04-2021, 06:15 PM)bts001 Wrote: Store the extracted data in a new column or separate csv file

(Sep-05-2021, 06:44 PM)Gribouillis Wrote: you can invoke itertools.groupby()

Storing the data from the duplicates is the tricky part.

RE: Dealing with duplicated data in a CSV file - Gribouillis - Sep-05-2021

SamHobbs Wrote: Storing the data from the duplicates is the tricky part.

Oh I see what you mean.
It does not seem that tricky; you could replace the unique_rows() function above with this one:

```python
def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))
```

RE: Dealing with duplicated data in a CSV file - Pedroski55 - Sep-05-2021

(Sep-05-2021, 05:57 PM)SamHobbs Wrote: (Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works.

I assume in the future you would prefer that I leave your questions for the experts.

Not at all! I am very grateful for all answers! I just don't consider myself proficient. I can usually make things work, but, elegant??
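The two groupby() variants discussed in the thread can be exercised on in-memory rows instead of files. This is a minimal sketch (the row values below are made up, modelled on SamHobbs' sample data), not part of any poster's code:

```python
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    # keep only the first row of each run of equal keys
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield next(group)

def merged_rows(rows):
    # keep the key once and append the remaining fields of every duplicate
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))

rows = [
    ["4", "data7", "data8"],
    ["5", "data9", "data10"],
    ["5", "data11", "data12"],
    ["6", "data13", "data14"],
]

print(list(unique_rows(rows)))
# → [['4', 'data7', 'data8'], ['5', 'data9', 'data10'], ['6', 'data13', 'data14']]
print(list(merged_rows(rows)))
# → [('4', 'data7', 'data8'), ('5', 'data9', 'data10', 'data11', 'data12'), ('6', 'data13', 'data14')]
```

Both rely on the file being sorted by key, since groupby() only groups consecutive equal keys.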
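One reading of the original request is narrower than the solutions above: keep only the first row of each *duplicated* key, and drop keys that appear once. If that is what is wanted, it can be sketched with collections.Counter; the sample rows and the output file name here are made up for illustration:

```python
import csv
from collections import Counter

rows = [
    ["1", "a1", "b1"],
    ["5", "a5", "b5"],
    ["5", "a5x", "b5x"],
    ["8", "a8", "b8"],
    ["8", "a8x", "b8x"],
    ["9", "a9", "b9"],
]

counts = Counter(r[0] for r in rows)  # occurrences of each key
seen = set()
first_dups = []
for r in rows:
    # first row of a key that occurs more than once
    if counts[r[0]] > 1 and r[0] not in seen:
        first_dups.append(r)
        seen.add(r[0])

# store the extracted rows in a separate CSV file
with open("duplicates.csv", "w", newline="") as f:
    csv.writer(f).writerows(first_dups)

print(first_dups)
# → [['5', 'a5', 'b5'], ['8', 'a8', 'b8']]
```

Because Counter needs a full pass over the data before filtering, this reads the whole file into memory, like the other answers, but it does not require the file to be sorted.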