Python Forum
Dealing with duplicated data in a CSV file
#1
I am new to the programming world and I am looking for advice on the scenario below:

I have a csv file with four columns of data and I want to extract specific cells within the file. I want to do this through these steps:
1) Analyze the first column for duplicates
2) Using the first duplicate row, extract the values in the second and third columns.
3) Store the extracted data in a new column or separate csv file.
4) Repeat for all duplicates

Note: I am not trying to remove duplicates; in fact, I am trying to target them and keep only the first row of each set of duplicates.
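To make it concrete, here is a rough, untested sketch of the sort of thing I am picturing (I am guessing at a pandas-style approach; input.csv, extracted.csv and the column positions are just placeholders):

import pandas as pd

# rough, untested sketch: input.csv has four columns
df = pd.read_csv('input.csv')
key = df.columns[0]

# rows whose first-column value occurs more than once
dups = df[df.duplicated(subset=key, keep=False)]

# keep only the first row of each duplicated value
firsts = dups.drop_duplicates(subset=key, keep='first')

# save the second and third columns to a separate csv file
firsts.iloc[:, 1:3].to_csv('extracted.csv', index=False)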

What would be the easiest way to approach this?

Thanks!
#2
There are a couple of details that should be clarified.

First, can you assume the data is sorted by the first column? Can you rely on all duplicates to be together?

Second, is it possible for there to be more than one duplicate? If so then what do you need to do for more than one duplicate?
#3
(Sep-04-2021, 10:37 PM)SamHobbs Wrote: There are a couple of details that should be clarified.

First, can you assume the data is sorted by the first column? Can you rely on all duplicates to be together?

Second, is it possible for there to be more than one duplicate? If so then what do you need to do for more than one duplicate?

Yes, the data is sorted by the first column, all duplicates will be together, and at most there will only be one duplicate. For clarification, the first column will look something like 1,2,3,4,5,5,6,7,8,8,9,9,10, etc. What I need is the data in the row where each duplicated value first appears, so the first 5, the first 8, and the first 9 in that example.
#4
I am new to Python but I have been programming for half a century. The following uses techniques I might have used in COBOL, but in Python of course. My data is a bit different from what you specify, but I hope it is close enough. When you ask questions, providing a small amount of sample data, as in the following, helps those who want to help you. I am using:

key,f1,f2
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10
5,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18
8,data19,data20
9,data21,data22
9,data23,data24
10,data25,data26

The following:

import sys
import csv

with open('WithDuplicates.csv', newline='') as csvfile:
    dupsreader = csv.reader(csvfile, delimiter=',')
    datalist = list(dupsreader)   # read the whole file into memory
    n = len(datalist)
    print(f"{n} records")
    if n < 3:
        print("Not enough data")
        sys.exit()
    x = 2   # row 0 is the header, so start comparing at row 2
    while x < n:
        if datalist[x][0] == datalist[x-1][0]:   # duplicate?
            print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]},{datalist[x][1]},{datalist[x][2]}")
            x = x + 2    # skip duplicate
        else:
            print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]}")
            x = x + 1
    if x == n:   # the loop above never printed the final row
        print(f"{datalist[n-1][0]},{datalist[n-1][1]},{datalist[n-1][2]}")
Produces:

14 records
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18,data19,data20
9,data21,data22,data23,data24
10,data25,data26

That does not write the data out as a CSV file, but I hope it is close enough; I am not sure exactly what output you need. Note that the data has a header. Also note that this reads the entire file into memory. That should be okay (computers have ample resources compared to half a century ago) unless it is a really big file.
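If the file really were too big for memory, a streaming variant along the same lines (untested, keeping only the previous row around) could be:

import csv

with open('WithDuplicates.csv', newline='') as csvfile:
    rows = csv.reader(csvfile, delimiter=',')
    next(rows, None)                # skip the header
    prev = next(rows, None)         # the row waiting to be printed
    for row in rows:
        if prev is not None and row[0] == prev[0]:   # duplicate key?
            print(f"{prev[0]},{prev[1]},{prev[2]},{row[1]},{row[2]}")
            prev = None             # both rows of the pair are consumed
        else:
            if prev is not None:
                print(f"{prev[0]},{prev[1]},{prev[2]}")
            prev = row
    if prev is not None:            # flush the final unconsumed row
        print(f"{prev[0]},{prev[1]},{prev[2]}")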
#5
I'm sure the experts here can do this much more elegantly, but this works.

After doing PHP all morning, Python is a pleasant relief!! (I can't handle PHP)

import csv

path2csv = '/home/pedro/myPython/csv/randomdups.csv'

# read the data file in
# csv.reader gives a one-shot iterator: once you have looped over it, it's gone,
# so, at least while you are experimenting, read everything into data first
# data will be a list of lists that you can use more than one time
with open(path2csv) as infile:
    answers = csv.reader(infile)
    data = list(answers)

# for info
for d in data:
    print(d)

"""
['col1', 'col2', 'col3']
['eggs', '25', '28']
['bananas', '3', '46']
['diamonds', '54', '63']
['apples', '15', '12']
['pears', '55', '11']
['pumpkins', '2', '22']
['eggs', '9', '8']
['bananas', '99', '101']
['apples', '14', '33']
['pears', '61', '17']
['pumpkins', '87', '45']
['rust', '13', '87']
['eggs', '88', '46']
['bananas', '89', '47']
['apples', '90', '48']
['pears', '91', '49']
['pumpkins', '92', '50']
"""

# get rid of the column headers
del data[0]

# there can be no duplicates in a set
# declare an empty set

unique_items = set()

# get a set of all items 

for d in data:
    unique_items.add(d[0])

# just for info
for u in unique_items:
    print(u)

# make a dictionary where all values are 0

item_num_dict = {}

for item in unique_items:
    item_num_dict[item] = 0

# now count the number of occurrences of each item
for item in unique_items:    
    for d in data:
        if d[0] == item:
            item_num_dict[item] +=1
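# (aside: collections.Counter could do this count in one pass, something like
#     from collections import Counter
#     item_num_dict = Counter(d[0] for d in data)
# but the explicit loops above keep every step visible)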

           
savepath = '/home/pedro/myPython/csv/'

# get first example of a duplicate key in a dictionary

first_example_data = {}

# a function to get the first example of a duplicate key

def getFirstExample(key):
    for d in data:
        if key == d[0] and item_num_dict[key] > 1:
            # get the data as a tuple
            first_example_data[key] = (d[1], d[2])
            # bail out after the first example
            return

# get the first example of duplicate items and save to a dictionary: first_example_data
    
for key in item_num_dict.keys():
    getFirstExample(key)
     
for key in first_example_data.keys():
    info_list = [key, first_example_data[key][0], first_example_data[key][1]]
    savename = savepath + key + '_first_example.csv'
    with open(savename, mode='w', newline='') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        fieldnames = ['item', 'data1', 'data2']
        f_writer.writerow(fieldnames)
        f_writer.writerow(info_list)
        print('First duplicate values saved to', savename)

print('Makes a change from that complicated PHP shit ... ')
And if you're offering me diamonds and rust
I've already paid
#6
(Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works.
I assume in the future you would prefer that I leave your questions for the experts.

Note that I began by using csvreader.next, but when I searched for a way to detect the end of the file I could not find an "elegant" solution. Most other languages, including COBOL, would have an easy way to detect the end.
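One idiom that comes close is giving next() a default value to act as an end-of-file sentinel (a small untested sketch, not the code I used):

import csv

with open('WithDuplicates.csv', newline='') as f:
    rows = csv.reader(f)
    header = next(rows, None)     # None here would mean the file was empty
    row = next(rows, None)
    while row is not None:        # the None default marks the end of the reader
        print(row)
        row = next(rows, None)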
#7
If the file is already sorted on the first column, you can invoke itertools.groupby(), which groups runs of adjacent rows that share the same key.
# untested code but you get the idea
import csv
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield next(group)

def main():
    with open('input.csv', newline='') as ifh, open('output.csv', 'w', newline='') as ofh:
        rd = csv.reader(ifh)
        wt = csv.writer(ofh)
        wt.writerows(unique_rows(rd))

if __name__ == '__main__':
    main()
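A quick way to see what unique_rows() keeps (made-up rows, just for illustration):

rows = [['5', 'a', 'b'], ['5', 'c', 'd'], ['6', 'e', 'f']]
print(list(unique_rows(rows)))   # [['5', 'a', 'b'], ['6', 'e', 'f']]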
#8
(Sep-04-2021, 06:15 PM)bts001 Wrote: Store the extracted data in a new column or seperate csv file
(Sep-05-2021, 06:44 PM)Gribouillis Wrote: you can invoke itertools.groupby()
Storing the data from the duplicates is the tricky part.
#9
SamHobbs Wrote:Storing the data from the duplicates is the tricky part.
Oh, I see what you mean. It does not seem that tricky; you could replace the unique_rows() function above with this one:
def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))
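With the sample data from post #4, the two rows 5,data9,data10 and 5,data11,data12 then come out as the single row ('5', 'data9', 'data10', 'data11', 'data12').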
#10
(Sep-05-2021, 05:57 PM)SamHobbs Wrote: I assume in the future you would prefer that I leave your questions for the experts.

Not at all!
I am very grateful for all answers!
I just don't consider myself proficient.
I can usually make things work, but, elegant??

