Python Forum
Dealing with duplicated data in a CSV file
#1
I am new to the programming world and I am looking for advice on the scenario below:

I have a csv file with four columns of data and I want to extract specific cells within the file. I want to do this through these steps:
1) Analyze the first column for duplicates
2) Using the first duplicate row, extract the values in the second and third columns.
3) Store the extracted data in a new column or separate csv file.
4) Repeat for all duplicates

Note: I am not trying to remove duplicates; in fact, I am trying to target them and keep only the first row of each set of duplicates.
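To make it concrete, here is a rough, untested sketch of the sort of thing I am picturing (I am guessing at a pandas-style approach; input.csv, extracted.csv and the column positions are just placeholders):

import pandas as pd

# rough, untested sketch: input.csv has four columns
df = pd.read_csv('input.csv')
key = df.columns[0]

# rows whose first-column value occurs more than once
dups = df[df.duplicated(subset=key, keep=False)]

# keep only the first row of each duplicated value
firsts = dups.drop_duplicates(subset=key, keep='first')

# save the second and third columns to a separate csv file
firsts.iloc[:, 1:3].to_csv('extracted.csv', index=False)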

What would be the easiest way to approach this?

Thanks!
#2
There are a couple of details that should be clarified.

First, can you assume the data is sorted by the first column? Can you rely on all duplicates to be together?

Second, is it possible for there to be more than one duplicate? If so then what do you need to do for more than one duplicate?
#3
(Sep-04-2021, 10:37 PM)SamHobbs Wrote: There are a couple of details that should be clarified.

First, can you assume the data is sorted by the first column? Can you rely on all duplicates to be together?

Second, is it possible for there to be more than one duplicate? If so then what do you need to do for more than one duplicate?

Yes, the data is sorted by the first column, all duplicates will be together, and at most there will only be one duplicate. For clarification, the first column will look something like 1,2,3,4,5,5,6,7,8,8,9,9,10, etc. What I need is the data in the row where each duplicated value first appears, so the first 5, the first 8, and the first 9 in that example.
#4
I am new to Python but I have been programming for half a century. The following uses techniques I might have used in COBOL, but in Python of course. My data is a bit different from what you specify, but I hope it is close enough. When you ask questions, providing a small amount of sample data, as in the following, helps those who want to help you. I am using:

key,f1,f2
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10
5,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18
8,data19,data20
9,data21,data22
9,data23,data24
10,data25,data26

The following:

import sys
import csv

with open('WithDuplicates.csv', newline='') as csvfile:
    dupsreader = csv.reader(csvfile, delimiter=',')
    datalist = list(dupsreader)   # read the whole file into memory
    n = len(datalist)
    print(f"{n} records")
    if n < 3:
        print("Not enough data")
        sys.exit()
    x = 2   # row 0 is the header, so start comparing at row 2
    while x < n:
        if datalist[x][0] == datalist[x-1][0]:   # duplicate?
            print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]},{datalist[x][1]},{datalist[x][2]}")
            x = x + 2    # skip duplicate
        else:
            print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]}")
            x = x + 1
    if x == n:   # the loop above never printed the final row
        print(f"{datalist[n-1][0]},{datalist[n-1][1]},{datalist[n-1][2]}")
Produces:

14 records
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18,data19,data20
9,data21,data22,data23,data24
10,data25,data26

That does not write the data out as a CSV file, but I hope it is close enough; I am not sure exactly what output you need. Note that the data has a header. Also note that this reads the entire file into memory. That should be okay (computers have ample resources compared to half a century ago) unless it is a really big file.
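If the file really were too big for memory, a streaming variant along the same lines (untested, keeping only the previous row around) could be:

import csv

with open('WithDuplicates.csv', newline='') as csvfile:
    rows = csv.reader(csvfile, delimiter=',')
    next(rows, None)                # skip the header
    prev = next(rows, None)         # the row waiting to be printed
    for row in rows:
        if prev is not None and row[0] == prev[0]:   # duplicate key?
            print(f"{prev[0]},{prev[1]},{prev[2]},{row[1]},{row[2]}")
            prev = None             # both rows of the pair are consumed
        else:
            if prev is not None:
                print(f"{prev[0]},{prev[1]},{prev[2]}")
            prev = row
    if prev is not None:            # flush the final unconsumed row
        print(f"{prev[0]},{prev[1]},{prev[2]}")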
#5
I'm sure the experts here can do this much more elegantly, but this works.

After doing PHP all morning, Python is a pleasant relief!! (I can't handle PHP)

import csv

path2csv = '/home/pedro/myPython/csv/randomdups.csv'

# read the data file in
# csv.reader gives a one-shot iterator: once you have looped over it, it's gone,
# so, at least while you are experimenting, read everything into data first
# data will be a list of lists that you can use more than one time
with open(path2csv) as infile:
    answers = csv.reader(infile)
    data = list(answers)

# for info
for d in data:
    print(d)

"""
['col1', 'col2', 'col3']
['eggs', '25', '28']
['bananas', '3', '46']
['diamonds', '54', '63']
['apples', '15', '12']
['pears', '55', '11']
['pumpkins', '2', '22']
['eggs', '9', '8']
['bananas', '99', '101']
['apples', '14', '33']
['pears', '61', '17']
['pumpkins', '87', '45']
['rust', '13', '87']
['eggs', '88', '46']
['bananas', '89', '47']
['apples', '90', '48']
['pears', '91', '49']
['pumpkins', '92', '50']
"""

# get rid of the column headers
del data[0]

# there can be no duplicates in a set
# declare an empty set

unique_items = set()

# get a set of all items 

for d in data:
    unique_items.add(d[0])

# just for info
for u in unique_items:
    print(u)

# make a dictionary where all values are 0

item_num_dict = {}

for item in unique_items:
    item_num_dict[item] = 0

# now count the number of occurrences of each item
for item in unique_items:    
    for d in data:
        if d[0] == item:
            item_num_dict[item] +=1
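# (aside: collections.Counter could do this count in one pass, something like
#     from collections import Counter
#     item_num_dict = Counter(d[0] for d in data)
# but the explicit loops above keep every step visible)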

           
savepath = '/home/pedro/myPython/csv/'

# get first example of a duplicate key in a dictionary

first_example_data = {}

# a function to get the first example of a duplicate key

def getFirstExample(key):
    for d in data:
        if key == d[0] and item_num_dict[key] > 1:
            # get the data as a tuple
            first_example_data[key] = (d[1], d[2])
            # bail out after the first example
            return

# get the first example of duplicate items and save to a dictionary: first_example_data
    
for key in item_num_dict.keys():
    getFirstExample(key)
     
for key in first_example_data.keys():
    info_list = [key, first_example_data[key][0], first_example_data[key][1]]
    savename = savepath + key + '_first_example.csv'
    with open(savename, mode='w', newline='') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        fieldnames = ['item', 'data1', 'data2']
        f_writer.writerow(fieldnames)
        f_writer.writerow(info_list)
        print('First duplicate values saved to', savename)

print('Makes a change from that complicated PHP shit ... ')
And if you're offering me diamonds and rust
I've already paid
#6
(Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works.
I assume in the future you would prefer that I leave your questions for the experts.

Note that I began by using csvreader.next, but when I searched for a way to detect the end of the file I could not find an "elegant" solution. Most other languages, including COBOL, would have an easy way to detect the end.
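One idiom that comes close is giving next() a default value to act as an end-of-file sentinel (a small untested sketch, not the code I used):

import csv

with open('WithDuplicates.csv', newline='') as f:
    rows = csv.reader(f)
    header = next(rows, None)     # None here would mean the file was empty
    row = next(rows, None)
    while row is not None:        # the None default marks the end of the reader
        print(row)
        row = next(rows, None)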
#7
If the file is already sorted on the first column, you can invoke itertools.groupby(), which groups runs of adjacent rows that share the same key.
# untested code but you get the idea
import csv
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield next(group)

def main():
    with open('input.csv', newline='') as ifh, open('output.csv', 'w', newline='') as ofh:
        rd = csv.reader(ifh)
        wt = csv.writer(ofh)
        wt.writerows(unique_rows(rd))

if __name__ == '__main__':
    main()
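A quick way to see what unique_rows() keeps (made-up rows, just for illustration):

rows = [['5', 'a', 'b'], ['5', 'c', 'd'], ['6', 'e', 'f']]
print(list(unique_rows(rows)))   # [['5', 'a', 'b'], ['6', 'e', 'f']]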
#8
(Sep-04-2021, 06:15 PM)bts001 Wrote: Store the extracted data in a new column or seperate csv file
(Sep-05-2021, 06:44 PM)Gribouillis Wrote: you can invoke itertools.groupby()
Storing the data from the duplicates is the tricky part.
#9
SamHobbs Wrote:Storing the data from the duplicates is the tricky part.
Oh, I see what you mean. It does not seem that tricky; you could replace the unique_rows() function above with this one:
def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))
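With the sample data from post #4, the two rows 5,data9,data10 and 5,data11,data12 then come out as the single row ('5', 'data9', 'data10', 'data11', 'data12').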
#10
(Sep-05-2021, 05:57 PM)SamHobbs Wrote: I assume in the future you would prefer that I leave your questions for the experts.

Not at all!
I am very grateful for all answers!
I just don't consider myself proficient.
I can usually make things work, but, elegant??

