Dealing with duplicated data in a CSV file - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Dealing with duplicated data in a CSV file (/thread-34829.html)
Dealing with duplicated data in a CSV file - bts001 - Sep-04-2021

I am new to the programming world and I am looking for advice on the scenario below. I have a CSV file with four columns of data, and I want to extract specific cells within the file through these steps:

1) Analyze the first column for duplicates.
2) Using the first duplicate row, extract the values in the second and third columns.
3) Store the extracted data in a new column or a separate CSV file.
4) Repeat for all duplicates.

Note: I am not trying to remove duplicates; in fact, I am trying to target them and keep only the first duplicate row of each. What would be the easiest way to approach this? Thanks!

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-04-2021

There are a couple of details that should be clarified. First, can you assume the data is sorted by the first column? Can you rely on all duplicates being together? Second, is it possible for there to be more than one duplicate? If so, what do you need to do for more than one duplicate?

RE: Dealing with duplicated data in a CSV file - bts001 - Sep-04-2021

(Sep-04-2021, 10:37 PM)SamHobbs Wrote: There are a couple of details that should be clarified.

Yes, the data is sorted by the first column, all duplicates will be together, and at most there will only be one duplicate. For clarification, the first column will look something like 1,2,3,4,5,5,6,7,8,8,9,9,10, etc. What I need is the data in the corresponding row of the first time one of the duplicates is mentioned, so 1,2,3,4,*5*,5,6,7,*8*,8,*9*,9,10 (the starred rows are all I need).

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-05-2021

I am new to Python, but I have been programming for half a century. The following uses techniques I might have used in COBOL, but using Python of course. My data is a bit different from what you specify, but I hope it is close enough.
When you ask questions, providing a small amount of sample data as in the following helps those who want to help you. I am using:

```
key,f1,f2
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10
5,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18
8,data19,data20
9,data21,data22
9,data23,data24
10,data25,data26
```

The following:

```python
import sys
import csv

with open('WithDuplicates.csv', newline='') as csvfile:
    dupsreader = csv.reader(csvfile, delimiter=',')
    datalist = list(dupsreader)
n = len(datalist)
print(f"{n} records")
if n < 3:
    print("Not enough data")
    sys.exit()
x = 2
while x < n:
    if datalist[x][0] == datalist[x-1][0]:  # duplicate?
        print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]},{datalist[x][1]},{datalist[x][2]}")
        x = x + 2  # skip the duplicate
    else:
        print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]}")
        x = x + 1
if x == n:  # the last row was not consumed as a duplicate, so print it too
    print(f"{datalist[n-1][0]},{datalist[n-1][1]},{datalist[n-1][2]}")
```

Produces:

```
14 records
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18,data19,data20
9,data21,data22,data23,data24
10,data25,data26
```

That does not write the data out as a CSV, but I hope that is close enough; I am not sure exactly what you need as output. Note that the data has a header. Also note that this (I believe) reads the entire file into memory. That should be okay (computers have ample resources compared to half a century ago) unless it is a really big file.

RE: Dealing with duplicated data in a CSV file - Pedroski55 - Sep-05-2021

I'm sure the experts here can do this much more elegantly, but this works. After doing PHP all morning, Python is a pleasant relief!!
(I can't handle PHP)

```python
import csv

path2csv = '/home/pedro/myPython/csv/randomdups.csv'
infile = open(path2csv)  # open the data file
answers = csv.reader(infile)
# csv.reader is annoying: it is exhausted after one pass,
# so, at least while you are experimenting, read it into data first
# data will be a list of lists you can use more than one time
data = []
for row in answers:
    data.append(row)
infile.close()

# for info
for d in data:
    print(d)
"""
['col1', 'col2', 'col3']
['eggs', '25', '28']
['bananas', '3', '46']
['diamonds', '54', '63']
['apples', '15', '12']
['pears', '55', '11']
['pumpkins', '2', '22']
['eggs', '9', '8']
['bananas', '99', '101']
['apples', '14', '33']
['pears', '61', '17']
['pumpkins', '87', '45']
['rust', '13', '87']
['eggs', '88', '46']
['bananas', '89', '47']
['apples', '90', '48']
['pears', '91', '49']
['pumpkins', '92', '50']
"""

# get rid of the column headers
del data[0]

# there can be no duplicates in a set, so collect the unique items
unique_items = set()
for d in data:
    unique_items.add(d[0])

# just for info
for u in unique_items:
    print(u)

# make a dictionary where all values are 0
item_num_dict = {}
for item in unique_items:
    item_num_dict[item] = 0

# now count the number of occurrences of each item
for item in unique_items:
    for d in data:
        if d[0] == item:
            item_num_dict[item] += 1

savepath = '/home/pedro/myPython/csv/'

# the first example of each duplicate key goes in this dictionary
first_example_data = {}

# a function to get the first example of a duplicate key
def getFirstExample(key):
    for d in data:
        if key == d[0] and item_num_dict[key] > 1:
            # store the data as a tuple
            first_example_data[key] = (d[1], d[2])
            # bail out after the first example
            return

# get the first example of each duplicated item
for key in item_num_dict.keys():
    getFirstExample(key)

# save each first example to its own CSV file
for key in first_example_data.keys():
    info_list = [key, first_example_data[key][0], first_example_data[key][1]]
    savename = savepath + key + '_first_example.csv'
    with open(savename, mode='w') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        fieldnames = ['item', 'data1', 'data2']
        f_writer.writerow(fieldnames)
        f_writer.writerow(info_list)
    print('First duplicate values saved to', savename)

print('Makes a change from that complicated PHP shit ... ')
```

And if you're offering me diamonds and rust, I've already paid.

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-05-2021

(Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works.

I assume in the future you would prefer that I leave your questions for the experts. Note that I began by using csvreader.next, but when I searched for a way to detect the end of the file I could not find an "elegant" solution. Most other languages, including COBOL, have an easy way to detect the end.

RE: Dealing with duplicated data in a CSV file - Gribouillis - Sep-05-2021

If the file is already sorted on the first column, you can invoke itertools.groupby():

```python
# untested code, but you get the idea
import csv
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield next(group)

def main():
    with open('input.csv') as ifh, open('output.csv', 'w') as ofh:
        rd = csv.reader(ifh)
        wt = csv.writer(ofh)
        wt.writerows(unique_rows(rd))

if __name__ == '__main__':
    main()
```

RE: Dealing with duplicated data in a CSV file - SamHobbs - Sep-05-2021

(Sep-04-2021, 06:15 PM)bts001 Wrote: Store the extracted data in a new column or separate csv file

(Sep-05-2021, 06:44 PM)Gribouillis Wrote: you can invoke itertools.groupby()

Storing the data from the duplicates is the tricky part.

RE: Dealing with duplicated data in a CSV file - Gribouillis - Sep-05-2021

SamHobbs Wrote: Storing the data from the duplicates is the tricky part.

Oh I see what you mean.
It does not seem that tricky; you could replace the unique_rows() function above with this one:

```python
def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))
```

RE: Dealing with duplicated data in a CSV file - Pedroski55 - Sep-05-2021

(Sep-05-2021, 05:57 PM)SamHobbs Wrote: (Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works.

I assume in the future you would prefer that I leave your questions for the experts.

Not at all! I am very grateful for all answers! I just don't consider myself proficient. I can usually make things work, but, elegant??
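The two groupby() variants discussed in the thread can be exercised on in-memory rows instead of files. This is a minimal sketch (the row values below are made up, modelled on SamHobbs' sample data), not part of any poster's code:

```python
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    # keep only the first row of each run of equal keys
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield next(group)

def merged_rows(rows):
    # keep the key once and append the remaining fields of every duplicate
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))

rows = [
    ["4", "data7", "data8"],
    ["5", "data9", "data10"],
    ["5", "data11", "data12"],
    ["6", "data13", "data14"],
]

print(list(unique_rows(rows)))
# → [['4', 'data7', 'data8'], ['5', 'data9', 'data10'], ['6', 'data13', 'data14']]
print(list(merged_rows(rows)))
# → [('4', 'data7', 'data8'), ('5', 'data9', 'data10', 'data11', 'data12'), ('6', 'data13', 'data14')]
```

Both rely on the file being sorted by key, since groupby() only groups consecutive equal keys.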
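One reading of the original request is narrower than the solutions above: keep only the first row of each *duplicated* key, and drop keys that appear once. If that is what is wanted, it can be sketched with collections.Counter; the sample rows and the output file name here are made up for illustration:

```python
import csv
from collections import Counter

rows = [
    ["1", "a1", "b1"],
    ["5", "a5", "b5"],
    ["5", "a5x", "b5x"],
    ["8", "a8", "b8"],
    ["8", "a8x", "b8x"],
    ["9", "a9", "b9"],
]

counts = Counter(r[0] for r in rows)  # occurrences of each key
seen = set()
first_dups = []
for r in rows:
    # first row of a key that occurs more than once
    if counts[r[0]] > 1 and r[0] not in seen:
        first_dups.append(r)
        seen.add(r[0])

# store the extracted rows in a separate CSV file
with open("duplicates.csv", "w", newline="") as f:
    csv.writer(f).writerows(first_dups)

print(first_dups)
# → [['5', 'a5', 'b5'], ['8', 'a8', 'b8']]
```

Because Counter needs a full pass over the data before filtering, this reads the whole file into memory, like the other answers, but it does not require the file to be sorted.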