Posts: 2
Threads: 1
Joined: Sep 2021
I am new to the programming world and I am looking for advice on the scenario below:
I have a csv file with four columns of data and I want to extract specific cells within the file. I want to do this through these steps:
1) Analyze the first column for duplicates
2) Using the first duplicate row, extract the value in the second and third column.
3) Store the extracted data in a new column or separate csv file
4) Repeat for all duplicates
Note: I am not trying to remove duplicates, in fact I am trying to target them and keep only the first duplicate row of each.
What would be the easiest way to approach this?
Thanks!
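The four steps above can be sketched with pandas, if that library is an option (the file names and sample data here are made up for illustration):

```python
import pandas as pd

# made-up sample data in the shape described: four columns, keys in the first
with open('input.csv', 'w') as fh:
    fh.write("4,a1,b1,c1\n5,a2,b2,c2\n5,a3,b3,c3\n6,a4,b4,c4\n8,a5,b5,c5\n8,a6,b6,c6\n")

df = pd.read_csv('input.csv', header=None, names=['key', 'col2', 'col3', 'col4'])

# step 1: rows whose key appears more than once
dups = df[df.duplicated('key', keep=False)]

# steps 2 and 4: keep only the first row of each duplicated key
firsts = dups.drop_duplicates('key', keep='first')

# step 3: store the key plus the second and third columns in a separate csv file
firsts[['key', 'col2', 'col3']].to_csv('duplicates_first.csv', index=False)
print(firsts[['key', 'col2', 'col3']].values.tolist())
```

`keep=False` marks every member of a duplicate group, and `drop_duplicates(..., keep='first')` then retains just the first row of each group, which is exactly the "target the duplicates but keep only the first" behaviour asked for.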
Posts: 52
Threads: 3
Joined: Sep 2021
Sep-04-2021, 10:37 PM
(This post was last modified: Sep-04-2021, 10:38 PM by SamHobbs.)
There are a couple of details that should be clarified.
First, can you assume the data is sorted by the first column? Can you rely on all duplicates to be together?
Second, is it possible for there to be more than one duplicate? If so then what do you need to do for more than one duplicate?
Posts: 2
Threads: 1
Joined: Sep 2021
(Sep-04-2021, 10:37 PM)SamHobbs Wrote: There are a couple of details that should be clarified.
First, can you assume the data is sorted by the first column? Can you rely on all duplicates to be together?
Second, is it possible for there to be more than one duplicate? If so then what do you need to do for more than one duplicate?
Yes, the data is sorted by the first column, all duplicates will be together, and at most there will be one duplicate. For clarification, the first column will look something like 1,2,3,4,5,5,6,7,8,8,9,9,10, etc. What I need is the data in the row where each duplicated value first appears, so from 1,2,3,4,5,5,6,7,8,8,9,9,10 I need the rows for 1, 2, 3, 4, the first 5, 6, 7, the first 8, the first 9, and 10.
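Given that clarification (sorted keys, duplicates adjacent, at most one duplicate per key), the first row of each pair can be found just by comparing each row's key to the next row's key. A minimal sketch with made-up rows:

```python
# made-up rows of [key, col2, col3]; keys sorted, duplicates adjacent
rows = [['4', 'a', 'b'], ['5', 'c', 'd'], ['5', 'e', 'f'], ['6', 'g', 'h']]

# a row is the "first of a duplicate pair" when its key equals the next row's key
firsts_of_dups = [rows[i] for i in range(len(rows) - 1)
                  if rows[i][0] == rows[i + 1][0]]
print(firsts_of_dups)   # [['5', 'c', 'd']]
```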
Posts: 52
Threads: 3
Joined: Sep 2021
I am new to Python but I have been programming for half a century. The following uses techniques I might have used in COBOL, but in Python of course. My data is a bit different from what you specify, but I hope it is close enough. When you ask questions, providing a small amount of sample data, as in the following, helps those who want to help you. I am using:
key,f1,f2
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10
5,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18
8,data19,data20
9,data21,data22
9,data23,data24
10,data25,data26
The following:
import sys
import csv

with open('WithDuplicates.csv', newline='') as csvfile:
    dupsreader = csv.reader(csvfile, delimiter=',')
    datalist = list(dupsreader)
n = len(datalist)
print(f"{n} records")
if n < 3:
    print("Not enough data")
    sys.exit()
x = 2
while x < n:
    if datalist[x][0] == datalist[x-1][0]:  # duplicate?
        print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]},{datalist[x][1]},{datalist[x][2]}")
        x = x + 2  # skip duplicate
    else:
        print(f"{datalist[x-1][0]},{datalist[x-1][1]},{datalist[x-1][2]}")
        x = x + 1
if x == n:  # the last row was not part of a duplicate pair, so print it too
    print(f"{datalist[n-1][0]},{datalist[n-1][1]},{datalist[n-1][2]}")
Produces:
14 records
1,data1,data2
2,data3,data4
3,data5,data6
4,data7,data8
5,data9,data10,data11,data12
6,data13,data14
7,data15,data16
8,data17,data18,data19,data20
9,data21,data22,data23,data24
10,data25,data26
That does not write the data out as a CSV, but I hope that is close enough; I am not sure exactly what you need as output. Note that the data has a header. Also note that this reads the entire file into memory. That should be okay (computers have ample resources compared to half a century ago) unless it is a really big file.
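For the really-big-file case, the same adjacent-pair comparison can be done while streaming one row at a time through a generator; a sketch, with a made-up file name and sample data:

```python
import csv

# tiny made-up sample file with a header, in the shape of the data above
with open('WithDuplicates.csv', 'w', newline='') as fh:
    fh.write("key,f1,f2\n4,a,b\n5,c,d\n5,e,f\n6,g,h\n")

def first_of_adjacent_dups(path):
    """Yield the first row of each adjacent duplicate pair without
    reading the whole file into memory."""
    with open(path, newline='') as fh:
        reader = csv.reader(fh)
        next(reader)                              # skip the header row
        prev = next(reader, None)
        for row in reader:
            if prev is not None and prev[0] == row[0]:
                yield prev                        # first row of the pair
            prev = row

firsts = list(first_of_adjacent_dups('WithDuplicates.csv'))
print(firsts)   # [['5', 'c', 'd']]
```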
Posts: 1,094
Threads: 143
Joined: Jul 2017
I'm sure the experts here can do this much more elegantly, but this works.
After doing PHP all morning, Python is a pleasant relief!! (I can't handle PHP)
import csv

path2csv = '/home/pedro/myPython/csv/randomdups.csv'
infile = open(path2csv)
# read the data file in
answers = csv.reader(infile)
# csv.reader is annoying, it's gone if you have to repeat,
# so, at least while you are experimenting, read to data[] first
# data will be a list of lists
# you can use it more than one time
data = []
# read answers to data, a list of lists
for row in answers:
    data.append(row)
# for info
for d in data:
    print(d)
"""
['col1', 'col2', 'col3']
['eggs', '25', '28']
['bananas', '3', '46']
['diamonds', '54', '63']
['apples', '15', '12']
['pears', '55', '11']
['pumpkins', '2', '22']
['eggs', '9', '8']
['bananas', '99', '101']
['apples', '14', '33']
['pears', '61', '17']
['pumpkins', '87', '45']
['rust', '13', '87']
['eggs', '88', '46']
['bananas', '89', '47']
['apples', '90', '48']
['pears', '91', '49']
['pumpkins', '92', '50']
"""
# get rid of the column headers
del data[0]
# there can be no duplicates in a set
# declare an empty set
unique_items = set()
# get a set of all items
for d in data:
    unique_items.add(d[0])
# just for info
for u in unique_items:
    print(u)
# make a dictionary where all values are 0
item_num_dict = {}
for item in unique_items:
    item_num_dict[item] = 0
# now count the number of occurrences of each item
for item in unique_items:
    for d in data:
        if d[0] == item:
            item_num_dict[item] += 1
savepath = '/home/pedro/myPython/csv/'
# first example of each duplicate key goes in a dictionary
first_example_data = {}
# a function to get the first example of a duplicate key
def getFirstExample(key):
    for d in data:
        if key == d[0] and item_num_dict[key] > 1:
            # get the data as a tuple
            first_example_data[key] = (d[1], d[2])
            # bail out after the first example
            return
# get the first example of duplicate items and save to a dictionary: first_example_data
for key in item_num_dict.keys():
    getFirstExample(key)
for key in first_example_data.keys():
    info_list = [key, first_example_data[key][0], first_example_data[key][1]]
    savename = savepath + key + '_first_example.csv'
    # newline='' avoids extra blank lines on Windows
    with open(savename, mode='w', newline='') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        fieldnames = ['item', 'data1', 'data2']
        f_writer.writerow(fieldnames)
        f_writer.writerow(info_list)
    print('First duplicate values saved to', savename)
print('Makes a change from that complicated PHP shit ... ')
And if you're offering me diamonds and rust
I've already paid
Posts: 52
Threads: 3
Joined: Sep 2021
(Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works. I assume in the future you would prefer that I leave your questions for the experts.
Note that I began by using csvreader.next but when I searched for a way to detect the end of the file I could not find an "elegant" solution. Most other languages including COBOL would have an easy way to detect the end.
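For what it's worth, one common Python idiom for end-of-file detection with csv.reader is the two-argument form of next(), which returns a sentinel instead of raising StopIteration; a small standalone sketch:

```python
import csv
import io

# in-memory csv stands in for a file here
reader = csv.reader(io.StringIO("a,1\nb,2\n"))
rows = []
while True:
    row = next(reader, None)   # None signals end of file
    if row is None:
        break
    rows.append(row)
print(rows)   # [['a', '1'], ['b', '2']]
```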
Posts: 4,803
Threads: 77
Joined: Jan 2018
If the file is already sorted on the first column, you can invoke itertools.groupby()
# untested code but you get the idea
import csv
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield next(group)

def main():
    # newline='' is what the csv module docs recommend when opening csv files
    with open('input.csv', newline='') as ifh, open('output.csv', 'w', newline='') as ofh:
        rd = csv.reader(ifh)
        wt = csv.writer(ofh)
        wt.writerows(unique_rows(rd))

if __name__ == '__main__':
    main()
Posts: 52
Threads: 3
Joined: Sep 2021
(Sep-04-2021, 06:15 PM)bts001 Wrote: Store the extracted data in a new column or separate csv file
(Sep-05-2021, 06:44 PM)Gribouillis Wrote: you can invoke itertools.groupby()
Storing the data from the duplicates is the tricky part.
Posts: 4,803
Threads: 77
Joined: Jan 2018
Sep-05-2021, 09:14 PM
(This post was last modified: Sep-05-2021, 09:14 PM by Gribouillis.)
SamHobbs Wrote: Storing the data from the duplicates is the tricky part.
Oh, I see what you mean. It does not seem that tricky; you could replace the unique_rows() function above with this one:
def unique_rows(rows):
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))
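A quick standalone check of what that generator yields, using a couple of the sample rows from earlier in the thread:

```python
import itertools as itt
from operator import itemgetter

def unique_rows(rows):
    # groupby assumes the rows are already sorted on the key column
    for key, group in itt.groupby(rows, key=itemgetter(0)):
        # flatten the trailing fields of every row in the group onto the key
        yield (key,) + tuple(itt.chain.from_iterable(r[1:] for r in group))

rows = [['5', 'data9', 'data10'],
        ['5', 'data11', 'data12'],
        ['6', 'data13', 'data14']]
print(list(unique_rows(rows)))
# [('5', 'data9', 'data10', 'data11', 'data12'), ('6', 'data13', 'data14')]
```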
Posts: 1,094
Threads: 143
Joined: Jul 2017
(Sep-05-2021, 05:57 PM)SamHobbs Wrote: (Sep-05-2021, 08:12 AM)Pedroski55 Wrote: I'm sure the experts here can do this much more elegantly, but this works.
I assume in the future you would prefer that I leave your questions for the experts.
Not at all!
I am very grateful for all answers!
I just don't consider myself proficient.
I can usually make things work, but, elegant??