Python Forum
Dealing with duplicated data in a CSV file
I'm sure the experts here can do this much more elegantly, but this works.

After doing PHP all morning, Python is a pleasant relief!! (I can't handle PHP)

import csv

path2csv = '/home/pedro/myPython/csv/randomdups.csv'

# read the data file in; the with block closes the file afterwards
# newline='' is what the csv module docs recommend when opening csv files
# a csv.reader can only be looped over once, so, at least while
# you are experimenting, copy the rows to data first
# data will be a list of lists that you can use more than one time
with open(path2csv, newline='') as infile:
    data = list(csv.reader(infile))

# for info
for d in data:
    print(d)

"""
['col1', 'col2', 'col3']
['eggs', '25', '28']
['bananas', '3', '46']
['diamonds', '54', '63']
['apples', '15', '12']
['pears', '55', '11']
['pumpkins', '2', '22']
['eggs', '9', '8']
['bananas', '99', '101']
['apples', '14', '33']
['pears', '61', '17']
['pumpkins', '87', '45']
['rust', '13', '87']
['eggs', '88', '46']
['bananas', '89', '47']
['apples', '90', '48']
['pears', '91', '49']
['pumpkins', '92', '50']
"""

# get rid of the column headers
del data[0]

# there can be no duplicates in a set
# declare an empty set

unique_items = set()

# get a set of all items 

for d in data:
    unique_items.add(d[0])

# just for info
for u in unique_items:
    print(u)

# make a dictionary where all values are 0

item_num_dict = {}

for item in unique_items:
    item_num_dict[item] = 0

# now count the number of occurrences of each item
for item in unique_items:
    for d in data:
        if d[0] == item:
            item_num_dict[item] += 1

savepath = '/home/pedro/myPython/csv/'

# dictionary to hold the first example of each duplicated item

first_example_data = {}

# a function to get the first example of a duplicate key

def getFirstExample(key):
    for d in data:
        if key == d[0] and item_num_dict[key] > 1:
            # get the data as a tuple
            first_example_data[key] = (d[1], d[2])
            # bail out after first example
            return

# get the first example of duplicate items and save to a dictionary: first_example_data
    
for key in item_num_dict.keys():
    getFirstExample(key)
     
for key in first_example_data.keys():
    info_list = [key, first_example_data[key][0], first_example_data[key][1]]
    savename = savepath + key + '_first_example.csv'
    # newline='' stops the csv module writing blank lines between rows on Windows
    with open(savename, mode='w', newline='') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        fieldnames = ['item', 'data1', 'data2']
        f_writer.writerow(fieldnames)
        f_writer.writerow(info_list)
    print('First duplicate values saved to', savename)

print('Makes a change from that complicated PHP shit ... ')
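
For comparison, here is a shorter sketch of the same idea using collections.Counter from the standard library. It is only an illustration under a few assumptions: the helper name first_duplicate_rows is made up for the example, the item name is assumed to sit in column 0, and a small inline sample (read through io.StringIO) stands in for the real randomdups.csv. It prints the first row of each duplicated item instead of writing one file per item.

```python
import csv
import io
from collections import Counter


def first_duplicate_rows(rows):
    """Return {item: first_row} for every item that appears more than once.

    rows is a list of lists without the header row.
    The item name is assumed to be in column 0.
    """
    # count how often each item name occurs
    counts = Counter(row[0] for row in rows)
    first_rows = {}
    for row in rows:
        key = row[0]
        if counts[key] > 1:
            # setdefault keeps the first row seen and ignores later ones
            first_rows.setdefault(key, row)
    return first_rows


# small inline sample so the sketch runs without a file on disk
sample = """col1,col2,col3
eggs,25,28
bananas,3,46
diamonds,54,63
eggs,9,8
bananas,99,101
rust,13,87
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # skip the column headers
rows = list(reader)

result = first_duplicate_rows(rows)
for item, row in result.items():
    print(item, row[1], row[2])
```

Counter does the counting in one pass over the data, so the nested loop over unique_items and data is not needed, and items that occur only once (diamonds and rust in the sample) never make it into the result.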
And if you're offering me diamonds and rust
I've already paid

Messages In This Thread
RE: Dealing with duplicated data in a CSV file - by Pedroski55 - Sep-05-2021, 08:12 AM
