Python Forum

Full Version: Combine Duplicate CVS Column Values
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hi All,
This is my first cry for help Big Grin

I'm a Python Noob and I have been progressing fairly well (I started learning Python about a month ago). I have come to the point in my code that I simply cannot figure out. I am trying to take a simple .csv file and tally duplicates and combine values. I have tried dictionaries, tuples, sets, etc etc but I am simply missing something.

My production .csv is about 150k lines long and is a log for commodities market state change messages... so I created a basic example of what I am dealing with. I have pretty good luck with regex and could probably solve this problem via regex.. but I know that is slow.

Here is my example .csv... lets say I'm a car dealer that owns several car lots.. so I have duplicates in the list.

ford, 3
ford, 3
honda, 1
dodge, 8
dodge, 2
dodge, 5
ford, 6
dodge, 2
chevy, 1
dodge, 5

My goal is to have a combined list that would look like this... the order in the result does not matter.. I just need totals

chevy, 1
dodge, 22
ford, 9
honda, 1


So far I have coded the following that parses the .csv and creates a list of car makes, and another list that has the car make and quantity.


import csv

log = ('C:\\Users\\' + 'jcastleb' + '\\Desktop\\' + 'cars.csv')


f = open(log)
csv_f = csv.reader(f)

#Create empty lists
carlist = []
car_count = []
result = []

for row in csv_f:
    #Create list of Car Makes and Numbers
    car_count.append(row)

    #Create the list of cars
    if row[0] not in carlist:
        carlist.append(row[0])

    if row[0] not in result:
        result.append(row[0])

print(carlist)
print(car_count)
I get the following when I run the script:

Output:
['ford', 'honda', 'dodge', 'chevy'] [['ford', ' 3'], ['ford', ' 3'], ['honda', ' 1'], ['dodge', ' 8'], ['dodge', ' 2'], ['dodge', ' 5'], ['ford', ' 6'], ['dodge', ' 2'], ['chevy', ' 1'], ['dodge', ' 5']]


If anyone can advise on how I continue the code to get the total I would be greatly appreciative!  If there is a more efficient method then mine please do let me know. Big Grin

I have a hunch that a dictionary may be the way to go but I could not figure out how to append the key values Sad

Thanks all!
>>> cars={'ford':3,'chevy':2}
>>> 'ford' in cars
True
>>> 'dodge' in cars
False
>>> cars['ford']+=3
>>> cars['dodge']=5
>>> cars
{'chevy': 2, 'dodge': 5, 'ford': 6}
You can make it even simpler using  a defaultdict. A defaultdict creates the item for you with a default value
when you reference it:
>>>import collections
# The dict will set a default value by calling a function
# so we give it an anonymous function that always returns 0 
>>> cars=collections.defaultdict(lambda:0)
# checking what happens when we reference something we haven't got yet
>>> cars['ford']
0
>>> cars['ford']+=3
# We can blindly add stuff, "dodge' isn't in the dict yet
>>> cars['dodge']+=8
>>> cars['dodge']+=3
>>> cars
defaultdict(<function <lambda> at 0x7f0131d4d050>, {'dodge': 11, 'ford': 3})
Thank you Ofnuts! Your solution worked great. Big Grin Dance

This is the final code using the method you suggested.

import collections
import csv
from timeit import default_timer as timer
start = timer()
#Set up log path and csv reader
log = ('C:\\Users\\' + 'jcastleb' + '\\Desktop\\' + 'endexlog.csv')
f = open(log)
csv_f = csv.reader(f)
#Create default dictionary 
counts = collections.defaultdict(lambda:0)
#Create empty lists
market_types = []
market_msg = []
strip_counts = []
#Function to update Market mnessage list
def market_message_parser():
        for row in csv_f:
            market_msg.append(row)
#Function to update dictionary with market names
def dictionary_names():
    for row in market_msg:
        if row[2] not in market_types:
            counts[(row[2])]
 
#Function to pull number of strips per market and update strip counts
def strip_parser():
    for row in market_msg:
        pull = (row[2], row[5])
        strip_counts.append(pull)
        
#Run List functions
market_message_parser()
dictionary_names()
strip_parser()
#update dictionary with counts
for row in strip_counts:
    name = row[0]
    number = int(row[1])
    counts[name] += number
print('Final Dictionary calculated:')
print()
print(counts)
print()
print()
print('Time to calculate')
end = timer()
print(end - start)
Think it look better to set list in defaultdict.
The sum that list up.
Eg.
from collections import defaultdict

data = '''\
ford, 3
ford, 3
honda, 1
dodge, 8
dodge, 2
dodge, 5
ford, 6
dodge, 2
chevy, 1
dodge, 5'''

data = [i.split(', ') for i in data.split('\n')]
d = defaultdict(list)
[d[k].append(int(v)) for k, v in data]
print({k: sum(v) for (k, v) in d.items()})
Output:
{'dodge': 22, 'honda': 1, 'ford': 12, 'chevy': 1}