Improving A Mean Algorithm

dwainetrain · Aug-05-2017, 06:43 PM

Hello, I'm fairly new to Python, and especially new to NumPy.

Mission: I was tasked with writing a function that takes an arbitrary number of files, each containing an array filled with random numbers. The arrays within the files are all the same size. For each cell position, the function should take the value at that position from every file, compute the mean, and return the results as an array with the same dimensions as the arrays inside the files.

My solution: I'm really new to NumPy, and many of the attempts I've made in the last few days have been dead ends or too complicated to manage. I tried putting all the rows in an array and calculating the mean of each, but then I didn't know how to get them back into shape. Anyway, please look at the code below, which is my solution, and give me any feedback toward a more readable, elegant solution. What I really want is to be able to use NumPy's mean function, since it's more optimized than mine.

import numpy as np

def mean_datasets(input_files):

    # Read-In Files to get array shape and file count
    file_count = 0
    for file in input_files:
        data = np.genfromtxt(file, delimiter=',')
        columns = data.shape[1]
        rows = data.shape[0]
        file_count += 1

    # Initialize the calculation array with file information
    calc_array = np.zeros([rows,columns])

    # Go through each file and sum the same cell per file
    for file in input_files:
        data = np.genfromtxt(file, delimiter=',')
        row_num = 0
        for row in data:
            column_num = 0
            for cell in row:
                calc_array[row_num,column_num] += cell
                column_num += 1
            row_num += 1

    # Go through the calculation array and find the mean for each cell
    row_num = 0
    for row in calc_array:
        column_num = 0
        for cell in row:
            calc_array[row_num, column_num] = round(cell/file_count, 1)
            column_num += 1
        row_num += 1

    return calc_array


test_datasets = mean_datasets(['data1.csv', 'data2.csv', 'data3.csv']) #'data4.csv', 'data5.csv', 'data6.csv'])
print(test_datasets)
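For comparison, the nested loops above can be collapsed into whole-array arithmetic while keeping the same running-sum idea. This is only a sketch: the name mean_datasets_running is hypothetical, and the in-memory io.StringIO objects in the usage lines stand in for real CSV files so the example runs without anything on disk. It also reads each file once rather than twice.

```python
import io
import numpy as np

def mean_datasets_running(input_files):
    """Average same-shaped CSV arrays cell by cell using a running sum."""
    total = None
    file_count = 0
    for file in input_files:
        data = np.genfromtxt(file, delimiter=',')
        # Adding two arrays of the same shape adds them cell by cell.
        total = data if total is None else total + data
        file_count += 1
    if file_count == 0:
        raise ValueError("input_files is empty")
    # Dividing an array by a number divides every cell.
    return np.round(total / file_count, 1)

# Usage with two tiny stand-in "files" (assumed data, not from the post).
files = [io.StringIO("1,2\n3,4"), io.StringIO("3,4\n5,6")]
result = mean_datasets_running(files)
print(result)
```

Only one file's array plus the running total is held in memory at a time, which may matter if there are many large files.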

Here's an example dataset:

-9.4610,-0.9349,8.5322,1.0458
0.6367,-3.5322,0.5127,-3.8569
3.9008,7.1903,-9.1945,-4.0130
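To see what np.genfromtxt makes of a file like this, here is a minimal sketch that parses the same three rows from an in-memory string. The io.StringIO object is a stand-in for the CSV file on disk.

```python
import io
import numpy as np

# The example rows from above; io.StringIO stands in for a CSV file.
sample = io.StringIO(
    "-9.4610,-0.9349,8.5322,1.0458\n"
    "0.6367,-3.5322,0.5127,-3.8569\n"
    "3.9008,7.1903,-9.1945,-4.0130\n"
)
data = np.genfromtxt(sample, delimiter=',')

print(data.shape)  # (rows, columns) of the parsed array
print(data.dtype)  # element type chosen by genfromtxt
```

Each file therefore becomes one 2-D float array with 3 rows and 4 columns.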

Thank you for any help, advice, feedback!

dwainetrain · Aug-05-2017, 07:58 PM

It never fails: I post my problem in a forum, and then I solve it myself. Here's the shorter and more elegant version of the code, for anyone who's interested.

import numpy as np

def mean_datasets(input_files):
    data = [np.genfromtxt(file, delimiter=',') for file in input_files]
    mean = np.round(np.mean(data, axis = 0), 1)
    
    return mean
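
As a rough illustration of why axis=0 does the job: a list of same-shaped 2-D arrays is treated as one 3-D array whose first axis indexes the files, so averaging over axis 0 averages each cell across files and leaves the original rows and columns in place. The three small arrays below are made-up stand-ins for loaded files, and np.stack is used only to make the intermediate shape visible.

```python
import numpy as np

# Three hypothetical 2x2 "files", built in memory instead of read from disk.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[3.0, 4.0], [5.0, 6.0]])
c = np.array([[5.0, 6.0], [7.0, 8.0]])

stacked = np.stack([a, b, c])      # axes: (file, row, column)
cell_means = stacked.mean(axis=0)  # average over the file axis only

print(stacked.shape)
print(cell_means)
```

The result has the same 2x2 shape as each input, which is why no reshaping step is needed afterwards.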
