Python Forum
Improving A Mean Algorithm - Printable Version




Improving A Mean Algorithm - dwainetrain - Aug-05-2017

Hello, I'm fairly new to Python, and especially new to NumPy.

Mission: I was tasked with writing a function that takes an arbitrary number of input files, each containing an array filled with random numbers. The arrays in the files are all the same size. For each cell position, the function should take the value at that position from every file, compute the mean, and return the results as an array with the same dimensions as the arrays in the files.

My solution: I'm really new to NumPy, so many of the attempts I've made in the last few days have been dead ends or too complicated to manage. I tried putting all the rows in one array and calculating the mean on each, but then I didn't know how to get them back into shape. Anyway, please look at the code below, which is my solution, and give me any feedback toward a more readable, elegant version. What I really want is to be able to use NumPy's mean function, since it's more optimized than mine.

import numpy as np

def mean_datasets(input_files):

    # Read-In Files to get array shape and file count
    file_count = 0
    for file in input_files:
        data = np.genfromtxt(file, delimiter=',')
        columns = data.shape[1]
        rows = data.shape[0]
        file_count += 1

    # Initialize the calculation array with file information
    calc_array = np.zeros([rows,columns])

    # Go through each file and sum the same cell per file
    for file in input_files:
        data = np.genfromtxt(file, delimiter=',')
        row_num = 0
        for row in data:
            column_num = 0
            for cell in row:
                calc_array[row_num,column_num] += cell
                column_num += 1
            row_num += 1

    # Go through the calculation array and find the mean for each cell
    row_num = 0
    for row in calc_array:
        column_num = 0
        for cell in row:
            calc_array[row_num, column_num] = round(cell/file_count, 1)
            column_num += 1
        row_num += 1

    return calc_array


test_datasets = mean_datasets(['data1.csv', 'data2.csv', 'data3.csv']) #'data4.csv', 'data5.csv', 'data6.csv'])
print(test_datasets)

Here's an example dataset:

-9.4610,-0.9349,8.5322,1.0458
0.6367,-3.5322,0.5127,-3.8569
3.9008,7.1903,-9.1945,-4.0130

Thank you for any help, advice, feedback!
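
For reference, the dead end described above (flattening everything into rows, averaging, then not knowing how to get the result back into shape) can be made to work with a reshape. This is only a minimal sketch: the two small arrays are hypothetical stand-ins for data loaded from files, not the real datasets, and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical in-memory stand-ins for the arrays loaded from each file.
arrays = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    np.array([[3.0, 4.0], [5.0, 6.0]]),
]

# Remember the layout shared by every input array.
original_shape = arrays[0].shape

# Flatten each array into one row, so each column holds one cell position.
flat = np.array([a.ravel() for a in arrays])  # shape: (n_files, rows * columns)

# Average down the columns, then restore the original 2-D layout.
cell_means = flat.mean(axis=0).reshape(original_shape)
print(cell_means)
```

The reshape works because ravel() and reshape() use the same row-major ordering by default, so each averaged value lands back in the cell it came from.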


RE: Improving A Mean Algorithm - dwainetrain - Aug-05-2017

It never fails: I post my problem in a forum, and then I solve it myself. Here's the shorter and more elegant version of the code, for anyone who's interested.

import numpy as np

def mean_datasets(input_files):
    data = [np.genfromtxt(file, delimiter=',') for file in input_files]
    mean = np.round(np.mean(data, axis=0), 1)

    return mean
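
To see why axis=0 gives the per-cell mean, it may help to look at the shapes involved. The sketch below is an illustration under assumptions: the CSV text is hypothetical and is read from io.StringIO objects instead of real files, and np.stack is used to make explicit the stacking that np.mean otherwise does implicitly when handed a list of arrays.

```python
import io
import numpy as np

# Hypothetical CSV contents standing in for files such as data1.csv and data2.csv.
csv_a = "1.0,2.0\n3.0,4.0\n"
csv_b = "2.0,4.0\n6.0,8.0\n"

# genfromtxt accepts file-like objects as well as file names.
arrays = [np.genfromtxt(io.StringIO(text), delimiter=',') for text in (csv_a, csv_b)]

# Stack the 2-D arrays into one 3-D array: axis 0 indexes the files.
stacked = np.stack(arrays)
print(stacked.shape)  # (number of files, rows, columns)

# Averaging over axis 0 collapses the "file" axis and leaves a rows x columns array.
result = np.round(stacked.mean(axis=0), 1)
print(result)
```

Because every file contributes one layer along axis 0, the result keeps the rows and columns of the inputs without any manual bookkeeping. This relies on all arrays having the same shape, as stated in the original task.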