Python Forum
Improving A Mean Algorithm
#1
Hello, I'm fairly new to Python, and especially new to NumPy.

Mission: I was tasked with writing a function that takes an arbitrary number of input files, each containing an array of random numbers. The arrays in the files are all the same size. For each cell position, the function should take the value at that position from every file, compute the mean, and return the results as an array with the same dimensions as the arrays in the files.

My solution: I'm really new to NumPy, so many of the attempts I've made in the last few days have been dead ends or too complicated to manage. I tried putting all the rows in one array and calculating the mean of each, but then I didn't know how to get them back into shape. Anyway, please look at the code below, which is my solution, and give me any feedback toward a more readable, elegant solution. What I really want is to be able to use NumPy's mean function, since it's more optimized than mine.

import numpy as np

def mean_datasets(input_files):

    # Read-In Files to get array shape and file count
    file_count = 0
    for file in input_files:
        data = np.genfromtxt(file, delimiter=',')
        columns = data.shape[1]
        rows = data.shape[0]
        file_count += 1

    # Initialize the calculation array with file information
    calc_array = np.zeros([rows,columns])

    # Go through each file and sum the same cell per file
    for file in input_files:
        data = np.genfromtxt(file, delimiter=',')
        row_num = 0
        for row in data:
            column_num = 0
            for cell in row:
                calc_array[row_num,column_num] += cell
                column_num += 1
            row_num += 1

    # Go through the calculation array and find the mean for each cell
    row_num = 0
    for row in calc_array:
        column_num = 0
        for cell in row:
            calc_array[row_num, column_num] = round(cell/file_count, 1)
            column_num += 1
        row_num += 1

    return calc_array


test_datasets = mean_datasets(['data1.csv', 'data2.csv', 'data3.csv']) #'data4.csv', 'data5.csv', 'data6.csv'])
print(test_datasets)
Here's an example dataset:

-9.4610,-0.9349,8.5322,1.0458
0.6367,-3.5322,0.5127,-3.8569
3.9008,7.1903,-9.1945,-4.0130
Thank you for any help, advice, feedback!
#2
It never fails: I post my problem in a forum, and then I solve it myself. Here's the shorter and more elegant version of the code, for anyone who's interested.

import numpy as np

def mean_datasets(input_files):
    data = [np.genfromtxt(file, delimiter=',') for file in input_files]
    mean = np.round(np.mean(data, axis=0), 1)

    return mean
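
For anyone wondering why axis 0 is the right one here, this is a minimal sketch using small made-up arrays in place of the CSV files (the names a, b, c and their values are hypothetical, not from my data). Stacking the per-file arrays gives a 3D array of shape (files, rows, columns), and averaging over axis 0 collapses the file dimension while keeping the original 2D shape, so nothing needs to be reshaped afterwards.

```python
import numpy as np

# Hypothetical stand-ins for the arrays np.genfromtxt would return per file.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[3.0, 4.0], [5.0, 6.0]])
c = np.array([[5.0, 9.0], [7.0, 8.0]])

# Stack along a new leading axis: shape becomes (n_files, rows, columns).
stacked = np.stack([a, b, c])

# Averaging over axis 0 takes the mean of each cell across the files.
cell_means = np.round(stacked.mean(axis=0), 1)

print(stacked.shape)  # (3, 2, 2)
print(cell_means)
```

One possible reason to prefer an explicit np.stack over passing the list straight to np.mean: np.stack raises a ValueError when the arrays do not all have the same shape, so a malformed file is caught early.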