Apr-07-2018, 10:15 PM

I'm a self-trained, novice programmer, and need guidance to determine a more time-efficient way to analyze my data. I have a large dataset (200+ column sets, and each column set has two columns of x and y data, with roughly 166 rows), and want to complete linear regression tests on the x and y data within each set. The linear regression is an 11-row moving average, thus there are linear regression statistics (slope) for each row. This means a linear regression slope must be calculated for each day, and the adjacent 10 rows, for up to 166 linear regression per column set. When the difference between consecutive linear regression slopes is greater than or equal to a threshold (0.3), I'd like to identify and record the first occurrence. After recording/presenting the occurrence, I'd like to exit the analysis, and move on to the next column set to repeat the process.

The code below identifies the difference in linear regression slopes between rows for just one column set, and took roughly 7 minutes to run (exiting in row 235). At this rate, I'll be running code for 24+ hours. I'm sure there's a more efficient way, perhaps outside of openpyxl to complete this many linear regressions and test for a threshold difference - I just have no idea what it is. I only started using openpyxl because the data were in .xlsx files when I received them, and Python because I used it during an undergraduate programming course.

Any recommendations on code structure, alternative analysis methods or data format conversions would be helpful. If it's not already clear, my knowledge of coding is quite low. I greatly appreciate your patience in writing out what might otherwise be 'obvious' things everyone already knows.

The code below identifies the difference in linear regression slopes between rows for just one column set, and took roughly 7 minutes to run (exiting in row 235). At this rate, I'll be running code for 24+ hours. I'm sure there's a more efficient way, perhaps outside of openpyxl to complete this many linear regressions and test for a threshold difference - I just have no idea what it is. I only started using openpyxl because the data were in .xlsx files when I received them, and Python because I used it during an undergraduate programming course.

Any recommendations on code structure, alternative analysis methods or data format conversions would be helpful. If it's not already clear, my knowledge of coding is quite low. I greatly appreciate your patience in writing out what might otherwise be 'obvious' things everyone already knows.

from __future__ import print_function from scipy import stats from openpyxl import load_workbook ######################################################################### wb2 = load_workbook(filename='data1.xlsx',read_only=True) print(wb2.sheetnames) ws2 = wb2['summary'] mean = 15 # x data column reference stddev = mean + 36 # y data column reference for j in range(164, 330): i=j x = [ws2.cell(row=i,column=mean).value for i in range(i,i+11)] #11 row x data y = [ws2.cell(row=i,column=stddev).value for i in range(i,i+11)] #11 row y data x2 = [ws2.cell(row=i,column=mean).value for i in range(i+1,i+12)] #11 row x data (starting at next row down) y2 = [ws2.cell(row=i,column=stddev).value for i in range(i+1,i+12)] #11 row y data (starting at next row down) slope, intercept, r_value, p_value, std_err = stats.linregress(x,y) slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(x2,y2) thresholdslope=abs(slope2-slope) if thresholdslope>=0.3: threshold_value=ws2.cell(row=i+6,column=mean) threshold_date=ws2.cell(row=i+6,column=3) print("threshold slope=",thresholdslope) print("i=",i) print("i+6=",i+6) print("threshold value (i+6)",threshold_value.value) print("threshold date=",threshold_date.value) break