Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 Alternative approach to iterate numerous linear regressions with xlsx data?
#1
I'm a self-trained, novice programmer, and need guidance to determine a more time-efficient way to analyze my data. I have a large dataset (200+ column sets, and each column set has two columns of x and y data, with roughly 166 rows), and want to complete linear regression tests on the x and y data within each set. The linear regression is an 11-row moving average, thus there are linear regression statistics (slope) for each row. This means a linear regression slope must be calculated for each day, and the adjacent 10 rows, for up to 166 linear regression per column set. When the difference between consecutive linear regression slopes is greater than or equal to a threshold (0.3), I'd like to identify and record the first occurrence. After recording/presenting the occurrence, I'd like to exit the analysis, and move on to the next column set to repeat the process.

The code below identifies the difference in linear regression slopes between rows for just one column set, and took roughly 7 minutes to run (exiting in row 235). At this rate, I'll be running code for 24+ hours. I'm sure there's a more efficient way, perhaps outside of openpyxl to complete this many linear regressions and test for a threshold difference - I just have no idea what it is. I only started using openpyxl because the data were in .xlsx files when I received them, and Python because I used it during an undergraduate programming course.

Any recommendations on code structure, alternative analysis methods or data format conversions would be helpful. If it's not already clear, my knowledge of coding is quite low. I greatly appreciate your patience in writing out what might otherwise be 'obvious' things everyone already knows.

from __future__ import print_function
from scipy import stats
from openpyxl import load_workbook

#########################################################################

wb2 = load_workbook(filename='data1.xlsx',read_only=True)
print(wb2.sheetnames)
ws2 = wb2['summary']

mean = 15 # x data column reference
stddev = mean + 36 # y data column reference

for j in range(164, 330):
    i=j
    x = [ws2.cell(row=i,column=mean).value for i in range(i,i+11)] #11 row x data
    y = [ws2.cell(row=i,column=stddev).value for i in range(i,i+11)] #11 row y data
    x2 = [ws2.cell(row=i,column=mean).value for i in range(i+1,i+12)] #11 row x data (starting at next row down)
    y2 = [ws2.cell(row=i,column=stddev).value for i in range(i+1,i+12)] #11 row y data (starting at next row down)
    slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
    slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(x2,y2)
    thresholdslope=abs(slope2-slope)

    
    if thresholdslope>=0.3:
        threshold_value=ws2.cell(row=i+6,column=mean)
        threshold_date=ws2.cell(row=i+6,column=3)
        print("threshold slope=",thresholdslope)
        print("i=",i)
        print("i+6=",i+6)
        print("threshold value (i+6)",threshold_value.value)
        print("threshold date=",threshold_date.value)

        break
Quote

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  How to build linear regression by implementing Gradient Descent using only linear alg PythonSpeaker 1 172 Dec-01-2019, 05:35 PM
Last Post: Larz60+
Music Iterate over data and sum Madame32 8 785 Oct-14-2019, 05:54 PM
Last Post: Madame32
  Looking for Eval alternative scionsamurai 2 391 Jul-16-2019, 02:41 AM
Last Post: scionsamurai
  [pandas]How to liner fit time series data and get linear fit equation and r square Sri 5 746 Apr-04-2019, 12:00 PM
Last Post: Sri
  Thoughts on how to approach a project birdieman 10 4,084 Jul-09-2018, 09:02 AM
Last Post: alexblack
  Write data into existing Excel (xlsx) file with multiple sheets BNB 1 10,673 Jun-01-2017, 04:22 PM
Last Post: Larz60+

Forum Jump:


Users browsing this thread: 1 Guest(s)