Python Forum - Where's the endless loop?


Hi all,

Here's some code:

#goal here is to sum number of rows by DTE

df = pd.read_csv("C:/Users/Mark/Desktop/SPX_2021_copy.csv")

my_dict = dict(df['DTE'].value_counts())

for key in my_dict:
    if key < 251:
        plt.bar(my_dict.keys(),my_dict.values())

I'm able to print out my_dict.keys() and my_dict.values() so I know they're finite. As I stare at this, I don't see where the hang-up might be?

Mark

IT'S NOT AN INFINITE LOOP. It just takes an extremely long time and I'm not sure why... doing further analysis.

I modified the code to this:

#goal here is to sum number of rows by DTE
import time

start_time = time.time()

df = pd.read_csv("C:/Users/Mark/Desktop/SPX_2021_copy.csv")

my_dict = dict(df['DTE'].value_counts())

for num,key in enumerate(my_dict):
    if num%10 == 0:
        check_time = time.time()
        elap_time = check_time - start_time
        print(f'Row is {num}, elapsed time is {elap_time:.2f}, and projected time is {717/(num+1)*elap_time:.2f}.')
        
    if key < 251:
        plt.bar(my_dict.keys(),my_dict.values())

The .csv file is 307,910 rows by 16 columns. Here are a couple observations.

First, it takes a long time to complete the first if statement: about 70 seconds with the last print saying "Row is 710..." I can't explain the time. It seems to go slowly to a point (e.g. Row 200-400) and then all the rest print at once, showing the same elap_time.

Then, it takes a very long time for the graph to display and when it does, it shows the full range of x-values up to about 1100-1200. It's the same graph as I saw previously when I printed out the dictionary without any limitations (here I tried to restrict only to key < 251). Previously without limitations, it took less than 1 second. Now, it takes an additional 1-2 minutes after the first if block is complete. I can't explain why it shows 251-1200 or why it takes so long.

At no point should Python be looping through the entire .csv multiple times, should it?

Mark

I changed Line 17 to:

        plt.bar(key,my_dict[key])

The whole thing now takes 13 seconds with the graph correctly displaying only keys < 251. So... I see I did Line 17 wrong, but even at that, why does changing this line affect the speed of the first if block?

You are measuring how long it takes the for loop to do 10 plots.
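To make that concrete, here is a minimal sketch on synthetic data (the dictionary contents, the `limit` of 25, and the helper name `bars_drawn` are all made up for illustration; the real file is not used). It counts how many bar rectangles end up on the axes. With the original line, every pass through the loop adds a complete set of bars for the whole dictionary, so the work grows with (matching keys) x (all keys), and the chart shows every key regardless of the `if`. With the changed line, each pass adds one bar.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for my_dict: 100 DTE values, each with a made-up count.
my_dict = {dte: 100 + dte for dte in range(100)}


def bars_drawn(plot_whole_dict, limit=25):
    """Run the loop from the thread and return how many bars end up on the axes."""
    fig, ax = plt.subplots()
    for key in my_dict:
        if key < limit:
            if plot_whole_dict:
                # Original version: every pass adds a full set of bars.
                ax.bar(list(my_dict.keys()), list(my_dict.values()))
            else:
                # Changed version: every pass adds exactly one bar.
                ax.bar(key, my_dict[key])
    count = len(ax.patches)  # each bar is one Rectangle patch
    plt.close(fig)
    return count


whole = bars_drawn(plot_whole_dict=True)
single = bars_drawn(plot_whole_dict=False)
print(f"whole dict per pass: {whole} bars, one key per pass: {single} bars")
```

Scaled up to roughly 250 matching keys out of about 717, the original version would stack well over 100,000 rectangles on one chart, which is consistent with both the long loop time and the slow display.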

Why are you doing this:

plt.bar(key,my_dict[key])

I assume that plt is matplotlib.pyplot and that you want to plot df['DTE'] counts for df['DTE'] < 251. What your code does is make a bunch of separate plt.bar calls, each of which adds a single bar.

If you want counts for df['DTE'] < 251, let pandas do the work for you. In the example below I plot the counts of df['DTE'] for DTE values in the range 20 to 50.

import random
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'DTE':[random.randint(1,101) for _ in range(1000)]})
counts = df.loc[(df['DTE'] >= 20) & (df['DTE'] <= 50), 'DTE'].value_counts().sort_index(ascending=True)
counts.plot(kind='bar')
plt.show()

(Oct-01-2021, 02:41 PM)Mark17 Wrote: IT'S NOT AN INFINITE LOOP. It just takes an extremely long time and I'm not sure why... doing further analysis.

Did you read my post about using loops in Pandas? Here is the post again.

It should not be necessary to make a dict of df['DTE'].value_counts().
Work directly with the pandas.Series that value_counts() returns.
plt.bar should not be inside a loop.
Something like this:

value_counts = df['DTE'].value_counts()
# Filter on the index (the DTE values), not on the counts themselves
value_counts = value_counts[value_counts.index < 251].sort_index()
value_counts.plot()

I'll have to study both of your responses closely and better understand exactly how they work.

I guess part of me has been thinking the plots accumulate points until they are to be shown at which time all collected points are included. I tried this:

fig=plt.figure()
fig, ax = plt.subplots(2,2)

df = pd.read_csv("C:/Users/Mark/Desktop/SPX_2021_copy.csv")

my_dict = dict(df['DTE'].value_counts())

#for num,key in enumerate(my_dict):
for key in my_dict:
    if key < 251:
        ax[0,0].plot(key,my_dict[key])
    elif (key > 250 and key < 501):
        ax[0,1].plot(key,my_dict[key])
    elif (key > 500 and key < 751):
        ax[1,0].plot(key,my_dict[key])
    elif key > 750:
        ax[1,1].plot(key,my_dict[key])

This output the axes with correct label ranges, but blank graphs themselves with no data plotted.
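A likely explanation for the blank panels: each `ax.plot(key, value)` call creates a separate line containing a single point, and a line drawn with the default style has no marker, so there is nothing visible to draw between one point and itself. The axis limits still update from the data, which is why the ranges look right. The sketch below uses made-up (DTE, count) pairs rather than the real file and compares that pattern with a single call that passes all points and asks for a marker.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2)

# Hypothetical (DTE, count) pairs standing in for my_dict.
points = {10: 5, 20: 8, 30: 3}

# Left: one plot() call per point, as in the loop above. Each call creates
# a separate one-point line; with no marker set there is nothing to see.
for key, value in points.items():
    ax[0].plot(key, value)

# Right: pass all the points in a single call and request markers.
ax[1].plot(list(points.keys()), list(points.values()), marker="o")

lines_left = ax[0].get_lines()
lines_right = ax[1].get_lines()
print(len(lines_left), "separate one-point lines on the left")
print(len(lines_right), "line with", len(lines_right[0].get_xdata()), "points on the right")
```

Adding `marker="o"` to the per-point calls would also make them visible, but collecting the points and plotting once is the cheaper pattern.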

You can plot a single point at a time, but plotting all the points at once is sooooooo much faster. When you plot a bar at a time, every call makes matplotlib create new artists and update the axes limits, and who knows what else. When you plot all the points at once, that work is performed once.

When writing Python your goal should be to write as little code as possible. If you find yourself using a for loop, you should consider that you might be doing something wrong. If you have multiple if statements, it is likely you are doing something wrong. Your last example should not have a for loop or any if statements. You should have pandas select the ranges for the different plots, either by selecting subsets of the dataframe or subsets of the series (the latter is going to be faster).

import random
import matplotlib.pyplot as plt
import pandas as pd

def plot_range(plot, min_, max_, data):
    '''Essentially plot.bar(data[min_:max_])'''
    data = data[(data.index >= min_) & (data.index <= max_)]
    plot.bar(data.index, data.values)

df = pd.DataFrame({'DTE':[random.randint(1,101) for _ in range(1000)]})
counts = df['DTE'].value_counts().sort_index(ascending=True)
fig, ax = plt.subplots(2,2)
plot_range(ax[0, 0], 1, 25, counts)
plot_range(ax[0, 1], 26, 50, counts)
plot_range(ax[1, 0], 51, 75, counts)
plot_range(ax[1, 1], 76, 100, counts)
plt.show()

(Oct-01-2021, 08:24 PM)deanhystad Wrote: You can plot a single point at a time, but plotting all the points at once is sooooooo much faster. When you plot a bar at a time, every call makes matplotlib create new artists and update the axes limits, and who knows what else. When you plot all the points at once, that work is performed once.

When writing Python your goal should be to write as little code as possible. If you find yourself using a for loop, you should consider that you might be doing something wrong. If you have multiple if statements, it is likely you are doing something wrong. Your last example should not have a for loop or any if statements. You should have pandas select the ranges for the different plots, either by selecting subsets of the dataframe or subsets of the series (the latter is going to be faster).

That's exactly what I'm aiming to do... just don't fully understand it yet. I continue to work on it (including reading links like snippsat provided, which aren't totally sinking in yet but getting better).

Having said all this, then, here are two versions that both work. The first was adapted from my initial attempt and is clearly not where I want to be (for loop, multiple if statements). The second is adapted from your solution (I commented out the sort since output was same without it). Both take roughly 0.75 seconds on my computer. Any thoughts on why yours is not much faster?

%%timeit

import matplotlib.pyplot as plt
import pandas as pd

fig=plt.figure()
fig, ax = plt.subplots(2,2)

df = pd.read_csv("C:/Users/Mark/Desktop/SPX_2021_copy.csv")

my_dict = dict(df['DTE'].value_counts())

dict_250 = {}
dict_501 = {}
dict_751 = {}
dict_2000 = {}

for key in my_dict:
    if key < 251:
        dict_250[key]=my_dict[key]
    elif (key > 250 and key < 501):
        dict_501[key]=my_dict[key]
    elif (key > 500 and key < 751):
        dict_751[key]=my_dict[key]
    elif key > 750:
        dict_2000[key]=my_dict[key]

ax[0,0].bar(dict_250.keys(),dict_250.values())
ax[0,1].bar(dict_501.keys(),dict_501.values())
ax[1,0].bar(dict_751.keys(),dict_751.values())
ax[1,1].bar(dict_2000.keys(),dict_2000.values())

%%timeit 

import matplotlib.pyplot as plt
import pandas as pd

def plot_range(plot, min_, max_, data):
    '''Essentially plot.bar(data[min_:max_])'''
    data = data[(data.index >= min_) & (data.index <= max_)]
    plot.bar(data.index, data.values)

df = pd.read_csv("C:/Users/Mark/Desktop/SPX_2021_copy.csv")
counts = df['DTE'].value_counts()#.sort_index(ascending=True)
fig, ax = plt.subplots(2,2)
plot_range(ax[0, 0], 0, 250, counts)
plot_range(ax[0, 1], 251, 500, counts)
plot_range(ax[1, 0], 501, 750, counts)
plot_range(ax[1, 1], 751, 10000, counts)
#plt.show()

Your code was taking a long time when you added bars to the chart one at a time. Now you plot all the bars at once, just like I am doing. Looping to group the data will be slower than using the pandas functions, but the difference is so minor compared to the time it takes to plot that it is lost in the noise. I expect looping through your data takes less than a millisecond, and using the pandas functions might be 100 times faster. No discernible difference. If you were doing analysis on millions of data points you would notice a difference.
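Those figures are estimates rather than measurements, and on a series with only a few hundred entries the fixed overhead of a pandas call can matter as much as the loop itself. A minimal sketch for checking on your own machine follows. It assumes synthetic data in place of the CSV, and the function names `group_with_loop` and `group_with_pandas` are hypothetical. It times only the grouping step, leaving out the imports, read_csv, and the plotting that the %%timeit cells above also include.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV: 300,000 rows with DTE between 0 and 1199.
rng = np.random.default_rng(0)
df = pd.DataFrame({"DTE": rng.integers(0, 1200, size=300_000)})
counts = df["DTE"].value_counts()
my_dict = dict(counts)


def group_with_loop():
    """Split the counts into two dicts with a for loop and an if/else."""
    low, high = {}, {}
    for key in my_dict:
        if key < 251:
            low[key] = my_dict[key]
        else:
            high[key] = my_dict[key]
    return low, high


def group_with_pandas():
    """Split the counts with a boolean mask on the index."""
    mask = counts.index < 251
    return counts[mask], counts[~mask]


n = 200
loop_s = timeit.timeit(group_with_loop, number=n) / n
pandas_s = timeit.timeit(group_with_pandas, number=n) / n
print(f"loop: {loop_s * 1e6:.0f} us per call, pandas: {pandas_s * 1e6:.0f} us per call")
```

Whichever way the comparison comes out, both numbers should be tiny next to the time spent reading the file and drawing the figure, which is the point being made above.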
