Information how pandas works - Printable Version

Information how pandas works - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Information how pandas works (/thread-15879.html)

Pages: 1 2

Information how pandas works - gehrenfeld - Feb-04-2019

Here is my code and everything works fine. There are over 12000 lines in the CSV file.

My question is, does pandas read in the entire file before anything below the line runs?
Is there any way to speed this script up? I only really use the last 28 days of the file.

import pandas as pd
import datetime
import wx

# --------------------------------- read from csv ---------------------------
df = pd.read_csv('summary.csv', delimiter='\t', skiprows=3, header=None, names=[
    'id', 'rdate', 'rtype', 'hist', 'scan'], usecols=[0, 1, 2, 3, 4])

# ---------- Fill in 0 ---------------------------
df['hist'] = df['hist'].fillna(0)
df['scan'] = df['scan'].fillna(0)

# -------------- Change dtypes -------------------
df['rdate'] = df['rdate'].astype('datetime64')
df['hist'] = df['hist'].astype('int64')
df['scan'] = df['scan'].astype('int64')

# -------------------------------- move value from scan to hist ---------------------------------
df.loc[df['hist'] == 0, 'hist'] = df['scan']

# --------------------------- Select from DataFrame --------------
sd = (df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("28day")])
sr = (sd.loc[sd['rtype'] == 0])

# ---------------------------- Do calculations -------------------------------
average = round(sr['hist'].mean())
a1c = round((46.7 + sr['hist'].mean() + sr['scan'].mean()) / 28.7, 1)

# -----------------Display in wx.Frame ----------------------------------

class Example(wx.Frame):

    def __init__(self, *args, **kw):
        super(Example, self).__init__(*args, **kw)

        self.SetBackgroundColour('#85C1E9')

        self.InitUI()

    def InitUI(self):
        pnl = wx.Panel(self)

        font = wx.Font(16, wx.FONTFAMILY_SCRIPT,
                       wx.FONTSTYLE_NORMAL, wx.FONTWEIGHT_BOLD)
        font2 = wx.Font(12, wx.FONTFAMILY_SCRIPT,
                        wx.FONTSTYLE_NORMAL, wx.FONTWEIGHT_NORMAL)
        font3 = wx.Font(11, wx.FONTFAMILY_SCRIPT,
                        wx.FONTSTYLE_NORMAL, wx.FONTWEIGHT_NORMAL)

        heading = wx.StaticText(pnl, label='Average Glucose and A1c',
                                pos=(50, 15), size=(200, -1))
        heading.SetFont(font)

        st3 = wx.StaticText(pnl, label=sr.rdate.max().strftime('From: %b %d %y') + '  -  ' + str(sr.rdate.min().strftime('To: %b %d %y')),
                            pos=(70, 45))

        wx.StaticLine(pnl, pos=(6, 65), size=(328, 1))

        st1 = wx.StaticText(pnl, label='Average: ' +
                                       str("%.0f" % average), pos=(125, 80))
        st2 = wx.StaticText(pnl, label='A1c:          ' + str(a1c), pos=(125, 100))

        st1.SetFont(font2)
        st2.SetFont(font2)
        st3.SetFont(font3)

        btn = wx.Button(pnl, label='Close', pos=(140, 150))
        btn.Bind(wx.EVT_BUTTON, self.OnClose)

        self.SetSize((360, 230))
        self.SetTitle('A1c')
        self.Centre()

    def OnClose(self, e):
        self.Close(True)


def main():
    app = wx.App()
    ex = Example(None)
    ex.Show()
    app.MainLoop()


if __name__ == '__main__':
    main()

RE: Information how pandas works - stullis - Feb-05-2019

The read_csv() function will read the full file, yes. It returns a DataFrame object which has a method called query(), which returns a new DataFrame. With query(), you can abbreviate your DataFrame down to only those rows that satisfy the query expression.

Here's a short blog about using dates to filter data in Pandas: https://erikrood.com/Python_References/select_pandas_dataframe_rows_between_two_dates_final.html

You'll need a few lines at the top to determine your date range and then cut your DataFrame down to size.

RE: Information how pandas works - scidam - Feb-05-2019

I am not sure if this speed up reading the file in your case, but you can try to read last (e.g. 28) lines
of it using this approach. Further, you will need to pass these lines to Pandas to construct a data frame. The latter could be
done by this way.

RE: Information how pandas works - gehrenfeld - Feb-05-2019

Thank you for the suggestions.
There are 18 columns in the CSV files so I am only using 4 of them. I should have clarified that.

I didn't think of moving line 22 df = (df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("28day")]) and move it under line 6 that will cut my dataframe down.

Something I should have said also is that there are multiple dates that are the same. I measure my glucose 8 to 10 times a day so I can't just read the last 28 with tail().

Thank you for the help.

Update - I rearranged the code and removed the sd and sr dataframes and just used df for queries.
Seems to run a little faster.

import pandas as pd
import datetime
import wx

# --------------------------------- read from csv ---------------------------
df = pd.read_csv('summary.csv', delimiter='\t', skiprows=3, header=None, names=[
    'id', 'rdate', 'rtype', 'hist', 'scan'], usecols=[0, 1, 2, 3, 4])

# ---------- Fill in 0 ---------------------------
df['hist'] = df['hist'].fillna(0)
df['scan'] = df['scan'].fillna(0)

# -------------- Change dtypes -------------------
df['rdate'] = df['rdate'].astype('datetime64')
df['hist'] = df['hist'].astype('int64')
df['scan'] = df['scan'].astype('int64')

# --------------------------- Select from DataFrame --------------
df = df.loc[df['rtype'] == 0]
df = df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("14day")]

# -------------------------------- move value from scan to hist ---------------------------------
df.loc[df['hist'] == 0, 'hist'] = df['scan']

# ---------------------------- Do calculations -------------------------------
average = round(df['hist'].mean())
a1c = round((46.7 + df['hist'].mean() + df['scan'].mean()) / 28.7, 1)

print(df)

RE: Information how pandas works - scidam - Feb-06-2019

(Feb-05-2019, 12:29 PM)gehrenfeld Wrote: I measure my glucose 8 to 10 times a day so I can't just read the last 28 with tail()

I knew about .tail. The .tail method is not a solution. It works when the data is loaded and couldn't be
used to speedup loading of a huge file. So, you have 8 to 10 measurements per day. Each measurement is a row in the source file. Therefore, you need to read last 10 * 28 lines. Moreover, 11*28 lines will be
definitely sufficient.
When these 11*28 lines will be loaded into a data frame, you can do precise
filtering using pandas tools.
This is general approach. But 12k rows ... this is not too big.

RE: Information how pandas works - gehrenfeld - Feb-06-2019

scidam,

Thanks for the suggestions. I have added some of the rows in the CSV file. Because the calculation for A1c and for the line plot, I have to have the full last 28 days of readings.

The meter itself takes a reading every 15 mins so more than just finger sticks.
The file begins on 2019-07-23 and ends at the current date and it grows every day. Currently, there are 12949.

The company that makes the meter and software only download the entire file. I have to go through the entire file to get the last 28 days of readings

I don't see any way to make it faster than it is.

Quote:14183 2019/02/03 00:11 0 88
14184 2019/02/03 00:26 0 94
14185 2019/02/03 00:41 0 97
14186 2019/02/03 00:56 0 95
14187 2019/02/03 01:11 0 95
14188 2019/02/03 01:26 0 97
14189 2019/02/03 01:41 0 100
14190 2019/02/03 01:56 0 97
14191 2019/02/03 02:11 0 91
14192 2019/02/03 02:26 0 92
14193 2019/02/03 02:41 0 94
14194 2019/02/03 03:00 1 100
14196 2019/02/03 02:56 0 95
14197 2019/02/03 03:11 0 97
14198 2019/02/03 03:26 0 103
14199 2019/02/03 03:41 0 108
14200 2019/02/03 03:56 0 111
14201 2019/02/03 04:11 0 112
14202 2019/02/03 04:26 0 114
14203 2019/02/03 04:41 0 113
14204 2019/02/03 04:56 0 115
14205 2019/02/03 05:11 0 118
14206 2019/02/03 05:26 0 119
14207 2019/02/03 05:41 0 121
14208 2019/02/03 06:01 4 50.0
14208 2019/02/03 06:01 1 120
14210 2019/02/03 05:56 0 123
14211 2019/02/03 06:11 0 122
14212 2019/02/03 06:26 0 119
14213 2019/02/03 06:41 0 113
14214 2019/02/03 06:56 0 108
14215 2019/02/03 07:11 0 106
14216 2019/02/03 07:26 0 104
14217 2019/02/03 07:41 0 101
14218 2019/02/03 07:56 0 99
14219 2019/02/03 08:20 1 107 Food2
14221 2019/02/03 08:11 0 100
14222 2019/02/03 08:26 0 109
14223 2019/02/03 08:41 0 141
14224 2019/02/03 08:56 0 174
14225 2019/02/03 09:11 0 177
14226 2019/02/03 09:26 0 168
14227 2019/02/03 09:41 0 155
14228 2019/02/03 09:56 0 148
14229 2019/02/03 10:11 0 149
14230 2019/02/03 10:26 0 149
14231 2019/02/03 10:41 0 147
14232 2019/02/03 10:56 0 149
14233 2019/02/03 11:16 1 153 3 hour after
14235 2019/02/03 11:11 0 150
14236 2019/02/03 11:26 0 148
14237 2019/02/03 11:41 0 155
14238 2019/02/03 11:56 0 160
14239 2019/02/03 12:11 0 157
14240 2019/02/03 12:26 0 152
14241 2019/02/03 12:41 0 151
14242 2019/02/03 12:56 0 163
14243 2019/02/03 13:11 0 190
14244 2019/02/03 13:26 0 209
14245 2019/02/03 13:41 0 211
14246 2019/02/03 13:56 0 210
14247 2019/02/03 14:11 0 213
14248 2019/02/03 14:26 0 212
14249 2019/02/03 14:41 0 208
14250 2019/02/03 14:56 0 202
14251 2019/02/03 15:11 0 191
14252 2019/02/03 15:26 0 179
14253 2019/02/03 15:49 1 168
14255 2019/02/03 20:25 0 156
14256 2019/02/03 20:40 0 155
14257 2019/02/03 20:55 0 157
14258 2019/02/03 21:10 0 157
14259 2019/02/03 21:25 0 151
14260 2019/02/03 21:40 0 155
14261 2019/02/03 21:55 0 169
14262 2019/02/03 22:10 0 176
14263 2019/02/03 22:25 0 164
14264 2019/02/03 22:40 0 157
14265 2019/02/03 22:55 0 156
14266 2019/02/03 23:10 0 162
14267 2019/02/03 23:25 0 171
14268 2019/02/03 23:40 0 172
14269 2019/02/03 23:55 0 170

RE: Information how pandas works - scidam - Feb-07-2019

There are 1440 minutes in a day. The number of rows required for loading is 1440 * 28 / 15 = 2688. For sureness, you can consider 2784 = 1440*29/15 rows. This is not significantly differed from the entire number of rows (13k). So, using .read_csv and .tail is efficient way to load the data.
If you had considerably huge file, e.g. measurements accumulated for 10 years or more, you could try
to use approach suggested above: seek to the end of a file, count 2784 occurrences of \n-symbol from the end of it, and start file reading from the determined position.

RE: Information how pandas works - gehrenfeld - Feb-07-2019

I have no idea what you're trying to say.

What do 1440 minutes have to do with loading in data from a CSV file?

The data is from a meter reading blood data for the amount of glucose in the blood. I can't just pick random numbers from the list. I need all of the readings in the last 28 days to calculate the A1c. Because the file was just uploaded there is no way of knowing how many rows where added from the last upload.

I don't think tail will work.

RE: Information how pandas works - scidam - Feb-07-2019

The main idea was to determine (roughly) the number of rows to be read from the end of the file.
Since we know interval of measurements = 15 min, and the number of days = 28, we can expect that
last 1440*28/15 rows of the file include measurements for last 28 days. So, we don't need to load entire
file. It is sufficient to load last 1440*28/15 rows from it. This is preliminary filtering. From
these rows we can build a data frame. Further, we need to do precise filtering, using pandas, as you do in your code.

You told about 28 days, but in the following line of code, you take data for 14 days. Is it an error?

df = df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("14day")]

RE: Information how pandas works - gehrenfeld - Feb-08-2019

28 days needed for calculation and 14 days for the chart.

It can't determine roughly the numbers. I use the calculation to adjust insulin shots, so I need to be as precise as I can.

I have to include the meter readings and my finger stick readings (which the number of sticks very every day). Line 23 of the code does that.

I now understand what you are telling me!!!

That is a good idea for cutting down the number of lines to read.

One of the problems I have with using

datetime.datetime.now()

is it returns the date and current time, so when I run

df = df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("14day")]

it uses the time that I run the code.

How can I get just the date or the date with the time of 00:00:00, so it will start at midnight.