Information how pandas works - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Data Science (https://python-forum.io/forum-44.html) +--- Thread: Information how pandas works (/thread-15879.html) Pages:
1
2
|
Information how pandas works - gehrenfeld - Feb-04-2019 Here is my code and everything works fine. There are over 12000 lines in the CSV file. My question is, does pandas read in the entire file before anything below the line runs? Is there any way to speed this script up? I only really use the last 28 days of the file. import pandas as pd import datetime import wx # --------------------------------- read from csv --------------------------- df = pd.read_csv('summary.csv', delimiter='\t', skiprows=3, header=None, names=[ 'id', 'rdate', 'rtype', 'hist', 'scan'], usecols=[0, 1, 2, 3, 4]) # ---------- Fill in 0 --------------------------- df['hist'] = df['hist'].fillna(0) df['scan'] = df['scan'].fillna(0) # -------------- Change dtypes ------------------- df['rdate'] = df['rdate'].astype('datetime64') df['hist'] = df['hist'].astype('int64') df['scan'] = df['scan'].astype('int64') # -------------------------------- move value from scan to hist --------------------------------- df.loc[df['hist'] == 0, 'hist'] = df['scan'] # --------------------------- Select from DataFrame -------------- sd = (df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("28day")]) sr = (sd.loc[sd['rtype'] == 0]) # ---------------------------- Do calculations ------------------------------- average = round(sr['hist'].mean()) a1c = round((46.7 + sr['hist'].mean() + sr['scan'].mean()) / 28.7, 1) # -----------------Display in wx.Frame ---------------------------------- class Example(wx.Frame): def __init__(self, *args, **kw): super(Example, self).__init__(*args, **kw) self.SetBackgroundColour('#85C1E9') self.InitUI() def InitUI(self): pnl = wx.Panel(self) font = wx.Font(16, wx.FONTFAMILY_SCRIPT, wx.FONTSTYLE_NORMAL, wx.FONTWEIGHT_BOLD) font2 = wx.Font(12, wx.FONTFAMILY_SCRIPT, wx.FONTSTYLE_NORMAL, wx.FONTWEIGHT_NORMAL) font3 = wx.Font(11, wx.FONTFAMILY_SCRIPT, wx.FONTSTYLE_NORMAL, wx.FONTWEIGHT_NORMAL) heading = wx.StaticText(pnl, label='Average Glucose and A1c', pos=(50, 15), size=(200, -1)) heading.SetFont(font) st3 = wx.StaticText(pnl, label=sr.rdate.max().strftime('From: %b %d %y') + ' - ' + str(sr.rdate.min().strftime('To: %b %d %y')), pos=(70, 45)) wx.StaticLine(pnl, pos=(6, 65), size=(328, 1)) st1 = wx.StaticText(pnl, label='Average: ' + str("%.0f" % average), pos=(125, 80)) st2 = wx.StaticText(pnl, label='A1c: ' + str(a1c), pos=(125, 100)) st1.SetFont(font2) st2.SetFont(font2) st3.SetFont(font3) btn = wx.Button(pnl, label='Close', pos=(140, 150)) btn.Bind(wx.EVT_BUTTON, self.OnClose) self.SetSize((360, 230)) self.SetTitle('A1c') self.Centre() def OnClose(self, e): self.Close(True) def main(): app = wx.App() ex = Example(None) ex.Show() app.MainLoop() if __name__ == '__main__': main() RE: Information how pandas works - stullis - Feb-05-2019 The read_csv() function will read the full file, yes. It returns a DataFrame object which has a method called query(), which returns a new DataFrame. With query(), you can abbreviate your DataFrame down to only those rows that satisfy the query expression. Here's a short blog about using dates to filter data in Pandas: https://erikrood.com/Python_References/select_pandas_dataframe_rows_between_two_dates_final.html You'll need a few lines at the top to determine your date range and then cut your DataFrame down to size. RE: Information how pandas works - scidam - Feb-05-2019 I am not sure if this speed up reading the file in your case, but you can try to read last (e.g. 28) lines of it using this approach. Further, you will need to pass these lines to Pandas to construct a data frame. The latter could be done by this way. RE: Information how pandas works - gehrenfeld - Feb-05-2019 Thank you for the suggestions. There are 18 columns in the CSV files so I am only using 4 of them. I should have clarified that. I didn't think of moving line 22 df = (df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("28day")]) and move it under line 6 that will cut my dataframe down. Something I should have said also is that there are multiple dates that are the same. I measure my glucose 8 to 10 times a day so I can't just read the last 28 with tail(). Thank you for the help. Update - I rearranged the code and removed the sd and sr dataframes and just used df for queries. Seems to run a little faster. import pandas as pd import datetime import wx # --------------------------------- read from csv --------------------------- df = pd.read_csv('summary.csv', delimiter='\t', skiprows=3, header=None, names=[ 'id', 'rdate', 'rtype', 'hist', 'scan'], usecols=[0, 1, 2, 3, 4]) # ---------- Fill in 0 --------------------------- df['hist'] = df['hist'].fillna(0) df['scan'] = df['scan'].fillna(0) # -------------- Change dtypes ------------------- df['rdate'] = df['rdate'].astype('datetime64') df['hist'] = df['hist'].astype('int64') df['scan'] = df['scan'].astype('int64') # --------------------------- Select from DataFrame -------------- df = df.loc[df['rtype'] == 0] df = df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("14day")] # -------------------------------- move value from scan to hist --------------------------------- df.loc[df['hist'] == 0, 'hist'] = df['scan'] # ---------------------------- Do calculations ------------------------------- average = round(df['hist'].mean()) a1c = round((46.7 + df['hist'].mean() + df['scan'].mean()) / 28.7, 1) print(df) RE: Information how pandas works - scidam - Feb-06-2019 (Feb-05-2019, 12:29 PM)gehrenfeld Wrote: I measure my glucose 8 to 10 times a day so I can't just read the last 28 with tail() I knew about .tail . The .tail method is not a solution. It works when the data is loaded and couldn't beused to speedup loading of a huge file. So, you have 8 to 10 measurements per day. Each measurement is a row in the source file. Therefore, you need to read last 10 * 28 lines. Moreover, 11*28 lines will be definitely sufficient. When these 11*28 lines will be loaded into a data frame, you can do precise filtering using pandas tools. This is general approach. But 12k rows ... this is not too big. RE: Information how pandas works - gehrenfeld - Feb-06-2019 scidam, Thanks for the suggestions. I have added some of the rows in the CSV file. Because the calculation for A1c and for the line plot, I have to have the full last 28 days of readings. The meter itself takes a reading every 15 mins so more than just finger sticks. The file begins on 2019-07-23 and ends at the current date and it grows every day. Currently, there are 12949. The company that makes the meter and software only download the entire file. I have to go through the entire file to get the last 28 days of readings I don't see any way to make it faster than it is. Quote:14183 2019/02/03 00:11 0 88 RE: Information how pandas works - scidam - Feb-07-2019 There are 1440 minutes in a day. The number of rows required for loading is 1440 * 28 / 15 = 2688. For sureness, you can consider 2784 = 1440*29/15 rows. This is not significantly differed from the entire number of rows (13k). So, using .read_csv and .tail is efficient way to load the data. If you had considerably huge file, e.g. measurements accumulated for 10 years or more, you could try to use approach suggested above: seek to the end of a file, count 2784 occurrences of \n -symbol from the end of it, and start file reading from the determined position.
RE: Information how pandas works - gehrenfeld - Feb-07-2019 I have no idea what you're trying to say. What do 1440 minutes have to do with loading in data from a CSV file? The data is from a meter reading blood data for the amount of glucose in the blood. I can't just pick random numbers from the list. I need all of the readings in the last 28 days to calculate the A1c. Because the file was just uploaded there is no way of knowing how many rows where added from the last upload. I don't think tail will work. RE: Information how pandas works - scidam - Feb-07-2019 The main idea was to determine (roughly) the number of rows to be read from the end of the file. Since we know interval of measurements = 15 min, and the number of days = 28, we can expect that last 1440*28/15 rows of the file include measurements for last 28 days. So, we don't need to load entire file. It is sufficient to load last 1440*28/15 rows from it. This is preliminary filtering. From these rows we can build a data frame. Further, we need to do precise filtering, using pandas, as you do in your code. You told about 28 days, but in the following line of code, you take data for 14 days. Is it an error? df = df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("14day")] RE: Information how pandas works - gehrenfeld - Feb-08-2019 28 days needed for calculation and 14 days for the chart. It can't determine roughly the numbers. I use the calculation to adjust insulin shots, so I need to be as precise as I can. I have to include the meter readings and my finger stick readings (which the number of sticks very every day). Line 23 of the code does that. I now understand what you are telling me!!! That is a good idea for cutting down the number of lines to read. One of the problems I have with using datetime.datetime.now()is it returns the date and current time, so when I run df = df.loc[df['rdate'] > datetime.datetime.now() - pd.to_timedelta("14day")]it uses the time that I run the code. How can I get just the date or the date with the time of 00:00:00, so it will start at midnight. |