Python Forum
Insert missing data in a dataframe
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Insert missing data in a dataframe
#1
Hello question about a time series again xd

I have a long time series, where I have data of rain intensity with a deltatime of 2 minutes, it does however not fill out the time where no rain was detected. I need both as I use the data to be able to calculate the amount of water stored in a bassin, and the period where the amount of rain is equal to zero is therefore as important. I am not sure how to attack this. One idea I had was to see if the datetime before was different from the next datetime+ a timestep of two minutes:
[ Time ] [intensity]
1997-07-01 12:22:00 , 1
1997-07-01 12:28:00 , 1
1997-07-01 12:30:00 , 1
as the difference between the first time and the next is larger than 2 the output should be:
[ Time ] [intensity]
1997-07-01 12:22:00 , 1
1997-07-01 12:24:00 , 0 (inserted timestep)
1997-07-01 12:26:00 , 0 (inserted timestep)
1997-07-01 12:28:00 , 1
1997-07-01 12:30:00 , 1
I cant however not figure out how to do this in practice..
Another way I thought I could do it would be to construct a time series of the same length and then insert zero intensities, and merch the two time series
The problem with that is however that the data is registered at even and uneven times, and this method would result in cases where the delta time would become 1 minute instead of two minutes:

[ Time ] [intensity]
1997-07-01 12:22:00 , 1
1997-07-01 12:24:00 , 0 (inserted timestep)
1997-07-01 12:26:00 , 0 (inserted timestep)
1997-07-01 12:28:00 , 1
1997-07-01 12:30:00 , 0 (inserted timestep)
1997-07-01 12:31:00 , 1
1997-07-01 12:32:00 , 0 (inserted timestep)
1997-07-01 12:33:00 , 1
1997-07-01 12:34:00 , 0 (inserted timestep)
1997-07-01 12:35:00 , 1

As it can be seen here, the times inserted from 12:32 should not be inserted as the timestep should be 2 minutes.
If anyone has a way to approach this, it would be greatly appreciated
Looking forward to hearing from you!

Attached Files

.csv   example.csv (Size: 22.66 KB / Downloads: 206)
Reply
#2
Perhaps I am not the right person to answer your question. Perhaps better tools can be found in pandas, but if everything else fails, you could solve it with elementary python. To calculate with time you need the datetime module with datetime and timedelta. Datetime stores a timestamp in an object. With timedelta you can calculate. So you can make a function to convert a datetime string to a datetimestring two minutes later. Like this.
from datetime import datetime, timedelta

dtformat = "%Y-%m-%d %H:%M:%S"
twominutes = timedelta(minutes=2)

def computenexttime(dtstring: str) -> str:
    """ Input is a string in datetime format, output is the next time in string format. """
    # Make datetime object.
    datetimeobj = datetime.strptime(dtstring, dtformat)
    # Compute next datetime.
    nextdtobj = datetimeobj + twominutes
    # Return next datetime as a string.
    return nextdtobj.strftime(dtformat)
With this function you can check if a line is two minutes later than the previous. Is this enough to handle the problem? Please show us what you made of it.
Reply
#3
No code is posted, so neither do I, only provide idea: if time is index, then create new dataframe with step on 2 minutes for needed duration and value of 0. Then just add two columns, those rows which are only in new dataframe will keep value 0 and those which have value 1 in initial dataframe will have that value (1 + 0 = 1).
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#4
Thanks for the replies
it is not that i am lazy, im just retarded
But the code so far is

import datetime
import os
from datetime import timedelta
from pathlib import Path

import pandas as pd


# dt=datetime.timedelta(minutes=2)
def convert_file():
    # Set start directory same as script
    os.chdir(os.path.abspath(os.path.dirname(__file__)))

    infile = Path('.') / 'gauga20211_19790101-20120101.km2'
    outfile = Path('.') / 'newfile.csv'

    with infile.open() as fp, outfile.open('w') as fout:
        startdate = None
        # starttime = None
        nexttime = 0
        dt = 0
        for line in fp:
            line = line.strip().split()
            # extract header
            if line[0] == '1':
                startdate = pd.to_datetime(line[1]+line[2], format='%Y%m%d%H%M')
                nexttime = startdate + datetime.timedelta(minutes=2)
            else:
                for item in line:
                    # Slet dt og heae
                    data = f"{nexttime},{item}\n"
                    fout.write(data)
                    nexttime += timedelta(minutes=2)


if __name__ == '__main__':
    convert_file()

data = pd.read_csv('newfile.csv')

data.to_csv('newfile1.csv', header=['time', 'intensity'], index=False, sep=',')
As the original file is constructed as
1 19790111 1007 20211 43 1 2.8 Header with start time for rain event
3.333 3.333 3.333 0.556 0.556 0.556 0.556 0.556 0.556 3.333 rain intensities measured with a interval of two minutes
0.370 0.370 0.370 0.370 0.370 0.370 0.370 0.370 0.370 0.476
0.476 0.476 0.476 0.476 0.476 0.476 6.667 1.667 1.667 6.667
0.667 0.667 0.667 0.667 0.667 0.417 0.417 0.417 0.417 0.417
0.417 0.417 0.417
1 19790125 1208 20211 30 1 3.0 1 Header with start time for new rain event
3.333 0.833 0.833 0.833 0.833 3.333 1.667 1.667 1.667 1.667
1.111 1.111 1.111 1.667 1.667 0.833 0.833 0.833 4.167 3.333
3.333 3.333 3.333 1.111 1.111 1.111 0.833 0.833 0.833 0.833

Larz60+ was a great help with that script!
But the problem still remains, that I need a way to insert the times where it does not rain. My thought was then the create a file with a time interval of 1 minute, and insert rain intensities of 0:
df1 = pd.read_csv('newfile1.csv')
d=df1.time
timerange= pd.date_range(first_line, periods=minutes, freq='1min')
df2=pd.DataFrame()
df2['time']= timerange
df2['intensity']='0'
df2.to_csv('zeroserie.csv', sep=',', index=False, header=['time', 'intensity'])
This script gives me:
Output:
time,intensity 1979-03-23 09:11:00,0 1979-03-23 09:12:00,0 1979-03-23 09:13:00,0 1979-03-23 09:14:00,0 .......
I then want to merge this file with the old file, and remove duplicates:
df3=pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)

df3.to_csv('Newfile25.csv',sep=',', index=False, header=['time', 'intensity'])
This does however not work as intended...
The next step would then be to delete lines with rain intensities equal to 0 if these were within a time of 2< minutes of rain intensities >0

As the time can be both even and uneven, im not sure how to do it.. :/

Attached Files

.txt   gauga20211_19790101-20120101.txt (Size: 1.03 KB / Downloads: 78)
Reply
#5
(Jan-17-2022, 11:00 AM)perfringo Wrote: No code is posted, so neither do I, only provide idea: if time is index, then create new dataframe with step on 2 minutes for needed duration and value of 0. Then just add two columns, those rows which are only in new dataframe will keep value 0 and those which have value 1 in initial dataframe will have that value (1 + 0 = 1).

Implementation of idea above (I have little idea whether it address actual problem):

Creating sample data - three datetimes as indices with value of 1:

import pandas as pd

data = dict.fromkeys(('2006-09-17 12:49:00', '2006-09-17 12:57:00', '2006-09-17 13:01:00'), 1)
df = pd.DataFrame.from_dict(data, orient='index')
print(df)
Output:
0 2006-09-17 12:49:00 1 2006-09-17 12:57:00 1 2006-09-17 13:01:00 1
Create dataframe with all needed datetimes with interval of 2 minutes and values of 0:

import numpy as np

index = pd.date_range('2006-09-17 12:45:00', periods=10, freq='2T')
df_2 = pd.DataFrame(np.zeros(10), index=index)
print(df_2)
Output:
0 2006-09-17 12:45:00 0 2006-09-17 12:47:00 0 2006-09-17 12:49:00 0 2006-09-17 12:51:00 0 2006-09-17 12:53:00 0 2006-09-17 12:55:00 0 2006-09-17 12:57:00 0 2006-09-17 12:59:00 0 2006-09-17 13:01:00 0 2006-09-17 13:03:00 0
Now I just add these two dataframes. Addition will be indices based, just fill NaN-s with zeros:

new = (df_2 + df).fillna(0).astype(int)
print(new)
Output:
0 2006-09-17 12:45:00 0 2006-09-17 12:47:00 0 2006-09-17 12:49:00 1 2006-09-17 12:51:00 0 2006-09-17 12:53:00 0 2006-09-17 12:55:00 0 2006-09-17 12:57:00 1 2006-09-17 12:59:00 0 2006-09-17 13:01:00 1 2006-09-17 13:03:00 0
Now I have dataframe which have two minutes interval and have value 1 from source data and 0 where particular datetime missing in source data.
amdi40 likes this post
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#6
Thank you for the answer!
This would work if the start time of a rain event always were an uneven number, the problem is that it can start at both even and uneven times, which is why I was not able to merge a file with 0 intensities, as that would result in
Output:
start of rain event 1 2006-09-17 12:45:00 1 2006-09-17 12:47:00 1 2006-09-17 12:49:00 1 2006-09-17 12:51:00 1 2006-09-17 12:53:00 0 Start of rain event 2 2006-09-17 12:44:00 1 2006-09-17 12:45:00 0 2006-09-17 12:46:00 1 2006-09-17 12:47:00 0 2006-09-17 12:48:00 1 2006-09-17 12:49:00 0 2006-09-17 12:50:00 1 2006-09-17 12:51:00 0
as it can be seen, nan values would be inserted in the wrong places too Cry
Reply
#7
(Jan-19-2022, 07:47 AM)amdi40 Wrote: as it can be seen, nan values would be inserted in the wrong places too

Nan values will be inserted if indices are not in both dataframes. In order to tackle the problem indices in second dataframe (with zero values) must be created differently so that (1) all datetimes in source data are present in second dataframe (2) create indices of second dataframe so that two minutes interval resets based on value on previous and next row in source data
amdi40 likes this post
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#8
(Jan-19-2022, 08:51 AM)perfringo Wrote:
(Jan-19-2022, 07:47 AM)amdi40 Wrote: as it can be seen, nan values would be inserted in the wrong places too

Nan values will be inserted if indices are not in both dataframes. In order to tackle the problem indices in second dataframe (with zero values) must be created differently so that (1) all datetimes in source data are present in second dataframe (2) create indices of second dataframe so that two minutes interval resets based on value on previous and next row in source data

Thank you for the answer again! i will try to see if i can figure out how to do that :)
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020