Apr-04-2020, 02:42 PM

Hello everybody,

This is my first post in a forum. If something is missing, or if this is not the right place to ask, please let me know. Sorry for the long text, but I think it helps to understand my problem.

I’m currently working on my master’s thesis in information technology. Here is a short description of my task and what I’m looking for, because unfortunately I don’t have much experience with statistics, machine learning, or data mining. I decided to use Python for this task (because it is free and open source).

Background information:

I have various diagnostic events (each with a specific ID). Each event has one or more points in time at which it goes active and at which it goes passive again. These data are saved to a CSV file. Each row is a different diagnostic event, and the columns stand for consecutive seconds. The absolute start time is not important for now, but it is important that all events start at the same time. The CSV files differ in size, but they typically have about 300,000 columns and 50 rows.

Since this number of columns is very large, I first used test data that I created myself (6 rows and about 600 columns). Here is a short excerpt (the first column is the event ID; the remaining columns show whether the event is active (1) or passive (-1) at that second):

Event ID s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17

23456 -1 -1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

34567 -1 -1 -1 -1 -1 -1 -1 1 1 1 1 1 1 1 1 1 1

45678 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 1 1 1 1
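
To make the excerpt above easier to reason about, here is a minimal sketch of how each row could be reduced to the seconds at which the event goes active. It assumes NumPy is available; the function name `activation_onsets` and the choice to treat an event that is already active in the first column as an onset at s1 are my own assumptions, not something given by the data format.

```python
import numpy as np

# Toy data copied from the excerpt: one row per event, one column per second.
event_ids = [23456, 34567, 45678]
states = np.array([
    [-1] * 3 + [1] * 14,
    [-1] * 7 + [1] * 10,
    [-1] * 12 + [1] * 5,
])

def activation_onsets(row):
    """Return the 1-based seconds at which a row switches from -1 to 1."""
    row = np.asarray(row)
    # A rising edge is a passive sample directly followed by an active one.
    rising = (row[:-1] == -1) & (row[1:] == 1)
    # +1 because the edge belongs to the later sample, +1 for 1-based seconds.
    onsets = np.flatnonzero(rising) + 2
    # Assumption: an event already active in the first column starts at s1.
    if row[0] == 1:
        onsets = np.insert(onsets, 0, 1)
    return onsets.tolist()

onsets_by_event = {
    event_id: activation_onsets(row)
    for event_id, row in zip(event_ids, states)
}
print(onsets_by_event)
```

With about 50 rows, this turns a matrix of 300,000 columns into a handful of onset times per event, which is much smaller to work with.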

My task:

My task is to find subsequent-error (Folgefehler) relationships between these diagnostic events. If one event goes active and another event goes active within about 5 seconds (maybe up to 60 seconds) afterwards, I assume that the second event depends on the first. This is what I want to find out. Later I should also be able to state the probability with which the two events occur together.
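
To illustrate what I mean by such a relationship, here is a rough sketch in plain Python. It is only one possible reading of the task: for every ordered pair of events it counts how often an activation of the first event is followed by an activation of the second within a time window, and reports that fraction as an estimated probability. The function name, the `max_delay` parameter, the exclusion of simultaneous activations, and the activation times are all made up for illustration.

```python
from itertools import permutations

def follow_up_probability(onsets_a, onsets_b, max_delay=5):
    """Fraction of activations of A that are followed by an activation
    of B within 1..max_delay seconds (simultaneous onsets do not count)."""
    if not onsets_a:
        return 0.0
    hits = sum(
        any(0 < b - a <= max_delay for b in onsets_b)
        for a in onsets_a
    )
    return hits / len(onsets_a)

# Hypothetical activation times in seconds, one list per event.
onsets = {
    "A": [4, 100, 300],
    "B": [8, 104, 500],
    "C": [13, 250],
}

# Estimated probability for every ordered pair (first event, follow-up event).
table = {
    (a, b): follow_up_probability(onsets[a], onsets[b], max_delay=5)
    for a, b in permutations(onsets, 2)
}

for pair, prob in sorted(table.items()):
    print(pair, round(prob, 2))
```

The result is not symmetric: B follows A in two of three cases here, while A never follows B. That direction is exactly what I am interested in.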

I tried several methods, but none of them worked for me. First I calculated a correlation matrix. I got a result, but it doesn’t look the way I expected. Then I tried cluster analysis. That seemed better, but the algorithm groups events that shouldn’t be in the same cluster. For example, if one event goes active at second 100 for 10 seconds and another event goes active at second 250,000 for 10 seconds, the algorithm considers them similar: both are inactive most of the time, so the distance between them is small. But this tells me nothing about subsequent errors. Then I tried to reduce the dimensionality of my data with PCA (principal component analysis). I think that works, but I don’t know how to proceed from there.
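
The clustering problem can be reproduced with a small sketch, assuming NumPy and a simple distance defined as the fraction of seconds in which two rows differ. This distance is only a stand-in for whatever the clustering algorithm uses internally, and event `c` is a hypothetical follow-up event that I added for comparison.

```python
import numpy as np

n_seconds = 300_000

# Three events that are passive (-1) except for one 10-second active phase.
a = -np.ones(n_seconds, dtype=int)
b = -np.ones(n_seconds, dtype=int)
c = -np.ones(n_seconds, dtype=int)
a[100:110] = 1            # active from second 100 for 10 seconds
b[250_000:250_010] = 1    # active from second 250,000 for 10 seconds
c[103:113] = 1            # hypothetical follow-up, 3 seconds after a

def fraction_differing(x, y):
    """Share of seconds in which the two rows have different states."""
    return float(np.mean(x != y))

d_ab = fraction_differing(a, b)  # unrelated events
d_ac = fraction_differing(a, c)  # plausible subsequent error

print(f"a vs b: {d_ab:.6f}")
print(f"a vs c: {d_ac:.6f}")
```

Both distances are far below 0.001, so the unrelated pair looks almost as similar as the pair with a real delay relationship. The long shared passive phases dominate the comparison.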

I also read about association mining, but I think the problem there could be that only events which go active at the same time are detected, whereas I am looking for events that follow with a delay.

Does anybody have an idea how I can find these subsequent-error relationships?
