python one line file processing
Hi coders,

I have a file that stores alerts, and it only stores alerts generated today; alerts from before today have been archived to other files with a datestamp.

In this file, one line holds one alert. First I need to find alerts of type A; commands like grep will give me a lot of rows which belong to type A.
Then I need to check whether a row has a field named "srcip". If not, I just move on to the next row; if it does, I then need to find the "srcport" and "dstip" fields and store these three values.
Next I need to search for type B alerts. Type B also has a lot of rows, but there is a field called "timestamp": type A's timestamp should be within a few seconds of type B's, and if they are too far apart, they are not the same event and shouldn't be correlated.
If A's srcip, srcport, and dstip are the same as type B's, then it's a match, and I need to extract "dstport" from the type B alert.
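
For illustration, here is a minimal sketch of pulling those fields out of one line; the key=value format and the example line are assumptions, since the real alert format isn't shown:

import re

# Assumed format: alert lines contain key=value pairs such as
# "typeA timestamp=1588000000 srcip=1.2.3.4 srcport=443 dstip=5.6.7.8".
FIELD_RE = re.compile(r'(timestamp|srcip|srcport|dstip|dstport)=(\S+)')

def parse_alert(line):
    """Return a dict of whatever known fields appear on one alert line."""
    return dict(FIELD_RE.findall(line))

fields = parse_alert('typeA timestamp=1588000000 srcip=1.2.3.4 srcport=443 dstip=5.6.7.8')
if 'srcip' in fields:  # only rows that carry srcip are interesting
    print(fields['srcip'], fields['srcport'], fields['dstip'])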

The main question is: how do I know which rows have already been processed, so that I only search the new rows?
Your question isn't very clear. Say you find A(n), a type A alert. You want to find B(n), the matching type B alert. Is the file such that B(n) is going to appear in the file after A(n) but before A(n+1), the next type A alert? If so, this is easy: keep track of the last type A you found, and check it against any type B's you find, as in the sketch below.
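
A minimal sketch of that easy case, assuming the alerts are already parsed into dicts with a 'type' field (as in the earlier parsing example):

last_a = None  # the most recent qualifying type A alert

for alert in alerts:  # 'alerts' is an assumed iterable of parsed dicts
    if alert.get('type') == 'A' and 'srcip' in alert:
        last_a = alert  # remember it until the next type A appears
    elif alert.get('type') == 'B' and last_a is not None:
        same_flow = (alert.get('srcip'), alert.get('srcport'), alert.get('dstip')) == \
                    (last_a['srcip'], last_a['srcport'], last_a['dstip'])
        if same_flow:
            print('match, dstport =', alert.get('dstport'))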

If that is not the case, you need to keep track of all the type A's you find (that match your other criteria, of course). I would put them in a list, probably ordered by timestamp. If there's a match, put the match in the output, and remove the matching type A.

Depending on the data, I might use a dictionary. The key would be a tuple of (srcip, srcport, dstip); the value would be a list of the matching type A alerts. Then for a given type B, you could find all of the potentially matching type A's and check their timestamps.
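
A rough sketch of that dictionary approach, under the same parsing assumptions as before; the 5-second window is a placeholder for whatever "a few seconds apart" means in practice:

from collections import defaultdict

TIME_WINDOW = 5.0  # seconds; placeholder for the allowed A/B gap

pending_a = defaultdict(list)  # (srcip, srcport, dstip) -> list of type A alerts

for alert in alerts:  # parsed dicts, as assumed earlier
    key = (alert.get('srcip'), alert.get('srcport'), alert.get('dstip'))
    if alert.get('type') == 'A' and alert.get('srcip'):
        pending_a[key].append(alert)
    elif alert.get('type') == 'B':
        for a in pending_a.get(key, []):
            # assumes 'timestamp' is numeric, e.g. epoch seconds
            if abs(float(alert['timestamp']) - float(a['timestamp'])) <= TIME_WINDOW:
                print('correlated, dstport =', alert.get('dstport'))
                pending_a[key].remove(a)  # each A matches at most once
                break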
Aha, my main question is still how to know which rows have already been processed, so that I only search the new rows. Does Python have some library for this?
The first time, you can process the whole file and, after iterating over the lines, call the tell() method of the file object, which tells you where you are (at which byte). You can convert that integer to a str and write it to a file. Next time, the script looks for this file; if it is present, it should load the content, convert it back to an int, and call seek(position) on the file object before you start iterating over the lines. Then you continue from the position where your script finished last time.

In [20]: with open('birds.txt') as fd:
    ...:     for line in fd:
    ...:         print(line.strip())
    ...:     print(fd.tell())
    ...: #fd.tell() <- file is already closed
2010-01-01 01:01:00.0000 left
2010-11-01 01:01:00.0000 right
2010-10-01 01:01:00.0000 right
91
So, if a program now writes to birds.txt, the new content starts at byte position 91.
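
Putting that together, here is a minimal sketch that persists the offset between runs; the file names and the process() function are placeholders:

import os

ALERT_FILE = 'alerts.log'      # placeholder name for the alert file
OFFSET_FILE = 'alerts.offset'  # stores the byte position of the last run

# Load the saved position, if a previous run left one behind.
position = 0
if os.path.exists(OFFSET_FILE):
    with open(OFFSET_FILE) as fd:
        position = int(fd.read())

with open(ALERT_FILE) as fd:
    fd.seek(position)     # skip everything processed last time
    for line in fd:
        process(line)     # placeholder for your alert handling
    position = fd.tell()  # where this run stopped

with open(OFFSET_FILE, 'w') as fd:
    fd.write(str(position))

One caveat: since the alert file only holds today's alerts and older ones get archived, the saved offset should be reset (e.g. delete the offset file) whenever the log rotates; otherwise seek() would start past the beginning of the new file.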
Amazing! Thank you DeaD_EyE, this is what I need.