Python Forum

Full Version: help with project of reading and searching big log file
Hello,
I have a log file that grows to about 6 GB of text by the end of the day.
I want to be able to cut a certain time window out of it,
for example
from 08:00:00 until 08:15:00.

I have checked, and 15 minutes of log is around 1.5 million lines (1,500,000).
When I run the code in the morning, when the log file is less than 1 GB, everything works.

When I run the code at the end of the day (when the log is more than 5 GB),
it gets stuck, and sometimes I get a memory error on my computer.

When I try to search a later window (7:00pm-7:20pm), it can take more than 3 minutes before it gets stuck.

My question is:
what can I do to make this run better and faster?
Can Python handle this amount of data?

This is the function:
def FilterLogFile(StartDate, EndDate):

    StartDate = datetime.datetime.strptime(StartDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = datetime.datetime.strptime(EndDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = EndDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = StartDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = str(StartDate)
    EndDate = str(EndDate)
    print(StartDate)
    print(EndDate)
    count = 0
    StartLine = 0
    EndLine = 0
    FullLogFile = open('/home/pi/logs/java.txt', 'r')
    Lines = FullLogFile.readlines()  # this part takes too much time, when it doesn't get stuck with "Memory Error"
    FullLogFile.close()
    for line in Lines:
        count += 1
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
        if StartLine != 0 and EndLine != 0:
            break  # stop the scan once the wanted end time is found; no need to scan past it

    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
    with open(OutputFile, 'w') as f:
        for line in Lines:
            count += 1
            if StartLine <= count <= EndLine:
                f.write(line.strip() + "\r\n")
    
    
    return OutputFile
Thanks,
maybe to read
There are many things to do to improve the code. First steps:
  • remove the line with .readlines() and don't close the logfile.
  • use for line, count in enumerate(FullLogFile, 1): directly and don't increment count.
  • before the second pass, use FullLogFile.seek(0) to go back to the beginning of the file, then again use for line in FullLogFile:
  • Actually, you could do everything in a single pass.
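A minimal sketch of the single-pass idea, in case it helps. It is not the code from this thread: the function name copy_window, its parameters, and the sample timestamps are made up for illustration. It assumes, like the original function, that the exact start and end timestamp strings appear somewhere in the log. Lines are written unchanged rather than stripped and re-terminated.

```python
import os
import tempfile


def copy_window(src_path, dst_path, start_stamp, end_stamp):
    """Copy the lines from the first one containing start_stamp through the
    first one containing end_stamp. Returns the number of lines written."""
    written = 0
    copying = False
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # one line at a time; the file is never fully in memory
            if not copying and start_stamp in line:
                copying = True
            if copying:
                dst.write(line)
                written += 1
                if end_stamp in line:
                    break  # end of the window reached, stop reading
    return written


if __name__ == "__main__":
    # Tiny made-up log, only to show the call.
    folder = tempfile.mkdtemp()
    src = os.path.join(folder, "java.txt")
    dst = os.path.join(folder, "window.txt")
    with open(src, "w") as f:
        for minute in range(5):
            f.write("01/01/2024-08:%02d:00 message %d\n" % (minute, minute))
    print(copy_window(src, dst, "01/01/2024-08:01:00", "01/01/2024-08:03:00"))
```

Because the output is written while scanning, no line numbers and no second loop are needed. If the end timestamp never appears in the file, this sketch keeps copying until the end of the log, which may or may not be what you want.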
I understand up to this point:
 FullLogFile = open('C:\\Users\\David\\Desktop\\java.txt', 'r')
   # Lines = FullLogFile.readlines()
  #  FullLogFile.close()
    for line, count in enumerate(FullLogFile, 1):
        #count += 1
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
Can you explain this line?
before the second pass, use FullLogFile.seek(0) to go back to the beginning of the file, then again use for line in FullLogFile:
I couldn't understand what you meant.

Thank you,
NB: the indentation is wrong in the new code that you showed above.

When an open file is read, there is an internal cursor in the file object pointing to the 'current position' in the file, exactly like there is a current page when you are reading a book. When Python reads the next line, it does it from this current position. If you call
FullLogFile.seek(0)
the current position goes back to the beginning of the file and you can start reading lines again from there. This allows you to run the second for loop over the same file.
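Here is a small sketch of that cursor behaviour, using a throwaway temporary file (the file contents and variable names are only for illustration):

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".log")
with os.fdopen(fd, "w") as f:
    f.write("first\nsecond\nthird\n")

with open(path, "r") as log:
    first_pass = [line.strip() for line in log]   # reads until the end of the file
    leftover = log.readline()                     # empty string: the cursor is at the end
    log.seek(0)                                   # move the cursor back to the start
    second_pass = [line.strip() for line in log]  # the same lines can be read again

os.remove(path)
print(first_pass == second_pass)
```

Without the seek(0) call, the second loop would find nothing left to read, because the cursor would still be at the end of the file.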
OK,
now I (think I) understand the use of seek.

But why is the indentation wrong?

I am running the for loop until it finds the end time, then it writes the current text into my output file, no?

This is what I have now (if I understood you correctly):
 def FilterLogFile(StartDate, EndDate):

    StartDate = datetime.datetime.strptime(StartDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = datetime.datetime.strptime(EndDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = EndDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = StartDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = str(StartDate)
    EndDate = str(EndDate)
    print(StartDate)
    print(EndDate)
    count = 0
    StartLine = 0
    EndLine = 0

    FullLogFile = open('C:\\Users\\David\\Desktop\\java.txt', 'r')
    for line, count in enumerate(FullLogFile, 1):
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
        if StartLine != 0 and EndLine != 0:
            break  # stop the scan once the wanted end time is found
    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
    FullLogFile.seek(0) ## return to first line in the text file
    with open(OutputFile, 'w') as f:
        for line in FullLogFile:
            count += 1
            if StartLine <= count <= EndLine:
                f.write(line.strip() + "\r\n")
   
    return OutputFile
I get the error
argument of type 'int' is not iterable
when it enters the first loop.
Why is line an "int" and not a "string"?
Sorry, it should be for count, line instead of for line, count
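The reason is that enumerate yields (index, item) pairs, so the counter comes first. A small sketch with a made-up list standing in for the file:

```python
lines = ["alpha\n", "beta\n"]  # stand-in for the lines of the log file

# enumerate(iterable, 1) produces (counter, item) tuples, counting from 1.
pairs = list(enumerate(lines, 1))

numbered = []
for count, line in enumerate(lines, 1):  # correct order: counter first
    numbered.append("%d: %s" % (count, line.strip()))

# With the names swapped, 'line' is bound to the integer counter,
# so a test like "alpha" in line fails with a TypeError.
error_type = None
try:
    for line, count in enumerate(lines, 1):
        "alpha" in line
except TypeError as exc:
    error_type = type(exc)

print(pairs)
print(numbered)
print(error_type)
```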
OK, great!
Now it seems to be working faster and I don't get any memory error; the log file is ~2.1 GB.
I will wait until it is around 5 GB and see the result.

Thank you so much for the help so far.