Python Forum
help with project of reading and searching big log file
#1
Hello,
I have a log file that grows to about 6 GB of text by the end of the day.
I want to be able to cut a certain time window out of it,
for example
from 08:00:00 until 08:15:00.

I have checked, and 15 minutes is around 1.5 million lines (1,500,000).
When I run the code in the morning, when the log file is less than 1 GB, everything works.

When I run the code at the end of the day (when the log is more than 5 GB),
it gets stuck, and sometimes I get a MemoryError on my computer.

When I try to search a later window (7:00pm-7:20pm), it can take more than 3 minutes before it gets stuck.

My question is:
what can I do to make this run better and faster?
Can Python handle this amount of data?

This is the function:
def FilterLogFile(StartDate, EndDate):

    StartDate = datetime.datetime.strptime(StartDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = datetime.datetime.strptime(EndDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = EndDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = StartDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = str(StartDate)
    EndDate = str(EndDate)
    print(StartDate)
    print(EndDate)
    count = 0
    StartLine = 0
    EndLine = 0
    FullLogFile = open('/home/pi/logs/java.txt', 'r')
    Lines = FullLogFile.readlines()   ###------->>>> this part takes too much time, when it doesn't get stuck with "Memory Error"
    FullLogFile.close()
    for line in Lines:
        count += 1
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
        if StartLine != 0 and EndLine != 0:
            break  ## stop the scan once the wanted end time is reached; no need to scan past it

    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
    with open(OutputFile, 'w') as f:
        for line in Lines:
            count += 1
            if StartLine <= count <= EndLine:
                f.write(line.strip() + "\r\n")
    
    
    return OutputFile
Thanks,
maybe to read
Reply
#2
There are many things to do to improve the code. First steps:
  • remove the line with .readlines() and don't close the logfile.
  • use for line, count in enumerate(FullLogFile, 1): directly and don't increment count.
  • before the second pass, use FullLogFile.seek(0) to go back to the beginning of the file, then again use for line in FullLogFile:
  • Actually, you could do everything in a single pass.
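To illustrate the single-pass idea, here is a minimal sketch, not a drop-in replacement. The function name filter_log_single_pass and its parameters are made up for this example, and it assumes the start and end timestamps appear literally in the log lines, as in the original function. It copies lines unchanged instead of rewriting the line endings, and it uses in-memory streams so it can be tried without a real log file.

```python
import io

def filter_log_single_pass(src, dst, start_stamp, end_stamp):
    """Copy lines from the first one containing start_stamp through the
    first following one containing end_stamp, reading src only once.

    Returns (start_line, end_line); 0 means the stamp was not found.
    """
    start_line = end_line = 0
    for count, line in enumerate(src, 1):
        if start_line == 0:
            if start_stamp not in line:
                continue          # still before the window, skip the line
            start_line = count    # first line of the window
        dst.write(line)           # inside the window: copy immediately
        if end_stamp in line:
            end_line = count
            break                 # nothing after the window is needed
    return start_line, end_line

# Small in-memory demo; a real run would pass two open file objects.
log = io.StringIO(
    "01/01/2024-07:59:59 a\n"
    "01/01/2024-08:00:00 b\n"
    "01/01/2024-08:07:30 c\n"
    "01/01/2024-08:15:00 d\n"
    "01/01/2024-08:15:01 e\n"
)
out = io.StringIO()
result = filter_log_single_pass(log, out, "01/01/2024-08:00:00", "01/01/2024-08:15:00")
print(result)  # (2, 4)
```

Only one line is held in memory at a time, so the size of the log file should not matter for memory use.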
Reply
#3
I understand it up to here:
 FullLogFile = open('C:\\Users\\David\\Desktop\\java.txt', 'r')
   # Lines = FullLogFile.readlines()
  #  FullLogFile.close()
    for line, count in enumerate(FullLogFile, 1):
        #count += 1
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
Can you explain this line?
before the second pass, use FullLogFile.seek(0) to go back to the beginning of the file, then again use for line in FullLogFile:
I couldn't understand what you meant.

Thank you,
Reply
#4
NB: the indentation is wrong in the new code that you showed above.

When an open file is read, there is an internal cursor in the file object pointing to the 'current position' in the file, exactly like there is a current page when you are reading a book. When Python reads the next line, it does it from this current position. If you call
FullLogFile.seek(0)
the current position goes back to the beginning of the file and you can start reading lines again from there. This allows you to run the second for loop over the same file.
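A small demonstration of the cursor behaviour described above. It writes a throwaway three-line file (the file and its contents are invented for the example), reads it to the end, shows that nothing is left to read, and then rewinds with seek(0).

```python
import os
import tempfile

# Create a small temporary file to read from.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

with open(path, "r") as f:
    first_pass = [line.strip() for line in f]   # cursor is now at the end
    after_eof = f.readline()                    # nothing left: empty string
    f.seek(0)                                   # move the cursor back to the start
    second_pass = [line.strip() for line in f]  # the same lines again

os.remove(path)

print(first_pass)
print(repr(after_eof))
print(second_pass)
```

Without the seek(0) call, the second loop would produce no lines at all, because the cursor would still be at the end of the file.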
Reply
#5
OK,
now I (think I) understand the use of seek.

But why is the indentation wrong?

I am running the for loop until it finds the end time, and then it writes the current text into my output file, no?

This is what I have now (if I understand you correctly):
 def FilterLogFile(StartDate, EndDate):

    StartDate = datetime.datetime.strptime(StartDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = datetime.datetime.strptime(EndDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = EndDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = StartDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = str(StartDate)
    EndDate = str(EndDate)
    print(StartDate)
    print(EndDate)
    count = 0
    StartLine = 0
    EndLine = 0

    FullLogFile = open('C:\\Users\\David\\Desktop\\java.txt', 'r')
    for line, count in enumerate(FullLogFile, 1):
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
         if StartLine != 0 and EndLine != 0:
            break  ## to stop the scan when he get to the wanted end time
    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
    FullLogFile.seek(0) ## return to first line in the text file
    with open(OutputFile, 'w') as f:
        for line in FullLogFile:
            count += 1
            if StartLine <= count <= EndLine:
                f.write(line.strip() + "\r\n")
   
    return OutputFile
I get the error
argument of type 'int' is not iterable
when it enters the first loop.
Why is line an "int" and not a "string"?
Reply
#6
Sorry, it should be for count, line instead of for line, count
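A short illustration of why the order matters: enumerate yields (count, item) pairs, so the first name in the for statement receives the integer. The sample lines below are invented for the example.

```python
lines = ["08:00:00 start\n", "08:15:00 end\n"]

# enumerate(iterable, 1) yields (count, item) pairs, counting from 1.
pairs = list(enumerate(lines, 1))
print(pairs)

# Correct order: count first, then the line.
found = [count for count, line in enumerate(lines, 1) if "08:15:00" in line]
print(found)  # [2]

# Swapped order: 'line' is bound to the integer, so the 'in' test fails.
raised = False
try:
    for line, count in enumerate(lines, 1):
        "08:00:00" in line
except TypeError:
    raised = True
print(raised)  # True
```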
Reply
#7
OK - great!
Now it seems to be working faster, and I don't get any memory error; the log file is ~2.1 GB.
I will wait until it is around 5 GB and see the result.

Thank you so much for the help so far.
Reply

