Python Forum
help with project of reading and searching big log file
#1
Hello,
I have a log file that grows to about 6 GB of text by the end of the day.
I want to cut a specific time window out of it,
for example
from 08:00:00 until 08:15:00.

I have checked, and 15 minutes is around 1.5 million lines (1,500,000).
When I run the code in the morning, when the log file is less than 1 GB, everything works.

When I run the code at the end of the day (when the log is more than 5 GB),
it gets stuck, and sometimes I get a MemoryError on my computer.

When I try to search a later window (7:00pm-7:20pm), it can take more than 3 minutes before it gets stuck.

My question is:
what can I do to make this run better and faster?
Can Python handle this amount of data?

This is the function:
def FilterLogFile(StartDate, EndDate):

    StartDate = datetime.datetime.strptime(StartDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = datetime.datetime.strptime(EndDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = EndDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = StartDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = str(StartDate)
    EndDate = str(EndDate)
    print(StartDate)
    print(EndDate)
    count = 0
    StartLine = 0
    EndLine = 0
    FullLogFile = open('/home/pi/logs/java.txt', 'r')
    Lines = FullLogFile.readlines()   ###------->>>> this part takes too much time, when it doesn't get stuck with "Memory Error"
    FullLogFile.close()
    for line in Lines:
        count += 1
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
        if StartLine != 0 and EndLine != 0:
            break  ## stop the scan once the wanted end time is reached; no need to scan past it

    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
    with open(OutputFile, 'w') as f:
        for line in Lines:
            count += 1
            if StartLine <= count <= EndLine:
                f.write(line.strip() + "\r\n")
    
    
    return OutputFile
Thanks,
maybe to read
Reply
#2
There are many things to do to improve the code. First steps:
  • remove the line with .readlines() and don't close the logfile.
  • use for line, count in enumerate(FullLogFile, 1): directly and don't increment count.
  • before the second pass, use FullLogFile.seek(0) to go back to the beginning of the file, then again use for line in FullLogFile:
  • Actually, you could do everything in a single pass.
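The single-pass idea from the last bullet could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name filter_log and its parameters are made up for the example, the timestamps are matched as plain substrings exactly as in the original function, and the end stamp is only looked for after the start stamp has been seen. An in-memory io.StringIO stands in for the real log and output files.

```python
import io

def filter_log(src, dst, start_stamp, end_stamp):
    """Copy lines from the first one containing start_stamp through the
    first later one containing end_stamp. Returns the number of lines written.
    (Hypothetical helper, not part of the original code.)"""
    copying = False
    written = 0
    for line in src:              # a file object yields one line at a time
        if not copying and start_stamp in line:
            copying = True        # start of the wanted window
        if copying:
            dst.write(line)
            written += 1
            if end_stamp in line:
                break             # nothing after the window is needed
    return written

# Small stand-in for the real log file.
log = io.StringIO(
    "01/01/2021-07:59:59 a\n"
    "01/01/2021-08:00:00 b\n"
    "01/01/2021-08:07:30 c\n"
    "01/01/2021-08:15:00 d\n"
    "01/01/2021-08:15:01 e\n"
)
out = io.StringIO()
n = filter_log(log, out, "01/01/2021-08:00:00", "01/01/2021-08:15:00")
print(n)
print(out.getvalue(), end="")
```

With real files, src and dst would be the objects returned by open(); only one line is held in memory at a time, so the size of the log should not matter for memory use.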
Reply
#3
I understand it up to here:
 FullLogFile = open('C:\\Users\\David\\Desktop\\java.txt', 'r')
   # Lines = FullLogFile.readlines()
  #  FullLogFile.close()
    for line, count in enumerate(FullLogFile, 1):
        #count += 1
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
Can you explain this line?
"before the second pass, use FullLogFile.seek(0) to go back to the beginning of the file, then again use for line in FullLogFile:"
I couldn't understand what you meant.

Thank you,
Reply
#4
NB: the indentation is wrong in the new code that you showed above.

When an open file is read, there is an internal cursor in the file object pointing to the 'current position' in the file, exactly like there is a current page when you are reading a book. When Python reads the next line, it does it from this current position. If you call
FullLogFile.seek(0)
the current position goes back to the beginning of the file and you can start reading lines again from there. This allows you to run the second for loop over the same file.
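A small demonstration of the cursor behaviour described above. This is just a sketch: io.StringIO is used as a stand-in for a real file opened with open(), and the variable names are made up for the example.

```python
import io

f = io.StringIO("first\nsecond\nthird\n")

# First pass: reading moves the cursor to the end of the file.
first_pass = [line.strip() for line in f]

# Without seek(0) there is nothing left to read.
exhausted = [line.strip() for line in f]

# seek(0) moves the cursor back to the beginning.
f.seek(0)
second_pass = [line.strip() for line in f]

print(first_pass)
print(exhausted)
print(second_pass)
```

The middle list comes out empty because the cursor is already at the end; after seek(0) the same lines can be read again.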
Reply
#5
OK,
now I (think I) understand the use of seek.

But why is the indentation wrong?

I run the for loop until it finds the end time, then it writes the current text into my output file, no?

This is what I have now (if I understood you correctly):
def FilterLogFile(StartDate, EndDate):

    StartDate = datetime.datetime.strptime(StartDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = datetime.datetime.strptime(EndDate, '%d/%m/%Y-%H:%M:%S')
    EndDate = EndDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = StartDate.strftime('%d/%m/%Y-%H:%M:%S')
    StartDate = str(StartDate)
    EndDate = str(EndDate)
    print(StartDate)
    print(EndDate)
    count = 0
    StartLine = 0
    EndLine = 0

    FullLogFile = open('C:\\Users\\David\\Desktop\\java.txt', 'r')
    for line, count in enumerate(FullLogFile, 1):
        if StartDate in line and StartLine == 0:
            print("Start Line {}: {}".format(count, line.strip()))
            StartLine = count
        if EndDate in line and EndLine == 0:
            print("End Line {}: {}".format(count, line.strip()))
            EndLine = count
        if StartLine != 0 and EndLine != 0:
            break  ## stop the scan once the wanted end time is reached
    count = 0
    print('start line is %d , end line is %d' % (StartLine, EndLine))
    print('total number of line is  %d' % (EndLine - StartLine))
    FullLogFile.seek(0) ## return to first line in the text file
    with open(OutputFile, 'w') as f:
        for line in FullLogFile:
            count += 1
            if StartLine <= count <= EndLine:
                f.write(line.strip() + "\r\n")
   
    return OutputFile
I get the error
argument of type 'int' is not iterable
when it enters the first loop.
Why is line an "int" and not a "string"?
Reply
#6
Sorry, it should be for count, line instead of for line, count
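To see why the order matters, here is a minimal sketch with a made-up two-line list instead of the real log file. enumerate yields (number, item) pairs with the number first, so swapping the names puts an int into line.

```python
lines = ["alpha\n", "beta\n"]

# enumerate yields (index, item) tuples; the index comes first.
pairs = list(enumerate(lines, 1))

# Correct order: count receives the number, line receives the text.
hits = [count for count, line in enumerate(lines, 1) if "beta" in line]

# Swapped order: line is now an int, so the `in` test raises TypeError.
try:
    for line, count in enumerate(lines, 1):
        "beta" in line
    message = None
except TypeError as exc:
    message = str(exc)

print(pairs)
print(hits)
print(message)
```

The exact wording of the TypeError message can differ between Python versions, but it always complains about the 'int' on the right-hand side of in.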
Reply
#7
OK, great!
Now it seems to be working faster and I don't get any memory error; the log file is ~2.1 GB.
I will wait until it is around 5 GB and see the result.

Thank you so much for the help so far.
Reply

