Python Forum

For learning purposes, I am trying to create a script to:
-read a list of keywords and store them in an array (done)
-list the files in a directory tree (done)
-read each file (some are big, 10GB+) and search for each keyword in my array (DANG).
So here's a piece of my code:

import mmap

def search_files(keywords_array, lista):
    for i_lista in lista:
        arquivo = open(i_lista, "r")
        str_buffer = mmap.mmap(arquivo.fileno(), 0, access=mmap.ACCESS_READ)
        print(str(str_buffer))
        for i_keywords in keywords_array:
            # only exact-case hits are found: the file contents are not lowercased yet
            if str_buffer.find(i_keywords.lower()) != -1:
                print(color.BOLD + "Bingo! : " + color.END + color.RED + i_keywords + color.END + " : " + i_lista)

Everything works fine UNTIL I need to lowercase the file contents (str_buffer / mmap) so I can search with the (lowercase) keywords in my array.
I just don't know how to proceed (2+ hours of Googling so far).
PS: I know I can use re.search to make case-insensitive searches, but I THINK this could consume more resources than lowercasing all the file contents.
Thank you!

I just read about mmap. If you set length to 0, the whole file is mapped. You are saying that the file can be more than 10GB. The OS pages the mapping in lazily, but any step that copies the whole content (such as lowercasing it in one go) will take time, and there may not be enough memory. You have to read it in chunks. Also, the docs say that you cannot create a mapping of an empty file on Windows. One more thing... Are you running this on a 32-bit system? There the address space is 2**32 bytes, which is 4 GiB. You can't map more than that.
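
To check the last two points on your own machine, here is a minimal sketch. The helper names `is_64bit` and `map_whole_file` are made up for illustration. It tests whether the interpreter is a 64-bit build and maps a file with length 0, guarding against the empty-file case.

```python
import mmap
import os
import sys


def is_64bit():
    # sys.maxsize is 2**63 - 1 on 64-bit builds and 2**31 - 1 on 32-bit builds
    return sys.maxsize > 2**32


def map_whole_file(path):
    """Return a read-only mmap of the whole file, or None if the file is empty."""
    if os.path.getsize(path) == 0:
        # mmap refuses to map an empty file, so bail out early
        return None
    with open(path, "rb") as f:
        # length=0 maps the entire file; mmap keeps its own duplicate of the
        # file descriptor, so the mapping stays usable after f is closed
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

On a 32-bit build `is_64bit()` returns False, and mapping a 10GB file in one piece would not be possible there.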

(Feb-27-2017, 09:22 PM)wavic Wrote: I just read about mmap. [...] You are saying that the file can be more than 10GB. [...] You have to read it in chunks.

wavic, thank you for your input.
I really don't know if my usage of mmap is wrong; I got it from an example on Stack Overflow. It is working fine right now, even with 10GB+ files, but I don't know how to lowercase the buffer.

Why don't you use the re module?
Something like this:

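import re

for word in word_list:
    # an mmap holds bytes, so the pattern has to be bytes as well
    w = re.compile(re.escape(word.encode()), re.I)
    # search() scans the whole buffer; match() only checks the very start
    result = w.search(mm)
    print(word, result)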

Someone please correct me, since I am not familiar with the re module at all.
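
For reference, a small self-contained sketch of the re approach on an mmap. `find_keywords` is a made-up helper name, and I am assuming the keywords are plain ASCII text. Note that with bytes patterns, re.IGNORECASE only folds ASCII letters.

```python
import mmap
import re


def find_keywords(path, keywords):
    """Return the keywords that occur in the file, ignoring ASCII case."""
    found = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for word in keywords:
                # re.escape keeps characters like "." or "+" literal
                pattern = re.compile(re.escape(word.encode("ascii")), re.IGNORECASE)
                # search() scans the whole mapping; match() would only test offset 0
                if pattern.search(mm):
                    found.append(word)
    return found
```

The regex engine reads straight from the mapping here, so no lowercased copy of the file is created.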

Disclaimer: I know almost nothing about mmap

I think that you can convert your mmap content to lowercase this way (at least for Python 3 on Linux):

str_buffer = mmap.mmap(arquivo.fileno(), 0, access=mmap.ACCESS_COPY)
# ACCESS_COPY makes it possible to change the content in memory (the file won't be changed)

# slice assignment needs the same length, so this fails if lower() changes the UTF-8 byte count
str_buffer[:] = str_buffer.read().decode('utf8').lower().encode('utf8')
str_buffer.seek(0)   # return to the start of the mmap

That would probably consume a lot of memory, so for a huge file you can do it in chunks:

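chunksize = 1000000
index = 0
chunk = str_buffer.read(chunksize)
while chunk:
    # bytes.lower() only touches ASCII letters, so the length stays the same and
    # a multi-byte UTF-8 character split across two chunks cannot break decoding
    str_buffer[index:index+len(chunk)] = chunk.lower()
    index += len(chunk)
    chunk = str_buffer.read(chunksize)
str_buffer.seek(0)   # return to the start of the mmap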

With ACCESS_WRITE it would convert the file in place (that could be quite dangerous).
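
If the goal is only to search, another option is to leave the map untouched and lowercase one window at a time. This is a rough sketch under some assumptions: the keywords are ASCII, `search_lowercased` and `chunk_size` are names I made up, and consecutive windows overlap by the longest keyword length minus one so that a match sitting across a window boundary is not missed.

```python
import mmap
import os


def search_lowercased(path, keywords, chunk_size=1_000_000):
    """Return the set of keywords found in the file, ignoring ASCII case."""
    needles = {kw: kw.lower().encode("ascii") for kw in keywords if kw}
    if not needles or os.path.getsize(path) == 0:
        # nothing to look for, or an empty file (which cannot be mapped)
        return set()
    overlap = max(len(n) for n in needles.values()) - 1
    step = max(chunk_size, 1)  # always advance by at least one byte
    found = set()
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            start = 0
            while start < size and len(found) < len(needles):
                # slicing copies only this window; bytes.lower() folds ASCII only
                window = mm[start:start + step + overlap].lower()
                for kw, needle in needles.items():
                    if kw not in found and needle in window:
                        found.add(kw)
                start += step
    return found
```

Only one window (chunk_size plus the overlap) is held as a lowercased copy at any time, so memory use stays small even for very large files.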

doublezero

wavic

doublezero

wavic

zivoni