Python Forum
Dealing with strings thru mmap in Python
#1
For learning purposes, I am trying to create a script to:
-read a list of keywords and store them in an array (done)
-list the files in a directory tree (done)
-read each file (some are big, 10GB+) and search for each keyword in my array (DANG).
So here's a piece of my code:
import mmap

def search_files(keywords_array, lista):
    for i_lista in lista:
        # Open in binary mode: mmap exposes bytes, not text
        with open(i_lista, "rb") as arquivo:
            str_buffer = mmap.mmap(arquivo.fileno(), 0, access=mmap.ACCESS_READ)
            print(str(str_buffer))
            for i_keywords in keywords_array:
                # On Python 3, find() needs a bytes pattern, hence encode()
                if str_buffer.find(i_keywords.lower().encode()) != -1:
                    print(color.BOLD + "Bingo! : " + color.END + color.RED + i_keywords + color.END + " : " + i_lista)
            str_buffer.close()
Everything works fine, UNTIL I need to lowercase the file contents (str_buffer / mmap) so I can search them with my (lowercase) keywords from the array.
I just don't know how to proceed (2+ hours of Google so far).
PS: I know I can use re.search to do case-insensitive searches, but I THINK this could consume more resources than lowercasing all the file contents.
Thank you!

#2
I just read about mmap. If you set length to 0, the whole file is mapped. Mapping does not by itself read the file into RAM (pages are loaded on demand), but you are saying that the file can be more than 10GB, so scanning it will take time, and any operation that copies the contents, such as read() or lower(), may need more memory than you have. For those you have to work in chunks. Also, the docs say that on Windows you cannot create a mapping of an empty file. One more thing... Are you running this on a 32-bit system? The address space there is 2**32 bytes, which is about 4GB. You can't map more than that at once.
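A minimal sketch of how these points could be checked before mapping a file. The helper names is_64bit_python and map_whole_file are made up for illustration; the sketch assumes only the standard mmap, os and sys modules. It passes length 0 to map the whole file, and returns None for an empty file because mmap refuses to map one.

```python
import mmap
import os
import sys


def is_64bit_python():
    """True when the interpreter uses 64-bit pointers (large address space)."""
    return sys.maxsize > 2**32


def map_whole_file(path):
    """Map the whole file read-only; return None for an empty file."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            # mmap raises ValueError for an empty file, so skip it
            return None
        # length 0 means "map the entire file"
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

The returned object can be sliced like bytes (for example mm[:100]) and should be closed with mm.close() when no longer needed.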
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#3
(Feb-27-2017, 09:22 PM)wavic Wrote: I just read about mmap. If you set length to 0, the whole file is mapped. Mapping does not by itself read the file into RAM (pages are loaded on demand), but you are saying that the file can be more than 10GB, so scanning it will take time, and any operation that copies the contents, such as read() or lower(), may need more memory than you have. For those you have to work in chunks.

wavic, thank you for your input.
I really don't know if my usage of mmap is wrong; I got it from an example on Stack Overflow. It is working fine right now, but I don't know how to lowercase the buffer. I am dealing with 10GB+ files and it is working.
#4
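Why don't you use the re module?
Something like this:

for word in word_list:
    # mm is the mmap object; it holds bytes, so the pattern must be bytes too
    w = re.compile(re.escape(word.encode()), re.I)
    # search() scans the whole buffer; match() would only test the very start
    result = w.search(mm)
    print(word, result)
Someone please correct me, since I am not familiar with the re module at all.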
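A hedged sketch of that idea as a complete function, assuming Python 3 and UTF-8 encoded keywords. The name find_keywords and its return format are hypothetical. With a bytes pattern, re.IGNORECASE only folds ASCII letters, and the regex engine scans the mapped buffer directly, so no lowercased copy of the file is created.

```python
import mmap
import re


def find_keywords(path, keywords):
    """Return {keyword: offset of first case-insensitive match, or -1}."""
    hits = {}
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for kw in keywords:
            # re.escape keeps characters such as "." or "*" literal
            pattern = re.compile(re.escape(kw.encode("utf-8")), re.IGNORECASE)
            m = pattern.search(mm)  # searches the mapped bytes in place
            hits[kw] = m.start() if m else -1
    return hits
```

Usage would look like find_keywords("big.log", ["error", "timeout"]), where big.log is a placeholder file name.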
#5
Disclaimer: I know almost nothing about mmap

I think that you can convert your mmap content to lowercase this way (at least for Python 3 on Linux):
str_buffer = mmap.mmap(arquivo.fileno(), 0, access=mmap.ACCESS_COPY)
# ACCESS_COPY makes it possible to change the content in memory (the file won't be changed)

# Assumes valid UTF-8 whose lowercase form has the same byte length;
# otherwise the slice assignment raises ValueError
str_buffer[:] = str_buffer.read().decode('utf8').lower().encode('utf8')
str_buffer.seek(0)   # return to the start of the mmap
This would probably consume a lot of memory, so for a huge file you can do it in chunks:
chunksize = 1000000
index = 0
chunk = str_buffer.read(chunksize)
while chunk:
    # bytes.lower() only lowercases ASCII letters, but it keeps the length unchanged
    # and cannot fail on a multibyte character split across two chunks
    str_buffer[index:index + len(chunk)] = chunk.lower()
    index += len(chunk)
    chunk = str_buffer.read(chunksize)
With ACCESS_WRITE (and the file opened for writing, e.g. mode "r+b") it would convert the file in place (that could be quite dangerous).
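If the goal is only to search, the map does not need to be modified at all. The sketch below is one possible alternative, not the method from the posts above: it lowercases one slice at a time and searches that copy, so only about one chunk is held in memory. The function name find_lowercase and the chunk_size default are assumptions, the keyword is expected as non-empty bytes, and only ASCII letters are case-folded. Each slice is extended by len(keyword) - 1 bytes so that a match straddling a chunk boundary is still found.

```python
import mmap


def find_lowercase(mm, keyword, chunk_size=1 << 20):
    """Return the offset of the first ASCII case-insensitive match, or -1."""
    needle = keyword.lower()
    overlap = len(needle) - 1  # lets a match cross a chunk boundary
    size = len(mm)
    pos = 0
    while pos < size:
        # Slicing copies just this window; the mapping itself is untouched
        chunk = mm[pos:pos + chunk_size + overlap].lower()
        hit = chunk.find(needle)
        if hit != -1:
            return pos + hit
        pos += chunk_size
    return -1
```

With a read-only mapping this works the same way, since nothing is written back.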