Python Forum
Encryption help: slow performance
#1
Hi,

Firstly, thanks for reading. I am not a Python coder, but I have been asked to encrypt some emails, and the performance is incredibly slow. I googled the parts to get the script going, but I estimate it will take a week before the encryption is finished. Can someone help me out?


pii_emails.txt has 11 million email addresses

import binascii, hashlib
salt = 'xxxxxxxxx'


print "Opening the file..."
target = open('HASHED_EMAILS.txt', 'w')
print "Truncating the file.  Goodbye!"
target.truncate()
with open('pii_emails.txt') as f:
    for line in f:
#        print(line.rstrip('\n'))
#        print('x')
#        print(binascii.hexlify(hashlib.pbkdf2_hmac('sha256', line.rstrip('\n'), salt, 100000)))
        target.write(binascii.hexlify(hashlib.pbkdf2_hmac('sha256', line.rstrip('\n'), salt, 100000)))
        target.write("\n")

print "And finally, we close it."
target.close()
Reply
#2
Is it slow without the encryption?
Instead of binascii.hexlify, why not just call hex() on the bytes hashlib returns? This could just be a style thing, but I get the feeling that for a file this large, we might need to do some... interesting things to get a speedup.

I guess maybe another question is what your goal is. sha256 is a one-way hash, so you'll have a file full of things you can never decrypt. That doesn't seem to serve an actual purpose, unless each line of the source file is a different email and you're generating email hashes so you can later verify the client received them correctly by comparing hashes, or something.
Reply
#3
So you have 11 million emails, and you do 100 thousand iterations on each.
How could that not be slow?
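For a back-of-envelope estimate (my own numbers, not from the thread; ~0.1 seconds per PBKDF2 call is roughly what the timings in the next post work out to):

# Back-of-envelope only; 0.1 s per hash is an assumption, measure your own.
emails = 11000000
seconds_per_hash = 0.1
print("{:.1f} days on one core".format(emails * seconds_per_hash / 86400))  # ~12.7

Which lines up with the week-plus estimate in the original post.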
Reply
#4
This isn't a "python is slow" thing.  It's not that I/O is slow.  It isn't even that your file is big.
The problem is that PBKDF2 (which feeds each line through SHA-256 100,000 times) is *designed* to be slow.

As a demonstration, here's a script that hashes itself, i.e. only 30ish lines * 2 versions * 10 timeit runs (600ish hashes total), and it STILL takes a while to run.
import binascii, hashlib, fileinput
salt = b'xxxxxxxxx'
 
# original from OP, minor modifications
def version1(fin, fout):
    # opening as write auto-truncates
    with open(fout, 'w') as target:
        with open(fin) as f:
            for line in f:
                target.write(binascii.hexlify(hashlib.pbkdf2_hmac('sha256', line.rstrip('\n').encode(), salt, 100000)).decode())
                target.write("\n")

# my... updated version.
def version2(fin, fout):
    with open(fin) as f:
        with open(fout, 'w') as out_file:
            # bind globals to locals to avoid repeated name lookups
            hmac = hashlib.pbkdf2_hmac
            local_salt = salt
            for line in f:
                print(hmac('sha256', line.encode(), local_salt, 100000).hex(), file=out_file)
            
if __name__ == '__main__':
    import sys, timeit
    fin = sys.argv[0]
    fout = "output.txt"
    print("Version 1 starting...")
    print("Time: ", timeit.timeit("version1('{0}', '{1}')".format(fin, fout), number=10, globals=globals()))
    print("Version 2 starting...")
    print("Time: ", timeit.timeit("version2('{0}', '{1}')".format(fin, fout), number=10, globals=globals()))
(More than 10 iterations would start showing my version to be faster, but my patience wears thin at 30ish seconds each. The point is that they're CloseEnough[tm], and that the RealAnswer[tm] isn't optimizing the Python code at all.)
Output:
python test.py
Version 1 starting...
Time:  32.47912279017546
Version 2 starting...
Time:  32.06627073627172
If you actually need hashes instead of encryption (because what you're doing now isn't encryption), and you actually need a hash of every line instead of just a hash of the entire file, then you might want to consider breaking the file into chunks, processing each chunk separately (i.e. on different processors/cores), and then combining the outputs into a single master list. That way at least you can compute more than one hash at once.
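A rough sketch of that chunk-and-combine idea using multiprocessing.Pool (the file names and salt are carried over from the thread, chunksize is just a tuning knob, and none of this is tested against your data):

import hashlib
from multiprocessing import Pool

salt = b'xxxxxxxxx'

def hash_line(line):
    # One PBKDF2 hash per input line, newline stripped.
    return hashlib.pbkdf2_hmac('sha256', line.rstrip('\n').encode(), salt, 100000).hex()

if __name__ == '__main__':
    with open('pii_emails.txt') as f, open('HASHED_EMAILS.txt', 'w') as out:
        with Pool() as pool:
            # imap streams results back in input order instead of building
            # an 11-million-entry list in memory.
            for digest in pool.imap(hash_line, f, chunksize=1000):
                out.write(digest + '\n')

Since each hash is CPU-bound and independent, N cores should get you close to an N-fold speedup.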
Reply
#5
@nilamo thanks for this. Yes, I did think about splitting the file.


Apologies, I realise this is not encryption. I have no choice but to hash the emails, as I need to compare them to a third party's emails, which have also been hashed, and report any matches.
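The comparison itself is cheap once both sides are hashed; something like this (third_party_hashes.txt is a hypothetical file of hex digests, one per line):

# Report the overlap between two files of hex digests, one per line.
with open('HASHED_EMAILS.txt') as ours:
    our_hashes = set(line.strip() for line in ours)

with open('third_party_hashes.txt') as theirs:
    matches = [line.strip() for line in theirs if line.strip() in our_hashes]

print(len(matches), "matches")

Note that the hashes will only ever match if both sides used the same salt and iteration count.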
Reply
#6
You should hash the keys (whatever you want those to be), not complete emails.
Reply
#7
I don't think that matters, tbh. Hashing millions of *anything* is going to take a long time.
Reply
#8
Quote:I don't think that matters, tbh. Hashing millions of *anything* is going to take a long time.

First, fast is relative.


Hashing individual keys can be extremely fast.

Case in point: my background is telecommunications (at one of the largest telecommunications companies in the world at the time).

We processed no less than 80 million calls per day, each starting out as a bit stream, then hashed, formatted into records, matched to a customer, and broken into time-of-day and date segments (billing was done on duration, number of parties, and time of day in one-minute segments), plus NPA, NXX and LATA, and much more.

That's 4,000,000 calls per minute over a complete run (all multi-record).


Typical time for a complete run (a complete day's processing) was 20 to 30 minutes. This was done in C on HP 9000s.

We did not use relational databases at this point, but rather a home-built structure that I guess would be extremely similar to a Python dictionary. I can get into the details if requested.

Of course we didn't use SHA; the hash was a modified version of the one found in the 'Dragon book' by Aho, Sethi and Ullman (not the real title, which is Compilers: Principles, Techniques, and Tools, but the book is known everywhere by this name).

Hash tables were fixed-length (usually a prime number of entries) with lateral expansion (linked lists). This proved to have amazingly fast access times.
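For illustration, a minimal Python sketch of that fixed-size, chained layout (my reconstruction of the idea, not the original C):

class Node:
    __slots__ = ('key', 'value', 'next')
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

class ChainedHashTable:
    def __init__(self, size=1009):  # a prime number of slots
        self.slots = [None] * size

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        i = self._index(key)
        node = self.slots[i]
        while node:  # update in place if the key is already chained here
            if node.key == key:
                node.value = value
                return
            node = node.next
        self.slots[i] = Node(key, value, self.slots[i])  # lateral expansion: prepend

    def get(self, key):
        node = self.slots[self._index(key)]
        while node:
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)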

To make this more interesting: although large blocks were used, the data was stored on disk.


This was all done back in 1992-1994.
Reply
#9
That 100000 in the hashlib function call is the number of iterations. It's specifically meant to be slow :)
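You can watch the cost scale with that parameter (timings are machine-dependent; the email address and salt here are made up):

import hashlib, timeit

for iterations in (1000, 10000, 100000):
    t = timeit.timeit(
        lambda: hashlib.pbkdf2_hmac('sha256', b'user@example.com', b'xxxxxxxxx', iterations),
        number=10)
    print("{:>7} iterations: {:.4f} s per hash".format(iterations, t / 10))

The runtime is roughly linear in the iteration count, which is the whole point of PBKDF2.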
Reply
#10
Split the job across more than one worker; try multiprocessing, for example.
Also, look at this topic in the old forum. It might help:

http://python-forum.org/viewtopic.php?f=6&t=20222
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply

