Python Forum
hashlib md5 - Different hashes for requests content
#1
Hello everyone,

I have noticed some very strange behavior when computing MD5 hashes with the hashlib library (Python 3.11.9).

URLs are transferred via my application / API. Behind the URLs are assets (video, images or PDFs) that I have to download.
The MD5 hash of the asset's bytes is used to determine whether this is a new or an existing asset. If the hash already exists in the
DB, a reference is created for the webshop. Otherwise the asset is transferred to the shop as new. This is a simplified description of the project.

Up to 10000 URLs can be transferred per API request.
Now I have noticed during the tests that assets are sometimes transferred as new, although they should already exist.

Here is a simplified code snippet with which I was able to narrow down the error:

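import hashlib

import requests


def get_md5_hash(url):
    """Stream the asset at url and print the MD5 hex digest of its body."""
    asset_md5_hash = hashlib.md5()
    try:
        # The request itself can raise (timeout, connection error), so it belongs inside the try.
        r = requests.get(url, allow_redirects=True, timeout=5, stream=True)
        r.raise_for_status()
    except requests.RequestException as error:
        print(f"an error occurred, error desc: '{error}'")
    else:
        # Redundant with iter_content(), which already decodes gzip/deflate bodies.
        r.raw.decode_content = True
        try:
            # Feed the body to the hash in 1 KiB chunks instead of loading it all at once.
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    asset_md5_hash.update(chunk)
        except requests.RequestException as error:
            print(f"could not read bytes object, {error}")
        else:
            print(f"md5 hash: {asset_md5_hash.hexdigest()}")
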
If I run this function 100 times in a for loop, I get the hash “8c702e1eda4d55f4b11d1eabf7738a0e” 98 times and “46651ab690a01143cbb5279eabf0909a” 2 times.

md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
[…]
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 46651ab690a01143cbb5279eabf0909a
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 46651ab690a01143cbb5279eabf0909a
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
md5 hash: 8c702e1eda4d55f4b11d1eabf7738a0e
[…]
How can this be?
Do you see a possibility for a quick workaround?

Many thanks for your help
#2
My assumption is that what you're receiving is not identical each time. Rather than creating an MD5 hash, can you save the data to disk and check whether any of the requests return different or truncated data?
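A minimal sketch of that check, in case it helps. The names `collect_variants`, `fetch` and `out_dir` are made up for illustration; `fetch` is any zero-argument callable returning the response body as bytes, for example `lambda: requests.get(url, timeout=5).content`. Each distinct body is written to disk once, so the files can be compared afterwards by size or with a diff tool.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def collect_variants(fetch: Callable[[], bytes], runs: int, out_dir: Path) -> Dict[str, List[int]]:
    """Call fetch() `runs` times and save each distinct body once.

    Returns a mapping from MD5 hex digest to the run numbers that produced it.
    `fetch` is a hypothetical stand-in for the real download call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    seen: Dict[str, List[int]] = {}
    for run in range(runs):
        body = fetch()
        digest = hashlib.md5(body).hexdigest()
        if digest not in seen:
            # First time this body shows up: keep a copy for later inspection.
            (out_dir / f"{digest}.bin").write_bytes(body)
            seen[digest] = []
        seen[digest].append(run)
    return seen
```

If more than one file ends up in the output directory, the server (or something in between) returned different bytes, and the file sizes usually show quickly whether one of them is truncated.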
#3
Hash collisions (same hash key for different inputs) are a well-known MD5 weakness.
#4
(Dec-02-2024, 07:02 PM)deanhystad Wrote: Hash collisions (same hash key for different inputs) are a well-known MD5 weakness.

I think the concern here is that supposedly identical inputs are generating different hashes.

I'm not aware of any particular issue that makes MD5 weak against unintentional collisions. (Deliberate attacks are a different beast.)
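To illustrate the point that identical inputs always give identical hashes: the sketch below feeds the same bytes to hashlib in different chunk sizes and compares against a one-shot digest. The helper name `md5_chunked` and the payload are made up for the example. The digest only changes once the bytes themselves differ, which points at the downloaded data rather than at hashlib.

```python
import hashlib


def md5_chunked(data: bytes, chunk_size: int) -> str:
    """Hash data by feeding it to MD5 in pieces of chunk_size bytes."""
    h = hashlib.md5()
    for start in range(0, len(data), chunk_size):
        h.update(data[start:start + chunk_size])
    return h.hexdigest()


payload = bytes(range(256)) * 40  # made-up stand-in for an asset body
one_shot = hashlib.md5(payload).hexdigest()

print(md5_chunked(payload, 1024) == one_shot)       # True: chunking does not matter
print(md5_chunked(payload, 7) == one_shot)          # True: odd chunk sizes are fine too
print(md5_chunked(payload[:-1], 1024) == one_shot)  # False: one missing byte changes the hash
```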
#5
Hashing by nature has collisions. In a hash table, the initial cell is located immediately from the hash code. This requires only one access to reach the start of the data, which is why hashing is so fast. This first cell has a very good chance of being the cell you are looking for.

If it is not the first cell, various chaining methods are used to search for the proper cell. For example, the next cell might be found by adding an offset to the original hash code, thus pointing to the next cell in the collision sequence (the length of this expansion is predefined at hash table creation time), or it might be a linked list. It all depends on the algorithm being used.

I use a modified version of the hashing algorithm presented by Alfred Aho in the Dragon Book (compiler design). The modification adds a linked list for expansion. The advantage of this type of hash is that predefined expansion room for collisions is not necessary. The expansion takes place when needed, using a linked list. This has proved successful in extremely large databases (over 100 million records). Almost all chains have been very short, allowing for very quick access.

Google 'symbol tables in compiler design, Aho' for more on Aho's method, or search Google Scholar.
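For readers unfamiliar with the linked-list expansion described in this post, here is a generic sketch of separate chaining. It is not the Dragon Book algorithm and not the poster's implementation; the class names `Node` and `ChainedHashTable` are invented for the example, and Python's built-in `hash()` stands in for whatever hash function a real table would use.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Node:
    """One entry in a bucket's linked list."""
    key: str
    value: Any
    next: Optional["Node"] = None


class ChainedHashTable:
    """Fixed number of buckets; collisions grow a linked list on demand."""

    def __init__(self, buckets: int = 8) -> None:
        self._buckets: List[Optional[Node]] = [None] * buckets

    def _index(self, key: str) -> int:
        # The hash code picks the first cell directly, with no searching.
        return hash(key) % len(self._buckets)

    def put(self, key: str, value: Any) -> None:
        i = self._index(key)
        node = self._buckets[i]
        while node is not None:
            if node.key == key:
                node.value = value  # key already present: overwrite
                return
            node = node.next
        # Not found: prepend a new node, so no room has to be reserved up front.
        self._buckets[i] = Node(key, value, self._buckets[i])

    def get(self, key: str, default: Any = None) -> Any:
        node = self._buckets[self._index(key)]
        while node is not None:  # walk the (usually very short) chain
            if node.key == key:
                return node.value
            node = node.next
        return default
```

Note that this kind of collision concerns lookups in a hash table; it is a separate matter from a digest such as MD5 changing between two downloads of the same URL.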