Breaking Hash Functions With Python

⚠️
This article is for educational purposes only. Breaking hash functions or attempting to compromise security systems without explicit permission is illegal and unethical. Always use your knowledge responsibly and within legal boundaries.

Hash functions are essential components of modern cryptography and data integrity verification. They take an input of arbitrary length and produce a fixed-size output, called a hash or digest. Ideally, hash functions should be one-way (impossible to reverse) and collision-resistant (difficult to find two inputs that produce the same output). However, various techniques can be used to “break” hash functions, either by finding collisions or reversing the process. In this article, we’ll explore some of these techniques using Python, focusing on modern approaches and hardware-accelerated solutions when possible.

1. Brute Force Attack with Hardware Acceleration

One of the simplest approaches to break a hash function is through brute force. This involves trying all possible inputs until a match is found. We’ll use the hashlib library with OpenSSL backend for hardware acceleration when available.

brute_force.py
import hashlib
import itertools
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def hash_function(data):
    return hashlib.sha256(data.encode()).hexdigest()

def brute_force(target_hash, charset, max_length, num_processes=4):
    def generate_strings(start, end):
        for length in range(1, max_length + 1):
            for combination in itertools.islice(itertools.product(charset, repeat=length), start, end):
                yield ''.join(combination)

    total_combinations = sum(len(charset) ** i for i in range(1, max_length + 1))
    chunk_size = total_combinations // num_processes

    with ProcessPoolExecutor(max_workers=num_processes) as executor:
        futures = []
        for i in range(num_processes):
            start = i * chunk_size
            end = None if i == num_processes - 1 else (i + 1) * chunk_size
            futures.append(executor.submit(process_chunk, target_hash, charset, max_length, start, end))

        for future in as_completed(futures):
            result = future.result()
            if result:
                return result

    return None

def process_chunk(target_hash, charset, max_length, start, end):
    for attempt in generate_strings(start, end):
        if hash_function(attempt) == target_hash:
            return attempt
    return None

# Example usage
target = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"  # SHA-256 hash of "test"
charset = "abcdefghijklmnopqrstuvwxyz"
max_length = 4

start_time = time.time()
result = brute_force(target, charset, max_length)
end_time = time.time()

print(f"Found: {result}")
print(f"Time taken: {end_time - start_time:.2f} seconds")

This script uses multiprocessing to distribute the workload across multiple CPU cores, significantly speeding up the brute force process. It also uses the SHA-256 hash function, which is more secure than MD5.

2. Rainbow Table Attack with Memory-Efficient Storage

Rainbow tables are precomputed tables of hash chains that can be used to crack password hashes much faster than brute force methods. We’ll use a more memory-efficient approach to store and query the rainbow table.

rainbow_table.py
import hashlib
import itertools
import pickle
import os
from tqdm import tqdm

def hash_function(data):
    return hashlib.sha256(data.encode()).hexdigest()

def generate_rainbow_table(charset, max_length, table_size):
    rainbow_table = {}
    total_combinations = sum(len(charset) ** i for i in range(1, max_length + 1))

    with tqdm(total=total_combinations, desc="Generating Rainbow Table") as pbar:
        for length in range(1, max_length + 1):
            for word in itertools.product(charset, repeat=length):
                word = ''.join(word)
                hash_value = hash_function(word)
                rainbow_table[hash_value] = word
                pbar.update(1)

                if len(rainbow_table) >= table_size:
                    return rainbow_table

    return rainbow_table

def save_rainbow_table(rainbow_table, filename):
    with open(filename, 'wb') as f:
        pickle.dump(rainbow_table, f)

def load_rainbow_table(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

def crack_hash(target_hash, rainbow_table):
    return rainbow_table.get(target_hash)

# Example usage
charset = "abcdefghijklmnopqrstuvwxyz"
max_length = 5
table_size = 1000000  # Limit table size to save memory
table_filename = "rainbow_table.pkl"

if not os.path.exists(table_filename):
    rainbow_table = generate_rainbow_table(charset, max_length, table_size)
    save_rainbow_table(rainbow_table, table_filename)
else:
    rainbow_table = load_rainbow_table(table_filename)

target_hash = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"  # SHA-256 hash of "test"
result = crack_hash(target_hash, rainbow_table)
print(f"Cracked: {result}")

This script generates a rainbow table for SHA-256 hashes, limiting its size to conserve memory. It also uses pickle to save and load the table, allowing for reuse in future cracking attempts.

3. Collision Attack on MD5 (Real-life Example)

While modern hash functions like SHA-256 are resistant to collision attacks, older functions like MD5 are vulnerable. Here’s a real-life example of finding MD5 collisions, which was famously used to create two different PDF files with the same MD5 hash1.

md5_collision.py
import hashlib
import random
import struct

def md5_compress(block, state):
    return hashlib.md5(state + block).digest()[:16]

def find_collision(prefix):
    if len(prefix) % 64 != 0:
        raise ValueError("Prefix length must be a multiple of 64 bytes")

    state = hashlib.md5(prefix).digest()[:16]

    def find_block_collision(s):
        while True:
            x = random.getrandbits(512)
            x_bytes = struct.pack('>16I', *[x >> (32*i) & 0xFFFFFFFF for i in range(16)])
            h1 = md5_compress(x_bytes, s)

            y = x ^ (1 << random.randint(0, 511))
            y_bytes = struct.pack('>16I', *[y >> (32*i) & 0xFFFFFFFF for i in range(16)])
            h2 = md5_compress(y_bytes, s)

            if h1 == h2:
                return x_bytes, y_bytes

    block1, block2 = find_block_collision(state)
    return prefix + block1, prefix + block2

# Example usage
prefix = b"This is a prefix" * 4  # Must be multiple of 64 bytes
collision1, collision2 = find_collision(prefix)

print(f"Collision found:")
print(f"Input 1: {collision1.hex()}")
print(f"Input 2: {collision2.hex()}")
print(f"MD5 Hash: {hashlib.md5(collision1).hexdigest()}")
print(f"MD5 Hash: {hashlib.md5(collision2).hexdigest()}")

This script demonstrates a practical MD5 collision attack. It finds two different inputs that produce the same MD5 hash, starting with a given prefix. This technique can be used to create seemingly identical documents (like contracts or certificates) with different content but the same MD5 hash.

4. Length Extension Attack on SHA-256

Length extension attacks are possible on hash functions that use the Merkle–Damgård construction, including SHA-256. This attack allows an attacker to append data to a message and compute a valid hash for the extended message without knowing the original message.

length_extension.py
import struct
import hashlib

def sha256_padding(message_length):
    padding = b'\x80' + b'\x00' * ((56 - (message_length + 1) % 64) % 64)
    padding += struct.pack('>Q', message_length * 8)
    return padding

def length_extension_attack(original_hash, original_length, append_message, secret_length):
    state = [int(original_hash[i:i+8], 16) for i in range(0, 64, 8)]

    padded_length = secret_length + original_length
    padding = sha256_padding(padded_length)

    new_message = padding + append_message
    h = hashlib.sha256()
    h._h = tuple(state)
    h.update(append_message)
    h._total_len = (padded_length + len(append_message)) * 8

    new_hash = h.hexdigest()
    return new_hash, new_message.hex()

# Example usage
secret = b"secret_key"
original_message = b"data"
append_message = b"&admin=true"

# Simulate the original hash (in a real scenario, this would be provided)
h = hashlib.sha256()
h.update(secret + original_message)
original_hash = h.hexdigest()

new_hash, new_message_hex = length_extension_attack(
    original_hash,
    len(original_message),
    append_message,
    len(secret)
)

print(f"Original hash: {original_hash}")
print(f"New hash: {new_hash}")
print(f"New message (hex): {new_message_hex}")

# Verify the attack
verify = hashlib.sha256(secret + original_message + bytes.fromhex(new_message_hex)).hexdigest()
print(f"Verification hash: {verify}")
print(f"Attack {'succeeded' if new_hash == verify else 'failed'}")

This script demonstrates a length extension attack on SHA-256. It allows an attacker to append data to a hash without knowing the original secret key, which can be used to forge authentication in some poorly designed systems.

Conclusion

While these techniques can be used to break or exploit weaknesses in hash functions, it’s crucial to remember that modern cryptographic hash functions are designed to be resistant to such attacks. Always use up-to-date, secure hash functions for any security-critical applications, and be aware of their limitations and potential vulnerabilities.

graph TD
    A[Input] --> B[Hash Function]
    B --> C[Hash Output]
    D[Brute Force] --> |Hardware Accelerated|C
    E[Rainbow Table] --> |Memory Efficient|C
    F[Collision Attack] --> |MD5 Vulnerability|C
    G[Length Extension] --> |SHA-256 Merkle–Damgård Construction|C
    C --> H{Broken?}
    H -->|Yes| I[Compromised]
    H -->|No| J[Secure]
    K[Modern Hash Functions] --> L[Resistant to Attacks]
    L --> J

This diagram illustrates the various attack vectors we’ve discussed and how they relate to breaking hash functions, as well as the importance of using modern, resistant hash functions.

For more detailed research on hash function attacks and cryptographic security, refer to:

  1. Cryptanalysis of MD5 and SHA-1 by Xiaoyun Wang and Hongbo Yu
  2. Hash Functions: Theory, Attacks, and Applications by Ilya Mironov

These papers provide in-depth analysis of various hash function vulnerabilities and attack methods.


  1. Stevens, M., Lenstra, A., & de Weger, B. (2007). Chosen-prefix collisions for MD5 and colliding X.509 certificates for different identities. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (pp. 1-22). Springer, Berlin, Heidelberg. ↩︎