Python Forum

Full Version: Compare folder A and subfolder B and display files that are in folder A but not in su
Create code that compares c:\Folder-Oana\extracted\ and c:\Folder-Oana\extracted\translated\ and shows me the files that are in the first folder but not in the second. In other words, compare folder A with its subfolder B and display the files that are in folder A but not in subfolder B. I have 1880 HTML files in folder A and only 50 HTML files in folder B, so the code must list each file that makes up the difference (1880 - 50).

This is my code (2 versions). I also tried other approaches with ChatGPT, but they didn't work. I believe the problem is that this is a folder and its own subfolder, rather than two separate folders with no subfolders in them.

Version 1

import os

folder1 = r"C:\Folder-Oana\extracted"
folder2 = r"C:\Folder-Oana\extracted\translated"

# Get the list of HTML files in each folder
html_files_folder1 = [f.lower() for f in os.listdir(folder1) if f.lower().endswith('.html')]
html_files_folder2 = [f.lower() for f in os.listdir(folder2) if f.lower().endswith('.html')]

# Find the difference between the two file lists
missing_files = list(set(html_files_folder1) - set(html_files_folder2))

# Print the missing files
if missing_files:
    print("HTML files that are in folder 1 but not in folder 2:")
    for filename in missing_files:
        print(filename)
else:
    print("There are no HTML files that are in folder 1 but not in folder 2.")
Version 2

import os

folder1 = r'C:\Folder-Oana\extracted\translated'
folder2 = r'C:\Folder-Oana\extracted'

# Function that returns the list of HTML files in a folder
def get_html_files(folder):
    html_files = []
    for root, dirs, files in os.walk(folder):
        for file in files:
            if file.lower().endswith('.html'):
                html_files.append(file)
    return html_files

# Get the list of HTML files for each folder
html_files_folder1 = get_html_files(folder1)
html_files_folder2 = get_html_files(folder2)

# Check which files are in folder 1 but not in folder 2
missing_files = [file for file in html_files_folder1 if file not in html_files_folder2]

# Print the files and the folder in which each one is found
for file in missing_files:
    if file in html_files_folder1:
        print(f"File {file} is in folder {folder1}")
    if file in html_files_folder2:
        print(f"File {file} is in folder {folder2}")
For getting the list of files, use pathlib instead of os:
from pathlib import Path


home = Path(".")
base = home / "Folder-Oana"

folder1 = base / "extracted"
folder2 = base / "translated"

def get_folder_contents(dirname, filter):
    if filter:
        return [file for file in dirname.iterdir() if file.is_file() and file.suffix == filter]
    else:
        return [file for file in dirname.iterdir() if file.is_file()]

html_files_folder1 = get_folder_contents(dirname=folder1, filter=".html")
html_files_folder2 = get_folder_contents(dirname=folder2, filter=".html")
Note that both paths are sub-directories of Folder-Oana.
In version 2, get_html_files uses os.walk, so it returns all files in the folder plus all files in its subdirectories. That cannot work if you want to find out which files are in C:\Folder-Oana\extracted but not in C:\Folder-Oana\extracted\translated. Every file found under translated is also found when walking extracted, because translated is a subdirectory of extracted, so the two lists never differ in the way you expect.
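To illustrate the point about subdirectories, here is a minimal sketch that looks only at the top level of each folder, so files inside translated are not counted as part of extracted. The function name files_missing_from_subfolder and its suffix parameter are made up for this example, and comparing names in lower case is an assumption that suits Windows.

```python
from pathlib import Path


def files_missing_from_subfolder(parent, child, suffix=".html"):
    """Return names of files directly inside `parent` that have no
    same-named file directly inside `child` (names compared in lower case)."""

    def names(folder):
        # iterdir() lists only the immediate entries; it does not descend
        # into subfolders the way os.walk() does.
        return {
            p.name.lower()
            for p in Path(folder).iterdir()
            if p.is_file() and p.suffix.lower() == suffix
        }

    return sorted(names(parent) - names(child))


if __name__ == "__main__":
    # Paths taken from the question; nothing is printed if they do not exist.
    folder_a = Path("C:/Folder-Oana/extracted")
    folder_b = folder_a / "translated"
    if folder_a.is_dir() and folder_b.is_dir():
        for name in files_missing_from_subfolder(folder_a, folder_b):
            print(name)
```

Because the names are lower-cased before the set difference, the printed names are lower case too. Keep a mapping from lower-case name to the original Path if you need the real file names back.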

I don't know why version 1 wouldn't work. It worked fine for me.
import os

a = set(f.lower() for f in os.listdir(".") if f.lower().endswith(".py"))
b = set(f.lower() for f in os.listdir("./test") if f.lower().endswith(".py"))

print("A or B", *(a | b), sep="\n")
print("", "A and B", *(a & b), sep="\n")
print("", "A but not B", *(a - b), sep="\n")
print("", "B but not A", *(b - a), sep="\n")
Output:
A or B
junk.py
console.py
junk2.py
junk3 copy.py
junk3.py
pythonhighlighter.py
interactiveconsole.py
sqlite_demo.py
monkeypatching.py

A and B
junk.py
junk2.py
junk3.py

A but not B
console.py
pythonhighlighter.py
interactiveconsole.py
sqlite_demo.py
monkeypatching.py

B but not A
junk3 copy.py
There may be a slight problem with Larz60+'s code. Files with the extension ".HTML" will not be included in the list, because the suffix comparison is case-sensitive and ".html" != ".HTML".
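One possible way around the case issue is to lower-case both sides before comparing. This is only a sketch: get_folder_contents_ci is a hypothetical variant of the function above, and the parameter is called suffix here so that it does not shadow the built-in filter.

```python
from pathlib import Path


def get_folder_contents_ci(dirname, suffix=None):
    """List the files directly inside `dirname`; if `suffix` is given,
    keep only files whose extension matches it, ignoring case."""
    files = [p for p in Path(dirname).iterdir() if p.is_file()]
    if suffix:
        wanted = suffix.lower()
        # p.suffix keeps the case used on disk, so normalise it as well.
        files = [p for p in files if p.suffix.lower() == wanted]
    return files
```

Called with suffix=".html", this should pick up page.html and PAGE2.HTML alike, while still returning the Path objects with their original spelling.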

Stop using "\" and start using "/" for the separator in file paths. Windows accepts "/" and it eliminates the confusion of "\" maybe being the start of an escape sequence.
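A small sketch of the escape-sequence problem. The path C:\temp\new is invented for the demonstration, and PureWindowsPath is used only so the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

escaped = "C:\temp\new"   # "\t" is read as a tab and "\n" as a newline
raw = r"C:\temp\new"      # a raw string keeps the backslashes as typed
forward = "C:/temp/new"   # forward slashes need no escaping at all

# pathlib treats both separators as equivalent for Windows paths.
same_path = PureWindowsPath(raw) == PureWindowsPath(forward)

print(repr(escaped))
print(same_path)
```

The first string silently contains a tab and a newline instead of two folder separators, which is the kind of confusion the forward-slash spelling avoids.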
Try glob?

import glob
path1 = '/home/pedro/summer2021/**/'
path2 = '/home/pedro/summer2021/EC/*'
all_files1 = glob.glob(path1 + '*.odt')
all_files2 = glob.glob(path2 + '*.odt')
intersect = list(set(all_files1).intersection(set(all_files2)))
exceptions = [f for f in all_files1 if f not in intersect]
len(exceptions) #26
len(all_files1) #33
len(intersect) #7
If memory is a problem you can use iglob()

all_files = glob.iglob(path1 + '*.odt')
iglob returns a generator, not a list.
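A rough sketch of what that means in practice: the result of iglob is consumed one path at a time and can be iterated only once. The helper names iter_matching_names and count_matching are invented for this example.

```python
import glob
import os


def iter_matching_names(folder, pattern="*.odt"):
    """Yield the base name of each file in `folder` that matches `pattern`."""
    for path in glob.iglob(os.path.join(folder, pattern)):
        yield os.path.basename(path)


def count_matching(folder, pattern="*.odt"):
    # sum() consumes the names one by one without building a list of them.
    return sum(1 for _ in iter_matching_names(folder, pattern))
```

If you need to go over the results more than once, or need len(), wrap the call in list() or just use glob.glob().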