Hi,
I am going to work at a government agency as a financial fraud IT detective. At the agency we confiscate computers from suspected individuals and companies. We create images of these hard drives and save them on our server. My task is to find the right files for detectives to search for evidence of fraud.
I have already learned some basic Python skills. I would like to learn Python skills for searching through a lot of files. What should I focus on? Should I use books or an online course?
Any input is much appreciated!
(Mar-05-2022, 05:14 AM)MaartenRo Wrote: I already learned some basic Python skills. I would like to learn python skills for searching through a lot of files. What should i focus on? Should i use books or a online course?
For this you should look into the os module and pathlib (a more modern way). To give an example, let's say we want to find .txt files, and that the search is recursive (all sub-folders).
from pathlib import Path

dest = r'C:\Test'
for path in Path(dest).rglob('*.txt'):
    if path.is_file():
        print(path)
The same with os.walk:
import os

dest = r'C:\Test'
for root, dirs, files in os.walk(dest):
    for file in files:
        # endswith() avoids false hits like 'notes.txt.bak'
        if file.endswith('.txt'):
            print(os.path.join(root, file))
I would say that looking mostly into pathlib (Taming the File System) is the way to go.
Also learn command-line tools like find, grep, awk, etc.
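To tie the pathlib example together with the actual goal (finding keywords for the detectives): once rglob has found candidate files, you can read each one and scan its lines. A minimal sketch; the search_keyword helper name and the C:\Test path are illustrative, not part of any library:

```python
from pathlib import Path

def search_keyword(dest, keyword, pattern='*.txt'):
    """Recursively find files matching pattern and return
    (path, line_number, line) for every line containing keyword."""
    hits = []
    for path in Path(dest).rglob(pattern):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors='ignore')
        except OSError:
            continue  # skip unreadable files instead of crashing
        for line_nr, line in enumerate(text.splitlines(), 1):
            if keyword in line:
                hits.append((path, line_nr, line.strip()))
    return hits

if __name__ == '__main__':
    for path, line_nr, line in search_keyword(r'C:\Test', 'invoice'):
        print(f'{path}:{line_nr}: {line}')
```

errors='ignore' is a pragmatic choice for forensic-style bulk scanning: it keeps the loop alive when a file is not valid text, at the cost of silently dropping undecodable bytes.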
Thank you for your answer!
Can I also use the os module or pathlib for searching for keywords in files with text, like Word, Excel or PDF? Or should I use another module for this?
(Mar-06-2022, 07:28 AM)MaartenRo Wrote: Can i also use the os module or pathlib for searching keyword in files with text, like Word, Excel or PDF? Or can i use another module for this?
You will need additional modules, used alone or in combination with the tools mentioned; these are binary files, so you need modules that can convert them into text.
Example for .pdf, as in this Thread:
import pdfplumber

pdf_file = "sample.pdf"
search_word = 'text'
with pdfplumber.open(pdf_file) as pdf:
    for page_nr, pg in enumerate(pdf.pages, 1):
        content = pg.extract_text() or ''  # extract_text() can return None
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '
                  f'at index <{content.index(search_word)}>')
Output:
<text> found at page number <1> at index <119>
<text> found at page number <2> at index <56>
Also, regex is a tool you should look more into; you saw me use it in my last post.
Regex is very powerful for all kinds of things, e.g. finding an exact match of a word, or part of it, in a file.
grep does similar stuff from the command line.
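To show what "exact match of a word" means in regex terms, here is a quick sketch (the sample sentence is made up for illustration):

```python
import re

text = "The transfer was flagged: transfers totaling 5000 EUR."
# \b marks a word boundary, so 'transfer' matches only as a whole word,
# not inside 'transfers'
pattern = re.compile(r'\btransfer\b')

for match in pattern.finditer(text):
    print(f'found <{match.group()}> at index <{match.start()}>')
# prints: found <transfer> at index <4>
```

Without the \b anchors, a plain `'transfer' in text` check (or the pattern `transfer` alone) would also hit the 'transfers' further on.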
For Word: python-docx.
For Excel I use Pandas, which is easy to use for reading (pd.read_excel()) and writing (df.to_excel()).
The DataFrame also looks similar to the Excel sheet once you have read it in.
Other modules: openpyxl | pyexcel.
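For the Excel route: after pd.read_excel() you have a DataFrame, and you can scan every cell for a keyword. A sketch under the assumption that the frame came from read_excel; here a small inline frame stands in for the file, and find_keyword_cells is just an illustrative helper name:

```python
import pandas as pd

def find_keyword_cells(df, keyword):
    """Return (row_index, column_name, value) for every cell whose
    string form contains the keyword."""
    hits = []
    for col in df.columns:
        for idx, value in df[col].items():
            if keyword in str(value):
                hits.append((idx, col, value))
    return hits

# In practice the frame would come from pd.read_excel('suspect.xlsx');
# this inline frame just stands in for it.
df = pd.DataFrame({'payee': ['Acme Ltd', 'Shell Co'],
                   'note': ['monthly invoice', 'cash transfer']})
print(find_keyword_cells(df, 'invoice'))
```

str(value) makes the scan work on numeric columns too, which matters for financial data where amounts and account numbers are often the thing you are searching for.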