Hi,
I am going to work at a government agency as a financial fraud IT detective. At the agency we confiscate computers from suspected individuals and companies. We create images of these hard drives and save them on our server. My task is to find the right files for detectives to search for evidence of fraud.
I have already learned some basic Python skills. I would like to learn Python skills for searching through a lot of files. What should I focus on? Should I use books or an online course?
Any input is much appreciated!
(Mar-05-2022, 05:14 AM)MaartenRo Wrote: I already learned some basic Python skills. I would like to learn python skills for searching through a lot of files. What should i focus on? Should i use books or a online course?
For this you should look into the os module and pathlib (a more modern way). To give an example, let's say we want to find .txt files, and that the search is recursive (all sub-folders).
from pathlib import Path

dest = r'C:\Test'
for path in Path(dest).rglob('*.txt'):
    if path.is_file():
        print(path)
The same with os.walk:
import os

dest = r'C:\Test'
for root, dirs, files in os.walk(dest):
    for file in files:
        # endswith() avoids false hits like 'notes.txt.bak'
        if file.endswith('.txt'):
            print(os.path.join(root, file))
I would say that looking mostly into pathlib (Taming the File System) is the way to go.
Also learn command-line tools like find, grep, awk, etc.
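To tie the pathlib example together with the actual goal (finding keywords for the detectives): once rglob has found candidate files, you can read each one and scan its lines. A minimal sketch; the search_keyword helper name and the C:\Test path are illustrative, not part of any library:

```python
from pathlib import Path

def search_keyword(dest, keyword, pattern='*.txt'):
    """Recursively find files matching pattern and return
    (path, line_number, line) for every line containing keyword."""
    hits = []
    for path in Path(dest).rglob(pattern):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors='ignore')
        except OSError:
            continue  # skip unreadable files instead of crashing
        for line_nr, line in enumerate(text.splitlines(), 1):
            if keyword in line:
                hits.append((path, line_nr, line.strip()))
    return hits

if __name__ == '__main__':
    for path, line_nr, line in search_keyword(r'C:\Test', 'invoice'):
        print(f'{path}:{line_nr}: {line}')
```

errors='ignore' is a pragmatic choice for forensic-style bulk scanning: it keeps the loop alive when a file is not valid text, at the cost of silently dropping undecodable bytes.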
Thank you for your answer!
Can I also use the os module or pathlib for searching for keywords in files with text, like Word, Excel or PDF? Or should I use another module for this?
(Mar-06-2022, 07:28 AM)MaartenRo Wrote: Can i also use the os module or pathlib for searching keyword in files with text, like Word, Excel or PDF? Or can i use another module for this?
You will need additional modules, used alone or in combination with the tools mentioned; these are binary files, so you need modules that can convert them into text.
Example for .pdf, as in this Thread:
import pdfplumber

pdf_file = "sample.pdf"
search_word = 'text'
with pdfplumber.open(pdf_file) as pdf:
    for page_nr, pg in enumerate(pdf.pages, 1):
        content = pg.extract_text() or ''  # extract_text() can return None
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '
                  f'at index <{content.index(search_word)}>')
Output:
<text> found at page number <1> at index <119>
<text> found at page number <2> at index <56>
Also, regex is a tool you should look more into; you saw me use it in my last post.
Regex is very powerful for all kinds of things, e.g. finding an exact match of a word, or part of it, in a file.
grep does similar stuff from the command line.
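To show what "exact match of a word" means in regex terms, here is a quick sketch (the sample sentence is made up for illustration):

```python
import re

text = "The transfer was flagged: transfers totaling 5000 EUR."
# \b marks a word boundary, so 'transfer' matches only as a whole word,
# not inside 'transfers'
pattern = re.compile(r'\btransfer\b')

for match in pattern.finditer(text):
    print(f'found <{match.group()}> at index <{match.start()}>')
# prints: found <transfer> at index <4>
```

Without the \b anchors, a plain `'transfer' in text` check (or the pattern `transfer` alone) would also hit the 'transfers' further on.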
For Word: python-docx.
For Excel I use Pandas, which is easy to use for reading (pd.read_excel()) and writing (df.to_excel()).
The DataFrame also looks similar to the Excel sheet once you have read it in.
Other modules: openpyxl | pyexcel.
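For the Excel route: after pd.read_excel() you have a DataFrame, and you can scan every cell for a keyword. A sketch under the assumption that the frame came from read_excel; here a small inline frame stands in for the file, and find_keyword_cells is just an illustrative helper name:

```python
import pandas as pd

def find_keyword_cells(df, keyword):
    """Return (row_index, column_name, value) for every cell whose
    string form contains the keyword."""
    hits = []
    for col in df.columns:
        for idx, value in df[col].items():
            if keyword in str(value):
                hits.append((idx, col, value))
    return hits

# In practice the frame would come from pd.read_excel('suspect.xlsx');
# this inline frame just stands in for it.
df = pd.DataFrame({'payee': ['Acme Ltd', 'Shell Co'],
                   'note': ['monthly invoice', 'cash transfer']})
print(find_keyword_cells(df, 'invoice'))
```

str(value) makes the scan work on numeric columns too, which matters for financial data where amounts and account numbers are often the thing you are searching for.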