Python Forum

Hi, I am trying to write a script that renames PDF and Word documents in a given folder using text from inside each document.
I need them named: LASTNAME.Firstname CLIENT
The information would come from a timesheet, where people enter it into a table [attachment=2777]. Would this be possible?
This is what I have so far, but I don't know how to get the information out of the table, as each document would be different.

I am open to any suggestions on how this could work.

import os
from docx import Document
import PyPDF2

def extract_text_from_docx(docx_path):
    doc = Document(docx_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text
    return text

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        # PyPDF2 3.x removed PdfFileReader, numPages, getPage and extractText
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() or ""
    return text

def main():
    directory = "path/to/your/documents"
    for filename in os.listdir(directory):
        if filename.endswith(".docx"):
            full_path = os.path.join(directory, filename)
            new_name = extract_text_from_docx(full_path)
        elif filename.endswith(".pdf"):
            full_path = os.path.join(directory, filename)
            new_name = extract_text_from_pdf(full_path)
        else:
            continue
        # Rename the file
        os.rename(full_path, os.path.join(directory, new_name + os.path.splitext(filename)[1]))

if __name__ == "__main__":
    main()

Hi,
If your timesheet has a grid with explicit lines, it might be worth looking at pdfplumber.
I find that it handles PDFs with that particular feature very well.

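import pdfplumber

# "timesheet.pdf" is a placeholder path, not a file from this thread
with pdfplumber.open("timesheet.pdf") as pdf:
    page = pdf.pages[0]
    # use the lines and curves drawn on the page as explicit cell borders
    set1 = {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": page.curves + page.edges,
        "explicit_horizontal_lines": page.curves + page.edges,
    }
    text = page.extract_tables(table_settings=set1)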

You get a list of tables, each table is a list of rows, and each row is a list in which every element is a "field" (cell) you can use.
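To pick one field out of that nested list, something like the sketch below might do. It assumes the timesheet puts a label cell directly to the left of the value; the labels "Name" and "Client", the sample data and the helper name value_next_to are made up for illustration and are not taken from the attachment.

```python
def value_next_to(tables, label):
    """Return the cell to the right of the first cell matching label.

    tables is the nested list returned by page.extract_tables():
    a list of tables, each a list of rows, each row a list of cells
    (strings, or None for empty cells).
    """
    wanted = label.strip().lower()
    for table in tables:
        for row in table:
            # skip the last cell: it has nothing to its right
            for i, cell in enumerate(row[:-1]):
                if cell and cell.strip().lower() == wanted:
                    return (row[i + 1] or "").strip()
    return None  # label not found in any table


# made-up data in the shape extract_tables() returns (an assumption)
tables = [[["Name", "John Smith"], ["Client", "Acme"], ["Hours", None]]]
print(value_next_to(tables, "name"))    # John Smith
print(value_next_to(tables, "Client"))  # Acme
```

Matching on the label rather than on a fixed row and column should cope better with timesheets where the table layout differs from one document to the next.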
Paul

I made a docx with a table and also saved it as a PDF.

Assuming:
1. you only have 1 table per document, or the info you want is in the first table
2. the name is in row 1 column 2 in the table,

this gets the name. I think the easiest way is to put the table data in a DataFrame.

I have read that fitz (PyMuPDF) is more advanced than PyPDF2.

from docx import Document
import pandas as pd
import fitz

mydoc = '/home/pedro/myPython/pdfplumber/pdfs/table_docx.docx'
mypdf = '/home/pedro/myPython/pdfplumber/pdfs/table_docx.pdf'

# name from docx (first table, row 1, column 2)
table = Document(mydoc).tables[0]
data = [[cell.text for cell in row.cells] for row in table.rows]
df = pd.DataFrame(data)
name = df.iloc[0, 1]  # 'John Smith'

# name from pdf (first table on the first page; find_tables needs a recent PyMuPDF)
doc = fitz.open(mypdf)
tabs = doc[0].find_tables()
df = pd.DataFrame(tabs[0].extract())
name = df.iloc[0, 1]  # 'John Smith'
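Once you have the name, building the LASTNAME.Firstname CLIENT file name and doing the rename could look roughly like this. It is only a sketch: it assumes the name cell reads "Firstname Lastname", that the client comes from another cell, and that CLIENT should be upper-cased like LASTNAME. The helper names build_new_name and rename_document are my own.

```python
import re
from pathlib import Path


def build_new_name(full_name: str, client: str) -> str:
    """Turn 'John Smith' and 'Acme' into 'SMITH.John ACME'.

    Assumes the name cell reads 'Firstname Lastname': the last word is
    treated as the surname, everything before it as the first name(s).
    """
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError(f"expected first and last name, got {full_name!r}")
    first, last = " ".join(parts[:-1]), parts[-1]
    new_name = f"{last.upper()}.{first.title()} {client.strip().upper()}"
    # drop characters that Windows does not allow in file names
    return re.sub(r'[<>:"/\\|?*]', "", new_name)


def rename_document(path: Path, full_name: str, client: str) -> Path:
    """Rename path in place, keeping its extension; never overwrite."""
    target = path.with_name(build_new_name(full_name, client) + path.suffix)
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    return path.rename(target)
```

For example, rename_document(Path(mydoc), name, "Acme") would rename table_docx.docx to "SMITH.John ACME.docx" in the same folder. Refusing to overwrite an existing file is a deliberate choice, since two timesheets for the same person and client would otherwise silently replace each other.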

lisa_d

DPaul

Pedroski55