Python Forum

Hi forum, I'm new to coding. I have written a script that reads in .docx files. In it, a function runs a search, assigns the result to a variable, and appends it to a list that is then written to an Excel file. My problem is that the function is called once per .docx file in the folder, but every call processes the same document instead of a different one. So my output is an Excel file with the same info twice (I have 2 .docx files in the folder).

How do I fix this? I have tried several approaches without success. [attachment=1897]

Your problem is that first you do this:

for f in os.listdir(path):
    if f.endswith('.docx'):
        files.append(f)

for i in range(len(files)):
    text = docx2txt.process(files[i])
    text2 = text.replace(":", " ")
    text3 = text2.replace(",", " ")
    text4 = text3.replace("_", " ")
    data = text4.split()

Then later on you do this:

#Sends vet list to string
for j in data:
	vet += j + ", "

Was it your plan for vet to concatenate the results for all the files? That is not what happens. Your program only uses data from the last docx file. You should combine finding, processing and appending into one loop. Like this:

vet = ""
for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f)
        text = text.replace(":", " ")
        text = text.replace(",", " ")
        text = text.replace("_", " ")
        data = text.split()
        vet += ", ".join(data)
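
To see why the two-loop version only ever used the last file, here is a small sketch. The dictionary is a hypothetical stand-in for the text that docx2txt would return for two files; it is not taken from your documents.

```python
# Hypothetical stand-in for the text extracted from two .docx files
documents = {
    "a.docx": "Transport: Neveu",
    "b.docx": "Transport: Autre",
}

# Pattern from the original program: the loop finishes before data is used,
# so data only holds the words from the last file.
for name in documents:
    data = documents[name].replace(":", " ").split()
vet_last_only = ", ".join(data)

# Combined pattern: each file is processed before the loop moves on.
per_file = []
for name in documents:
    data = documents[name].replace(":", " ").split()
    per_file.append(", ".join(data))

print(vet_last_only)  # Transport, Autre
print(per_file)       # ['Transport, Neveu', 'Transport, Autre']
```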

Okay thanks, no, that wasn't my plan. I am still learning. I have replaced my code with what you came up with :) But now, for some reason, my search function isn't working for the second .docx file. It gives me this (see picture attachment) instead of only the words "Neveu Transport" in the Excel file.

I am guessing there is something wrong with the " vet += ", ".join(data)" line?

Thank you for your help.

vet = ""
for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f)
        text = text.replace(":", " ")
        text = text.replace(",", " ")
        text = text.replace("_", " ")
        data = text.split()
        vet += ", ".join(data)

That fixed the first error, only processing one file. There are more.

The next error is that all the docx results are appended to vet. Maybe you want to process each file independently? That would look like this:

def ma(vet):
    ...

# ma() must be defined before this loop runs
for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
        ma(", ".join(text.split()))

Please try to follow forum rules and post code by pasting into your post surrounded by Python tags.

Great! Everything is working now:)

thank you so much

Your data should be organized as a list of lists, not 12 independent lists. I would have ma() return a row (list) and data1 would be a list of rows.
Something like this:

def my_match(string, pattern):
    """Find pattern in string.  Return first "group" stripped of commas"""
    match = re.search(pattern, string)
    if match:
        return match.group(1).replace(",", "")
    return ""

def ma(data):
    vet = ", ".join(data)
    return [
        # my_match() takes the string first, then the pattern
        my_match(vet, 'Transport, (.*)Contact'),
        "",
        data[0],
        my_match(vet, 'Date, (.*)Numéro'),
        my_match(vet, 'Prix, (.*)Prix'),
        my_match(vet, 'Compte, (.*)I.D'),
        my_match(vet, 'DBS, (.*)Transport'),
        "",
        "",
        my_match(vet, 'par, (.*)De'),
        my_match(vet, 'De, (.*)À'),
        my_match(vet, 'À, (.*)Attention')
    ]

data1 = []
for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
        data1.append(ma(text.split()))

columns = [
    "Transporteur",
    "#Fournisseur",
    "FT#",
    "Date ceuillette",
    "Prix",
    "GL",
    "PO#",
    "IMACS/CC/W/O",
    "Notes si requis",
    "Transport demandé par",
    "Origine",
    "Destination",
]

# columns must be passed by keyword; the second positional argument is the index
df1 = pd.DataFrame(data1, columns=columns)

When you see yourself typing the same thing over and over:

    result = re.search('Transport, (.*)Contact', vet)
    result_1 = (result.group(1)).replace(",", "")
    Tra = result_1

Write a function.

def my_match(string, pattern):
    """Find pattern in string.  Return first "group" stripped of commas"""
    match = re.search(pattern, string)
    if match:
        return match.group(1).replace(",", "")
    return ""

The function reduces typing and chances for typing errors. The function body makes it easy to document the important processing that you are repeating over and over. The function makes it easy to add functionality. Here I check if a match is found and return an empty string if it isn't.
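
As a rough illustration of how the helper behaves, here is a sketch run against a made-up string. The sample text is hypothetical and only imitates the "word, word, word" shape of vet. Note that the function takes the string first and the pattern second, and that the captured text keeps the space that followed the last comma, so you may want to call strip() on the result.

```python
import re


def my_match(string, pattern):
    """Find pattern in string.  Return first "group" stripped of commas"""
    match = re.search(pattern, string)
    if match:
        return match.group(1).replace(",", "")
    return ""


# Hypothetical sample, shaped like the comma-joined words in vet
vet = "Transport, Neveu, Transport, Contact, Date, 2023-01-05, Numéro, 42"

carrier = my_match(vet, r"Transport, (.*)Contact")
pickup_date = my_match(vet, r"Date, (.*)Numéro")
missing = my_match(vet, r"Prix, (.*)Modèle")  # not in the sample text

print(repr(carrier))      # 'Neveu Transport '
print(repr(pickup_date))  # '2023-01-05 '
print(repr(missing))      # ''
```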

So I'm running into a problem: I want this code to run every 10 seconds, for example. For some reason Python doesn't recognise the variable "text" when it is used inside a function.

 
import os
import docx2txt
import re
import pandas as pd
import numpy as np
import openpyxl
import time
import schedule

#variables
path = r"C:\Users\eschbachm\OneDrive - EXP\Desktop\test"
os.chdir(path)

#Colonne total
col1 = []
col2 = []
col3 = []
col4 = []
col5 = []
col6 = []
col7 = []
col8 = []
col9 = []
col10 = []
col11 = []
col12 = []
#lists
vet = ""
        
#Sends vet list to string
for j in vet:
	vet += j + ", "     

def ma(vet):
    #Colonne 1
    result = re.search('Transport, (.*)Contact', vet)
    result_1 = (result.group(1)).replace(",", "")
    Tra = result_1
    col1.append(Tra)
    #Colonne 2
    VQ = ''
    col2.append(VQ)
    #Colonne 3
    result = re.search('(.*)LOCATION', vet)
    result_3 = (result.group(1)).replace(",", "")
    FT = result_3
    col3.append(FT)
    #Colonne 4
    result = re.search('Date, (.*)Numéro', vet)
    result_4 = (result.group(1)).replace(",", "")
    Date = result_4
    col4.append(Date)
    #Colonne 5
    result = re.search('Prix, (.*)Modèle', vet)
    result_5 = (result.group(1)).replace(",", "")
    Prix = result_5
    col5.append(Prix)
    #Colonne 6
    result = re.search('Compte, (.*)Accessoires', vet) #recherche valeur de GL
    result_6 = (result.group(1)).replace(",", "")
    GL = result_6
    #Colonne 7
    result = re.search('DBS, (.*)Transport', vet) #recherche valeur de PO
    result_7 = (result.group(1)).replace(",", "")
    PO = result_7

    a = 0
    b = 0
    c = 0
    for line in text:
        # checking string is present in line or not
        if GL != "" and PO != "": #si Gl et PO sont present en meme temps
            a = 1
        elif GL != "": #si Gl est present et non PO
            b = 2
        elif PO != "": #si PO est present et non GL
            c = 3
            break
    if a == 0:
        pass
    else: #si Gl et PO sont present en meme temps
        col6.append(GL)
        col7.append(PO)
    if b == 0:
        pass
    else: #si Gl est present et non PO
        col6.append(GL)
        col7.append('')
    if c == 0:
        pass
    else: #si PO est present et non GL
        col6.append('')
        col7.append(PO)
    if a == 0 and b == 0 and c == 0:
        col6.append('')
        col7.append('')

    #Colonne 8
    IMACS = ''
    col8.append(IMACS)
    #Colonne 9
    Notes = ''
    col9.append(Notes)
    #Colonne 10
    result = re.search('par, (.*)De', vet) #recherche valeur de DP
    result_10 = (result.group(1)).replace(",", "")
    DP = result_10
    col10.append(DP)
    #Colonne 11
    result = re.search('De, (.*)À', vet) #recherche valeur de origine
    result_11 = (result.group(1)).replace(",", "")
    ORI = result_11
    col11.append(ORI)
    #Colonne 12
    result = re.search('À, (.*)Prix', vet) #recherche valeur de destination
    result_12 = (result.group(1)).replace(",", "")
    DEST = result_12
    col12.append(DEST)

def run():
        for f in os.listdir(path):
                if f.endswith('.docx'):
                        text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
                        ma(", ".join(text.split()))
                        print('Transfert de donnée Réussi!')

# Creating the first Dataframe using dictionary
data1 = {
    "Transporteur": col1,
    "#Fournisseur": col2,
    "FT#": col3,
    "Date ceuillette": col4,
    "Prix": col5,
    "GL": col6,
    "PO#": col7,
    "IMACS/CC/W/O": col8,
    "Notes si requis": col9,
    "Transport demandé par": col10,
    "Origine": col11,
    "Destination": col12}

df1 = pd.DataFrame(data=data1)
df1 = df1.sort_values('Date ceuillette', ascending=True)

# load df to existing excel
with pd.ExcelWriter('output.xlsx', mode='a', if_sheet_exists="replace") as writer:
    df1.to_excel(writer, sheet_name='Sheet_name1')

schedule.every(10).seconds.do(run)

while 1:
        schedule.run_pending()
        time.sleep(2)

Python gives me this error:
Traceback (most recent call last):
File "C:\Users\eschbachm\OneDrive - EXP\Desktop\code\CAT - Version Finale.py", line 149, in <module>
run()
File "C:\Users\eschbachm\OneDrive - EXP\Desktop\code\CAT - Version Finale.py", line 123, in run
ma(", ".join(text.split()))
File "C:\Users\eschbachm\OneDrive - EXP\Desktop\code\CAT - Version Finale.py", line 69, in ma
for line in text:
NameError: name 'text' is not defined. Did you mean: 'next'?

Complaining that text is not defined is a valid complaint. There is no variable named "text" defined in ma() or in the global scope. There is a variable named "text" defined in run(), but that variable, like all local variables in a function, is not visible outside run().
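
A minimal sketch of the same situation, using made-up function names (run_broken, count_words_broken, count_words) rather than your code: the first pair reproduces the NameError, and the second pair avoids it by passing the value as an argument.

```python
def count_words_broken():
    # "text" is not defined in this function or at module level,
    # so looking it up raises NameError.
    return len(text.split())


def run_broken():
    text = "hello world"  # local to run_broken(), invisible elsewhere
    return count_words_broken()


def count_words(text):
    # The value arrives as a parameter, so the name exists here.
    return len(text.split())


def run():
    text = "hello world"
    return count_words(text)


try:
    run_broken()
except NameError as exc:
    error_message = str(exc)

word_count = run()
print(error_message)
print(word_count)  # 2
```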

Looking at your initial code, ma() used "data" and "vet". You got data by doing this:

    text = docx2txt.process(files[i])
    text2 = text.replace(":", " ")
    text3 = text2.replace(",", " ")
    text4 = text3.replace("_", " ")
    data = text4.split()

which is the same as:

text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
data = text.split()

And you got vet by doing this:

vet = ""
for j in data:
	vet += j + ", "

which is the same as this (except that join() does not leave a trailing ", " at the end):

", ".join(text.split())

Since ma() needs both data and vet, and vet is easily created from data, I think it makes more sense to pass data to ma() and have ma() create vet.

def ma(data):
    vet = ", ".join(data)
    ...

def run():
        for f in os.listdir(path):
                if f.endswith('.docx'):
                        text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
                        ma(text.split())  # This creates data from text and passes it to ma
                        print('Transfert de donnée Réussi!')
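
If you want to convince yourself that the loop and join() build essentially the same vet string, here is a quick check. The word list is hypothetical; the only difference is the trailing ", " that the loop leaves behind.

```python
data = ["Transport", "Neveu", "Contact"]  # hypothetical word list

# Original approach: concatenate in a loop
vet_loop = ""
for j in data:
    vet_loop += j + ", "

# Suggested approach: join the words with ", " between them
vet_join = ", ".join(data)

print(repr(vet_loop))  # 'Transport, Neveu, Contact, '
print(repr(vet_join))  # 'Transport, Neveu, Contact'
```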

Hi, thanks a lot for helping me in your free time.

I am still not understanding this:

 
def ma(data):
    vet = ", ".join(data)
    ...
    for line in text:
        # checking string is present in line or not
    ...
def run():
        for f in os.listdir(path):
                if f.endswith('.docx'):
                        text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
                        ma(text.split())
                        print('Transfert de donnée Réussi!')

Python is still throwing an error because "text" in ma() is not defined.
How can I link the two "text" variables in both functions?
I have made the changes you suggested but I am still not successful :(

That is because there is no "text" in ma(). Look at the link to your code in your first post. In that code ma() does not use "text" anywhere, it uses "data". The only difference is that now instead of using global variables you are passing "data" as an argument to ma(data) and inside ma(data) you create "vet".
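
One more observation, offered as a guess at what you intended: in the code you posted, the "for line in text" loop never uses line, and the a/b/c flags end up appending GL and PO unchanged in every case. If that reading is right, the loop can go away entirely, which also removes the need for "text" inside ma(). The sketch below uses made-up function names and sample values to compare the branching logic with simply using both values as they are.

```python
def branchy(gl, po):
    # Mirrors the a/b/c branches in the posted ma(), without the loop
    if gl != "" and po != "":
        return gl, po   # both present
    elif gl != "":
        return gl, ""   # only GL present
    elif po != "":
        return "", po   # only PO present
    return "", ""       # neither present


def direct(gl, po):
    # An empty value is already "", so no branching is needed
    return gl, po


# Hypothetical sample values covering all four combinations
cases = [(gl, po) for gl in ("", "6100") for po in ("", "PO123")]
all_equal = all(branchy(gl, po) == direct(gl, po) for gl, po in cases)
print(all_equal)  # True
```

Inside ma() that would reduce to col6.append(GL) followed by col7.append(PO).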

mathew_31

deanhystad
