Posts: 5
Threads: 1
Joined: Aug 2022
Aug-14-2022, 07:24 PM
(This post was last modified: Aug-14-2022, 07:24 PM by Gribouillis.)
Hi forum, I'm new to coding. I have written a script that reads .docx files. In it, a function runs a search, assigns the result to a variable, and appends it to a list that is then written to an Excel file. My problem is that the function is called once for each .docx file in the folder, but every call processes the same file instead of a different one. So my output is an Excel file with the same info twice (I have 2 .docx files in the folder).
How do I fix this? I have tried multiple approaches without success.
Attached Files
CAT.py (Size: 3.8 KB / Downloads: 236)
Posts: 6,779
Threads: 20
Joined: Feb 2020
Your problem is that first you do this:

for f in os.listdir(path):
    if f.endswith('.docx'):
        files.append(f)
for i in range(len(files)):
    text = docx2txt.process(files[i])
    text2 = text.replace(":", " ")
    text3 = text2.replace(",", " ")
    text4 = text3.replace("_", " ")
    data = text4.split()

Then later on you do this:

#Sends vet list to string
for j in data:
    vet += j + ", "

Was it your plan for vet to concatenate the results for all the files? That is not what happens. Your program only uses data from the last docx file. You should combine finding, processing and appending into one loop. Like this:

vet = ""
for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f)
        text = text.replace(":", " ")
        text = text.replace(",", " ")
        text = text.replace("_", " ")
        data = text.split()
        vet += ", ".join(data)
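To see why only the last file's data survived in the original two-loop version, here is a minimal sketch. The dictionary `fake_docs` and its contents are hypothetical stand-ins for the text that docx2txt would extract; nothing here reads a real .docx file.

```python
# Hypothetical stand-ins for the text extracted from two .docx files.
fake_docs = {
    "a.docx": "Transport: Alpha Contact",
    "b.docx": "Transport: Beta Contact",
}

# Same shape as the original script: `data` is overwritten on every pass,
# so after the loop it only holds the words of the last file.
for name in fake_docs:
    data = fake_docs[name].replace(":", " ").split()

# The second loop runs once, after the first loop has finished.
vet = ""
for word in data:
    vet += word + ", "

print(vet)  # contains "Beta" but not "Alpha"
```

Because the string is built after the first loop has ended, anything read from earlier files is already gone.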
Posts: 5
Threads: 1
Joined: Aug 2022
Aug-14-2022, 11:32 PM
(This post was last modified: Aug-14-2022, 11:41 PM by Larz60+.)
Okay thanks, no it wasn't my plan. I am still learning. I have replaced my code with what you came up with :) But now, for some reason, my search function isn't working for the second .docx file. It gives me this (see the picture attachment) instead of only the words "Neveu Transport" in the Excel file.
I am guessing there is something wrong with the line vet += ", ".join(data)?
Thank you for your help.

vet = ""
for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f)
        text = text.replace(":", " ")
        text = text.replace(",", " ")
        text = text.replace("_", " ")
        data = text.split()
        vet += ", ".join(data)

Larz60+ wrote Aug-14-2022, 11:41 PM: Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Attached Files
Thumbnail(s)
CAT - Copy 2.py (Size: 3.65 KB / Downloads: 94)
Posts: 6,779
Threads: 20
Joined: Feb 2020
Aug-15-2022, 02:06 AM
(This post was last modified: Aug-15-2022, 02:06 AM by deanhystad.)
That fixed the first error, only processing one file. There are more.
The next error is that all the docx results are appended to vet. Maybe you want to process each file independently? That would look like this:

def ma(vet):
    ...

for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
        ma(", ".join(text.split()))

Please try to follow forum rules and post code by pasting into your post surrounded by Python tags.
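A small sketch of why appending every file to one vet string breaks the search: the `(.*)` in the patterns is greedy, so on a combined string it can stretch from the first file into the second. The sample strings and the helper name `first_group` are made up for illustration.

```python
import re

def first_group(pattern, string):
    """Return group 1 of the first match with commas removed, or '' if no match."""
    match = re.search(pattern, string)
    return match.group(1).replace(",", "") if match else ""

# Hypothetical per-file strings, shaped like the comma-joined `vet` in the thread.
vet_a = "Transport, Alpha, Contact, x"
vet_b = "Transport, Beta, Contact, y"

# Searching each file on its own gives one clean value per file.
per_file = [first_group('Transport, (.*)Contact', v) for v in (vet_a, vet_b)]

# Searching the concatenation lets the greedy (.*) span both files.
combined = first_group('Transport, (.*)Contact', vet_a + vet_b)

print(per_file)
print(combined)
```

Calling the search once per file keeps each match inside its own document.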
Posts: 5
Threads: 1
Joined: Aug 2022
Great! Everything is working now:)
thank you so much
Posts: 6,779
Threads: 20
Joined: Feb 2020
Your data should be organized as a list of lists, not 12 independent lists. I would have ma() return a row (list) and data1 would be a list of rows.
Something like this:

import os
import re

import docx2txt
import pandas as pd

def my_match(string, pattern):
    """Find pattern in string. Return first "group" stripped of commas"""
    match = re.search(pattern, string)
    if match:
        return match.group(1).replace(",", "")
    return ""

def ma(data):
    vet = ", ".join(data)
    # my_match() takes the string first, then the pattern
    return [
        my_match(vet, 'Transport, (.*)Contact'),
        "",
        data[0],
        my_match(vet, 'Date, (.*)Numéro'),
        my_match(vet, 'Prix, (.*)Prix'),
        my_match(vet, 'Compte, (.*)I.D'),
        my_match(vet, 'DBS, (.*)Transport'),
        "",
        "",
        my_match(vet, 'par, (.*)De'),
        my_match(vet, 'De, (.*)À'),
        my_match(vet, 'À, (.*)Attention')
    ]

# path is the folder holding the .docx files, as set earlier in your script
data1 = []
for f in os.listdir(path):
    if f.endswith('.docx'):
        text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
        data1.append(ma(text.split()))

columns = [
    "Transporteur",
    "#Fournisseur",
    "FT#",
    "Date ceuillette",
    "Prix",
    "GL",
    "PO#",
    "IMACS/CC/W/O",
    "Notes si requis",
    "Transport demandé par",
    "Origine",
    "Destination",
]
# columns must be passed by keyword; the second positional argument is the index
df1 = pd.DataFrame(data1, columns=columns)

When you see yourself typing the same thing over and over:

result = re.search('Transport, (.*)Contact', vet)
result_1 = (result.group(1)).replace(",", "")
Tra = result_1

Write a function.

def my_match(string, pattern):
    """Find pattern in string. Return first "group" stripped of commas"""
    match = re.search(pattern, string)
    if match:
        return match.group(1).replace(",", "")
    return ""

The function reduces typing and chances for typing errors. The function body makes it easy to document the important processing that you are repeating over and over. The function makes it easy to add functionality. Here I check if a match is found and return an empty string if it isn't.
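As a quick illustration of what the missing-match check buys you, the sketch below calls my_match() on a made-up sample string (the contents of `vet` here are an assumption, not data from the attached files) and compares it with the unguarded version, which fails when re.search() returns None.

```python
import re

def my_match(string, pattern):
    """Find pattern in string. Return first "group" stripped of commas"""
    match = re.search(pattern, string)
    if match:
        return match.group(1).replace(",", "")
    return ""

# Hypothetical sample in the same comma-joined shape as `vet`.
vet = "Transport, Neveu, Transport, Contact, Date, 2022-08-14, Numéro"

found = my_match(vet, 'Transport, (.*)Contact')   # pattern is present
missing = my_match(vet, 'Prix, (.*)Prix')         # pattern is absent

# Without the check, a missing pattern raises AttributeError on None.
try:
    re.search('Prix, (.*)Prix', vet).group(1)
    unguarded_failed = False
except AttributeError:
    unguarded_failed = True

print(repr(found), repr(missing), unguarded_failed)
```

With the guard in place, a document that lacks one field simply leaves that cell empty instead of stopping the whole run.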
Posts: 5
Threads: 1
Joined: Aug 2022
Aug-22-2022, 06:04 PM
(This post was last modified: Aug-22-2022, 06:05 PM by mathew_31.)
So I'm running into a problem: I want to make this code run every 10 seconds, for example. For some reason Python doesn't recognise the variable "text" when it is used inside a function.
import os
import docx2txt
import re
import pandas as pd
import numpy as np
import openpyxl
import time
import schedule
#variables
path = r"C:\Users\eschbachm\OneDrive - EXP\Desktop\test"
os.chdir(path)
#Colonne total
col1 = []
col2 = []
col3 = []
col4 = []
col5 = []
col6 = []
col7 = []
col8 = []
col9 = []
col10 = []
col11 = []
col12 = []
#lists
vet = ""
#Sends vet list to string
for j in vet:
    vet += j + ", "
def ma(vet):
    #Colonne 1
    result = re.search('Transport, (.*)Contact', vet)
    result_1 = (result.group(1)).replace(",", "")
    Tra = result_1
    col1.append(Tra)
    #Colonne 2
    VQ = ''
    col2.append(VQ)
    #Colonne 3
    result = re.search('(.*)LOCATION', vet)
    result_3 = (result.group(1)).replace(",", "")
    FT = result_3
    col3.append(FT)
    #Colonne 4
    result = re.search('Date, (.*)Numéro', vet)
    result_4 = (result.group(1)).replace(",", "")
    Date = result_4
    col4.append(Date)
    #Colonne 5
    result = re.search('Prix, (.*)Modèle', vet)
    result_5 = (result.group(1)).replace(",", "")
    Prix = result_5
    col5.append(Prix)
    #Colonne 6
    result = re.search('Compte, (.*)Accessoires', vet) #recherche valeur de GL
    result_6 = (result.group(1)).replace(",", "")
    GL = result_6
    #Colonne 7
    result = re.search('DBS, (.*)Transport', vet) #recherche valeur de PO
    result_7 = (result.group(1)).replace(",", "")
    PO = result_7
    a = 0
    b = 0
    c = 0
    for line in text:
        # checking string is present in line or not
        if GL != "" and PO != "": #si Gl et PO sont present en meme temps
            a = 1
        elif GL != "": #si Gl est present et non PO
            b = 2
        elif PO != "": #si PO est present et non GL
            c = 3
            break
    if a == 0:
        pass
    else: #si Gl et PO sont present en meme temps
        col6.append(GL)
        col7.append(PO)
    if b == 0:
        pass
    else: #si Gl est present et non PO
        col6.append(GL)
        col7.append('')
    if c == 0:
        pass
    else: #si PO est present et non GL
        col6.append('')
        col7.append(PO)
    if a == 0 and b == 0 and c == 0:
        col6.append('')
        col7.append('')
    #Colonne 8
    IMACS = ''
    col8.append(IMACS)
    #Colonne 9
    Notes = ''
    col9.append(Notes)
    #Colonne 10
    result = re.search('par, (.*)De', vet) #recherche valeur de DP
    result_10 = (result.group(1)).replace(",", "")
    DP = result_10
    col10.append(DP)
    #Colonne 11
    result = re.search('De, (.*)À', vet) #recherche valeur de origine
    result_11 = (result.group(1)).replace(",", "")
    ORI = result_11
    col11.append(ORI)
    #Colonne 12
    result = re.search('À, (.*)Prix', vet) #recherche valeur de destination
    result_12 = (result.group(1)).replace(",", "")
    DEST = result_12
    col12.append(DEST)
def run():
    for f in os.listdir(path):
        if f.endswith('.docx'):
            text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
            ma(", ".join(text.split()))
    print('Transfert de donnée Réussi!')
# Creating the first Dataframe using dictionary
data1 = {
"Transporteur": col1,
"#Fournisseur": col2,
"FT#": col3,
"Date ceuillette": col4,
"Prix": col5,
"GL": col6,
"PO#": col7,
"IMACS/CC/W/O": col8,
"Notes si requis": col9,
"Transport demandé par": col10,
"Origine": col11,
"Destination": col12}
df1 = pd.DataFrame(data=data1)
df1 = df1.sort_values('Date ceuillette', ascending=True)
# load df to existing excel
with pd.ExcelWriter('output.xlsx', mode='a', if_sheet_exists="replace") as writer:
    df1.to_excel(writer, sheet_name='Sheet_name1')
schedule.every(10).seconds.do(run)
while 1:
    schedule.run_pending()
    time.sleep(2)

Python gives me this error:

Traceback (most recent call last):
  File "C:\Users\eschbachm\OneDrive - EXP\Desktop\code\CAT - Version Finale.py", line 149, in <module>
    run()
  File "C:\Users\eschbachm\OneDrive - EXP\Desktop\code\CAT - Version Finale.py", line 123, in run
    ma(", ".join(text.split()))
  File "C:\Users\eschbachm\OneDrive - EXP\Desktop\code\CAT - Version Finale.py", line 69, in ma
    for line in text:
NameError: name 'text' is not defined. Did you mean: 'next'?
Posts: 6,779
Threads: 20
Joined: Feb 2020
Complaining that text is not defined is a valid complaint. There is no variable named "text" defined in ma() or in the global scope. There is a variable named "text" defined in run(), but that variable, like all local variables in a function, is not visible outside run().
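The scoping rule described here can be reproduced in a few lines. This is only a sketch; the function names `run_broken`, `helper_broken`, `run_fixed` and `helper_fixed` are invented for the demonstration and are not from your script.

```python
def helper_broken():
    # `text` is not a local here and no global of that name exists.
    return len(text.split())

def run_broken():
    text = "some words"      # local to run_broken, invisible elsewhere
    return helper_broken()

def helper_fixed(text):
    # The value arrives as an argument, so the name exists in this function.
    return len(text.split())

def run_fixed():
    text = "some words"
    return helper_fixed(text)

error = None
try:
    run_broken()
except NameError as exc:
    error = str(exc)

result = run_fixed()
print(error)
print(result)
```

Passing the value as an argument is usually preferable to making it global, since each function then states exactly what it needs.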
Looking at your initial code, ma() used "data" and "vet". You got data by doing this:

text = docx2txt.process(files[i])
text2 = text.replace(":", " ")
text3 = text2.replace(",", " ")
text4 = text3.replace("_", " ")
data = text4.split()

which is the same as:

text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
data = text.split()

And you got vet by doing this:

vet = ""
for j in data:
    vet += j + ", "

which is the same as this, apart from the trailing ", " that the loop leaves at the end:

", ".join(text.split())

Since ma() needs both data and vet, and vet is easily created from data, I think it makes more sense to pass data to ma() and have ma() create vet.

def ma(data):
    vet = ", ".join(data)
    ...

def run():
    for f in os.listdir(path):
        if f.endswith('.docx'):
            text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
            ma(text.split()) # This creates data from text and passes it to ma
    print('Transfert de donnée Réussi!')
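For reference, here is a short check of how the manual loop and str.join() relate. The word list is a made-up example; the only difference between the two results is the separator the loop leaves at the end.

```python
# Hypothetical word list standing in for `data`.
data = ["Transport", "Neveu", "Contact"]

# Manual concatenation, as in the original script.
looped = ""
for j in data:
    looped += j + ", "

# The join version puts the separator only between items.
joined = ", ".join(data)

print(repr(looped))
print(repr(joined))
```

For the searches in this thread the trailing separator makes no practical difference, since every pattern ends on a keyword rather than at the end of the string.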
Posts: 5
Threads: 1
Joined: Aug 2022
Aug-22-2022, 08:33 PM
(This post was last modified: Aug-22-2022, 08:33 PM by mathew_31.)
Hi, thanks a lot for helping me in your free time.
I am still not understanding this:

def ma(data):
    vet = ", ".join(data)
    ...
    for line in text:
        # checking string is present in line or not
        ...

def run():
    for f in os.listdir(path):
        if f.endswith('.docx'):
            text = docx2txt.process(f).replace(":", " ").replace(",", " ").replace("_", " ")
            ma(text.split())
    print('Transfert de donnée Réussi!')

Python is still throwing an error because "text" in ma() is not defined.
How can I link the two "text" variables in both functions?
I have made the changes you suggested but am still not successful :(
Posts: 6,779
Threads: 20
Joined: Feb 2020
That is because there is no "text" in ma(). Look at the link to your code in your first post. In that code ma() does not use "text" anywhere, it uses "data". The only difference is that now instead of using global variables you are passing "data" as an argument to ma(data) and inside ma(data) you create "vet".