Python Forum
Search Excel File with a list of values
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Search Excel File with a list of values
#1
I'm trying to search for values from a list of values within a csv file. That CSV comes from the system which contains details about the machine that was read the values from. The machine has 2 parts, 1 which has the department name where the machine is located and another is the task that said machine does. Eg,

Machining & Robotics Dept., ID:C1878/23, ........other sensor reads

This information will be contained within the csv along with a ton of other reads from the sensors. I want to file these csv's with the machine part as the file name by matching the contents in the csv file with a central excel file that contains just the department names and the machine part ID, both of which must match to be filed under said part name.

How do I match the contents of the csv file to that of the excel file with a whole lot of other information that come in the csv file?
Reply
#2
I would read the excel spreadsheet and extract a list of department, machine ID strings. I would format the strings to match the format of the department, machine ID in the csv files. Something like this:
df = pd.read_excel("machines.xlsx")
machine_names = [
    f"{dpt}, {id}" for dpt, id in zip(df["Department"].values, df["Machine"].values)
]
print(machine_names)
I assumed the spreasheet filename is "machines.xlsx", that it has column headers, and that the column headers for the department and machine ID columns are "Department" and "Machine". If the spreadsheet doesn't have column headers you will use index values 0, 1, 2 for spreadsheet columns A, B, C, like this:
df = pd.read_excel("machines.xlsx", header=None)
machine_names = [f"{dpt}, {id}" for dpt, id in zip(df[0].values, df[1].values)]
Now you can use this list to check for matches in the CSV files. Something like this:
csv_folder = '.'
for filename in Path(csv_folder).glob("*.csv"):
    with open(filename, "r") as file:
        values = file.readline().split(",")
    if len(values) > 1:
        machine_name = f"{values[0].strip()}, {values[1].strip()}"
        if machine_name in machine_names:
            print(machine_name)   # Match found.  Do your processing here
This code assumes all the csv files will be in some folder ("." is the current working directory) and that the files all have the extensions ".csv". It opens the file, splits the first line into columns using a comma as the delimiter, and makes a string using the department name and machine ID. This format must match the format of the strings constructed from the excel spreadsheet.

To determine if the CSV file is a valid machine, test if the csv machine string matches any of the machine strings from the spreadsheet.
Reply
#3
(Oct-27-2023, 06:17 PM)deanhystad Wrote: I would read the excel spreadsheet and extract a list of department, machine ID strings. I would format the strings to match the format of the department, machine ID in the csv files. Something like this:
df = pd.read_excel("machines.xlsx")
machine_names = [
    f"{dpt}, {id}" for dpt, id in zip(df["Department"].values, df["Machine"].values)
]
print(machine_names)
I assumed the spreasheet filename is "machines.xlsx", that it has column headers, and that the column headers for the department and machine ID columns are "Department" and "Machine". If the spreadsheet doesn't have column headers you will use index values 0, 1, 2 for spreadsheet columns A, B, C, like this:
df = pd.read_excel("machines.xlsx", header=None)
machine_names = [f"{dpt}, {id}" for dpt, id in zip(df[0].values, df[1].values)]
Now you can use this list to check for matches in the CSV files. Something like this:
csv_folder = '.'
for filename in Path(csv_folder).glob("*.csv"):
    with open(filename, "r") as file:
        values = file.readline().split(",")
    if len(values) > 1:
        machine_name = f"{values[0].strip()}, {values[1].strip()}"
        if machine_name in machine_names:
            print(machine_name)   # Match found.  Do your processing here
This code assumes all the csv files will be in some folder ("." is the current working directory) and that the files all have the extensions ".csv". It opens the file, splits the first line into columns using a comma as the delimiter, and makes a string using the department name and machine ID. This format must match the format of the strings constructed from the excel spreadsheet.

To determine if the CSV file is a valid machine, test if the csv machine string matches any of the machine strings from the spreadsheet.

Thanks for the response. I'm finding the script taking a lot of time creating pairs (I'm assuming that's what is happening with machine_names) with 16000 entries. I'm not sure if this approach would work, but I was thinking of filtering the dataframe based on the occurrence of the word from the wordlist. So there will be 2 filters that will need to be applied, first being, filtering based on department name and the second based on machine id. I got the first filter correct, where I was able to filter the dataframe based on the department name, but I'm finding problem filtering the next based on the machine id. I know the cause as the csv seems to be having random strings and values attached with the machine ID. I was thinking of using regex, but I need to do the opposite of the first filter, by matching the values of ID column from the dataframe with the wordlist. How do I go about doing the reverse?
Reply
#4
I don't understand. Your excel spreadsheet with the department and machine ID has 16000 rows? That is a lot of machines! But even if there are 16000 machines, I measured it only takes 1 second to make all the "department, machine" strings and that only happens 1 time. After making all the names it can start scanning the machine data files and checking if the department, machine is valid.

I was hoping you could split the csv file into columns, combine the first two, and check if the resulting string exists int he department, machine list. Now it sounds like you need to do the reverse and check if an entry in the department, machine list matches start of the CSV file. Maybe something like this:
import pandas as pd

df = pd.read_excel("data.xlsx")
machine_names = [
    f"{dpt}, {id}" for dpt, id in zip(df["Department"].values, df["Machine"].values)
]

csv_folder = '.'
for filename in Path(csv_folder).glob("*.csv"):
    with open(filename, "r") as file:
        line = file.readline()
        # Can we find a machine that matches the start of the csv file?
        for machine in machine_names:
            if line.startswith(machine):
                print(line)   # Match found.  Do your processing here
                break
It may be helpful if you posted a short example of the excel file and a couple sample csv files.
Reply
#5
I was thinking about filtering the dataframe columns using the wordlist. That way, I wouldn't have to create a tuple of all values from the excel. I got 1 filter, the department name down using the below code:

for line in res:
                x, y = line
                wordlist.append(y[0].lower())
                df = dframe[dframe['depName'].str.lower().isin (wordlist)]
What I have now are again the [wordlist] and a filtered df. The next filter, is to match the machine ID to that in CVS. However, in the CSV:

[inline]kvadsun,"te6659,3416652","w/h:1,block b,alqou-3",email:,"p.o282062,dub",[email protected],keep_trct·__223/dustrialds_buildhardware,inv_validCode,cm no.,: c23/d00134,…....,/////////alph,: 01/09/2023,vareg:0575413,sm id,: s-12,vareg,003455795___,nitpr
[/inline]

The 10th value has special character " : " in the machine ID. The current filtered df reads below:

erpNum mch_id depName
488 HO5106 259 Keep_Trct
530 HO5129 267 Keep_Trct
770 HO5290 391 Keep_Trct
975 HO5431 C23/D00134 Keep_Trct
977 HO5432 I23/D00448 Keep_Trct

Sorry I'm not sure how to add a table, but in the above filtered table, the value in mch_id @ 975 doesn't match because of the special characters. So, I wanted to use the above mch_id values as regex pattern to match against the csv. Ideally, I'd like to filter both using the regex, but the above code was a T&E to see if filtering worked. But the next filter is difficult.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Python openyxl not updating Excel file MrBean12 1 342 Mar-03-2024, 12:16 AM
Last Post: MrBean12
  Copy Paste excel files based on the first letters of the file name Viento 2 453 Feb-07-2024, 12:24 PM
Last Post: Viento
  Copying the order of another list with identical values gohanhango 7 1,169 Nov-29-2023, 09:17 PM
Last Post: Pedroski55
  Updating sharepoint excel file odd results cubangt 1 854 Nov-03-2023, 05:13 PM
Last Post: noisefloor
  trouble reading string/module from excel as a list popular_dog 0 432 Oct-04-2023, 01:07 PM
Last Post: popular_dog
  Search for multiple unknown 3 (2) Byte combinations in a file. lastyle 7 1,374 Aug-14-2023, 02:28 AM
Last Post: deanhystad
  Search Outlook Inbox for set of values cubangt 1 1,103 Jun-28-2023, 09:29 PM
Last Post: cubangt
  Comparing List values to get indexes Edward_ 7 1,179 Jun-09-2023, 04:57 PM
Last Post: deanhystad
  Reading data from excel file –> process it >>then write to another excel output file Jennifer_Jone 0 1,114 Mar-14-2023, 07:59 PM
Last Post: Jennifer_Jone
  search file by regex SamLiu 1 920 Feb-23-2023, 01:19 PM
Last Post: deanhystad

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020