Python Forum
Find a string from a column of one table in another table
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Find a string from a column of one table in another table
#1
I am using Python 3.9

I have created two pandas data frames from csv files
product and supplier


1. I have created a product table that splits a description out into multiple columns.

2. I have created a supplier table that has the supplier product. Many times a supplier product code is in the description of a product.

3. I want to populate the product.supplier code column with any string that is contained in the supplier.product column. In this example we would have found the code widget in column 4 in the supplier table and returned the word widget in the supplier code table.

4. So i want to do this on a loop for column 0 then move on to column 1 and so on.

There will never be two examples of a supplier code in the same string, so i am not worried about overwriting a first instance with a second.

I have tried the str.contains function but this just returns true or false.

[Image: parse-example.png]

Attached Files

Thumbnail(s)
   
Reply
#2
What is a table? Is it a spreadsheet? Is it a table in a PDF? Is it a CSV file? Is it a pandas DataFrame, Is it a table in a database?
Reply
#3
(Sep-05-2023, 03:03 PM)visedwings049 Wrote: I am using Python 3.9

1. I have created a product table that splits a description out into multiple columns.

2. I have created a supplier table that has the supplier product. Many times a supplier product code is in the description of a product.

3. I want to populate the product.supplier code column with any string that is contained in the supplier.product column. In this example we would have found the code widget in column 4 in the supplier table and returned the word widget in the supplier code table.

4. So i want to do this on a loop for column 0 then move on to column 1 and so on.

There will never be two examples of a supplier code in the same string, so i am not worried about overwriting a first instance with a second.

I have tried the str.contains function but this just returns true or false.

[Image: parse-example.png]

I apologize as this is my first post but they are both being read as pandas dataframes from CSV.
Reply
#4
Something like this maybe?
import pandas as pd
from string import ascii_letters as letters
from random import choice, choices, randint


def find_supplier(description):
    """Return word if word in description matches a supplier code, else None."""
    intersection = set(description.split()) & suppliers
    return list(intersection)[0] if intersection else None


# Make some random table thing that we can use to search for words in the description
# that match a supplier code.
product_table = pd.DataFrame(
    [
        {
            "Product": i,
            "Supplier Code": choice("ABCDE"),
            "Description": " ".join(choices(letters, k=randint(5, 10))),
        }
        for i in range(100, 120)
    ]
)

# Get set of suppliers.
suppliers = set(product_table["Supplier Code"].values)

# Make supplier table.  Supplier table contains rows from product_table
# where one of the words in the description matches a supplier code.
supplier_table = product_table[["Description"]]
supplier_table["Product"] = supplier_table["Description"].map(find_supplier)
supplier_table = supplier_table[~supplier_table["Product"].isna()][
    ["Product", "Description"]
]
print(supplier_table)
Output:
Product Description 1 E T u V x Z E a k s 2 A K H a K P G z m l A 3 E Q L H E q J 5 B N X x b i B q D F M 8 D d U q K Y W I D 10 C U V H C f F n N z 14 C C o X u J 15 E D F e E Q u 18 B o f B P x O
This is easy to break up into individual supplier tables.
for supplier in suppliers:
    print(
        supplier,
        supplier_table[supplier_table["Product"] == supplier].reset_index(drop=True),
        sep="\n",
        end="\n\n",
    )
Output:
A Product Description 0 A d j A c S o F U 1 A o A w I W 2 A z s e j A 3 A c P R w Z M A V b D Product Description 0 D t u P r p R v G D O 1 D j P w D h v o m w C Product Description 0 C n j r C r R T B Product Description 0 B H O P B A c r C n 1 B B g Z r z 2 B r o y g u l B A E Product Description 0 E P E m t Z 1 E S E Y m F a K h Z T
Reply
#5
Wow that is an amazing way to do it. This is very helpful i am going to play with this code and see if i can duplicate it with my data set. Thank you so much for this example i had not done anything with defining a function prior to this and that is extremely cool. Cool
Reply
#6
(Sep-05-2023, 04:21 PM)visedwings049 Wrote:
(Sep-05-2023, 03:03 PM)visedwings049 Wrote: I am using Python 3.9

1. I have created a product table that splits a description out into multiple columns.

2. I have created a supplier table that has the supplier product. Many times a supplier product code is in the description of a product.

3. I want to populate the product.supplier code column with any string that is contained in the supplier.product column. In this example we would have found the code widget in column 4 in the supplier table and returned the word widget in the supplier code table.

4. So i want to do this on a loop for column 0 then move on to column 1 and so on.

There will never be two examples of a supplier code in the same string, so i am not worried about overwriting a first instance with a second.

I have tried the str.contains function but this just returns true or false.

[Image: parse-example.png]

I apologize as this is my first post but they are both being read as pandas dataframes from CSV.

Worked beautifully and its way faster than sending everything to a different column. Had no idea you could do this in python. I was able to compare thousands of rows of data in less than 30 seconds. Thanks Again!!
Reply
#7
(Sep-05-2023, 06:57 PM)deanhystad Wrote: Something like this maybe?
import pandas as pd
from string import ascii_letters as letters
from random import choice, choices, randint


def find_supplier(description):
    """Return word if word in description matches a supplier code, else None."""
    intersection = set(description.split()) & suppliers
    return list(intersection)[0] if intersection else None


# Make some random table thing that we can use to search for words in the description
# that match a supplier code.
product_table = pd.DataFrame(
    [
        {
            "Product": i,
            "Supplier Code": choice("ABCDE"),
            "Description": " ".join(choices(letters, k=randint(5, 10))),
        }
        for i in range(100, 120)
    ]
)

# Get set of suppliers.
suppliers = set(product_table["Supplier Code"].values)

# Make supplier table.  Supplier table contains rows from product_table
# where one of the words in the description matches a supplier code.
supplier_table = product_table[["Description"]]
supplier_table["Product"] = supplier_table["Description"].map(find_supplier)
supplier_table = supplier_table[~supplier_table["Product"].isna()][
    ["Product", "Description"]
]
print(supplier_table)
Output:
Product Description 1 E T u V x Z E a k s 2 A K H a K P G z m l A 3 E Q L H E q J 5 B N X x b i B q D F M 8 D d U q K Y W I D 10 C U V H C f F n N z 14 C C o X u J 15 E D F e E Q u 18 B o f B P x O
This is easy to break up into individual supplier tables.
for supplier in suppliers:
    print(
        supplier,
        supplier_table[supplier_table["Product"] == supplier].reset_index(drop=True),
        sep="\n",
        end="\n\n",
    )
Output:
A Product Description 0 A d j A c S o F U 1 A o A w I W 2 A z s e j A 3 A c P R w Z M A V b D Product Description 0 D t u P r p R v G D O 1 D j P w D h v o m w C Product Description 0 C n j r C r R T B Product Description 0 B H O P B A c r C n 1 B B g Z r z 2 B r o y g u l B A E Product Description 0 E P E m t Z 1 E S E Y m F a K h Z T
Worked beautifully and its way faster than sending everything to a different column. Had no idea you could do this in python. I was able to compare thousands of rows of data in less than 30 seconds. Thanks Again!!
Reply
#8
Thousands of rows in 30 seconds is really slow. I modified my code to process 100,000 products and it did that in 0.1 seconds. So I doubled the number of suppliers, and that only increased the time about about 10% (0.11 seconds). Next I tripled the length of the description, and that doubled the time (0.22 seconds).

What could be making your program run so slow?
Reply
#9
I was getting a warning that I ignored in my code. It doesn't cause any problems in my example, but it has potential for causing very confusing behaviors.

In my example I did this to make a new dataframe for suppliers:
supplier_table = product_table[["Description"]]
This does not create a new dataframe. It creates a slice of the product_table dataframe. What I should have done is make a copy of that slice so that supplier_table and product_table are independent.
supplier_table = product_table[["Description"]].copy()
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  drawing a table with the status of tasks in each thread pyfoo 3 424 Mar-01-2024, 09:29 AM
Last Post: nerdyaks
  How to create a table with different sizes of columns in MS word pepe 8 1,585 Dec-08-2023, 07:31 PM
Last Post: Pedroski55
  Trying to get counts/sum/percentages from pandas similar to pivot table cubangt 6 1,426 Oct-06-2023, 04:32 PM
Last Post: cubangt
  dict table kucingkembar 4 741 Sep-30-2023, 03:53 PM
Last Post: deanhystad
  Going through HTML table with selenium emont 3 817 Sep-30-2023, 02:13 AM
Last Post: emont
Thumbs Up Convert word into pdf and copy table to outlook body in a prescribed format email2kmahe 1 763 Sep-22-2023, 02:33 PM
Last Post: carecavoador
  Using pyodbc&pandas to load a Table data to df tester_V 3 830 Sep-09-2023, 08:55 PM
Last Post: tester_V
Question Using SQLAlchemy, prevent SQLite3 table update by multiple program instances Calab 3 760 Aug-09-2023, 05:51 PM
Last Post: Calab
  Color a table cell based on specific text Creepy 11 2,013 Jul-27-2023, 02:48 PM
Last Post: deanhystad
Information Showing trendline formula in a table per product Carlossxx 0 672 May-03-2023, 08:34 AM
Last Post: Carlossxx

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020