Posts: 5
Threads: 1
Joined: Sep 2023
Sep-05-2023, 03:03 PM
(This post was last modified: Sep-05-2023, 04:30 PM by visedwings049.)
I am using Python 3.9
I have created two pandas data frames from csv files
product and supplier
1. I have created a product table that splits a description out into multiple columns.
2. I have created a supplier table that has the supplier product. Many times a supplier product code is in the description of a product.
3. I want to populate the product.supplier code column with any string that is contained in the supplier.product column. In this example we would have found the code widget in column 4 in the supplier table and returned the word widget in the supplier code table.
4. So i want to do this on a loop for column 0 then move on to column 1 and so on.
There will never be two examples of a supplier code in the same string, so i am not worried about overwriting a first instance with a second.
I have tried the str.contains function but this just returns true or false.
Attached Files
Thumbnail(s)
Posts: 6,826
Threads: 20
Joined: Feb 2020
What is a table? Is it a spreadsheet? Is it a table in a PDF? Is it a CSV file? Is it a pandas DataFrame, Is it a table in a database?
Posts: 5
Threads: 1
Joined: Sep 2023
(Sep-05-2023, 03:03 PM)visedwings049 Wrote: I am using Python 3.9
1. I have created a product table that splits a description out into multiple columns.
2. I have created a supplier table that has the supplier product. Many times a supplier product code is in the description of a product.
3. I want to populate the product.supplier code column with any string that is contained in the supplier.product column. In this example we would have found the code widget in column 4 in the supplier table and returned the word widget in the supplier code table.
4. So i want to do this on a loop for column 0 then move on to column 1 and so on.
There will never be two examples of a supplier code in the same string, so i am not worried about overwriting a first instance with a second.
I have tried the str.contains function but this just returns true or false.
![[Image: parse-example.png]](https://i.ibb.co/SNywm1L/parse-example.png)
I apologize as this is my first post but they are both being read as pandas dataframes from CSV.
Posts: 6,826
Threads: 20
Joined: Feb 2020
Sep-05-2023, 06:57 PM
(This post was last modified: Sep-05-2023, 06:57 PM by deanhystad.)
Something like this maybe?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
import pandas as pd
from string import ascii_letters as letters
from random import choice, choices, randint
def find_supplier(description):
intersection = set (description.split()) & suppliers
return list (intersection)[ 0 ] if intersection else None
product_table = pd.DataFrame(
[
{
"Product" : i,
"Supplier Code" : choice( "ABCDE" ),
"Description" : " " .join(choices(letters, k = randint( 5 , 10 ))),
}
for i in range ( 100 , 120 )
]
)
suppliers = set (product_table[ "Supplier Code" ].values)
supplier_table = product_table[[ "Description" ]]
supplier_table[ "Product" ] = supplier_table[ "Description" ]. map (find_supplier)
supplier_table = supplier_table[~supplier_table[ "Product" ].isna()][
[ "Product" , "Description" ]
]
print (supplier_table)
|
Output: Product Description
1 E T u V x Z E a k s
2 A K H a K P G z m l A
3 E Q L H E q J
5 B N X x b i B q D F M
8 D d U q K Y W I D
10 C U V H C f F n N z
14 C C o X u J
15 E D F e E Q u
18 B o f B P x O
This is easy to break up into individual supplier tables.
1 2 3 4 5 6 7 |
for supplier in suppliers:
print (
supplier,
supplier_table[supplier_table[ "Product" ] = = supplier].reset_index(drop = True ),
sep = "\n" ,
end = "\n\n" ,
)
|
Output: A
Product Description
0 A d j A c S o F U
1 A o A w I W
2 A z s e j A
3 A c P R w Z M A V b
D
Product Description
0 D t u P r p R v G D O
1 D j P w D h v o m w
C
Product Description
0 C n j r C r R T
B
Product Description
0 B H O P B A c r C n
1 B B g Z r z
2 B r o y g u l B A
E
Product Description
0 E P E m t Z
1 E S E Y m F a K h Z T
Posts: 5
Threads: 1
Joined: Sep 2023
Wow that is an amazing way to do it. This is very helpful i am going to play with this code and see if i can duplicate it with my data set. Thank you so much for this example i had not done anything with defining a function prior to this and that is extremely cool.
Posts: 5
Threads: 1
Joined: Sep 2023
(Sep-05-2023, 04:21 PM)visedwings049 Wrote: (Sep-05-2023, 03:03 PM)visedwings049 Wrote: I am using Python 3.9
1. I have created a product table that splits a description out into multiple columns.
2. I have created a supplier table that has the supplier product. Many times a supplier product code is in the description of a product.
3. I want to populate the product.supplier code column with any string that is contained in the supplier.product column. In this example we would have found the code widget in column 4 in the supplier table and returned the word widget in the supplier code table.
4. So i want to do this on a loop for column 0 then move on to column 1 and so on.
There will never be two examples of a supplier code in the same string, so i am not worried about overwriting a first instance with a second.
I have tried the str.contains function but this just returns true or false.
![[Image: parse-example.png]](https://i.ibb.co/SNywm1L/parse-example.png)
I apologize as this is my first post but they are both being read as pandas dataframes from CSV.
Worked beautifully and its way faster than sending everything to a different column. Had no idea you could do this in python. I was able to compare thousands of rows of data in less than 30 seconds. Thanks Again!!
Posts: 5
Threads: 1
Joined: Sep 2023
(Sep-05-2023, 06:57 PM)deanhystad Wrote: Something like this maybe?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
import pandas as pd
from string import ascii_letters as letters
from random import choice, choices, randint
def find_supplier(description):
intersection = set (description.split()) & suppliers
return list (intersection)[ 0 ] if intersection else None
product_table = pd.DataFrame(
[
{
"Product" : i,
"Supplier Code" : choice( "ABCDE" ),
"Description" : " " .join(choices(letters, k = randint( 5 , 10 ))),
}
for i in range ( 100 , 120 )
]
)
suppliers = set (product_table[ "Supplier Code" ].values)
supplier_table = product_table[[ "Description" ]]
supplier_table[ "Product" ] = supplier_table[ "Description" ]. map (find_supplier)
supplier_table = supplier_table[~supplier_table[ "Product" ].isna()][
[ "Product" , "Description" ]
]
print (supplier_table)
|
Output: Product Description
1 E T u V x Z E a k s
2 A K H a K P G z m l A
3 E Q L H E q J
5 B N X x b i B q D F M
8 D d U q K Y W I D
10 C U V H C f F n N z
14 C C o X u J
15 E D F e E Q u
18 B o f B P x O
This is easy to break up into individual supplier tables.
1 2 3 4 5 6 7 |
for supplier in suppliers:
print (
supplier,
supplier_table[supplier_table[ "Product" ] = = supplier].reset_index(drop = True ),
sep = "\n" ,
end = "\n\n" ,
)
|
Output: A
Product Description
0 A d j A c S o F U
1 A o A w I W
2 A z s e j A
3 A c P R w Z M A V b
D
Product Description
0 D t u P r p R v G D O
1 D j P w D h v o m w
C
Product Description
0 C n j r C r R T
B
Product Description
0 B H O P B A c r C n
1 B B g Z r z
2 B r o y g u l B A
E
Product Description
0 E P E m t Z
1 E S E Y m F a K h Z T
Worked beautifully and its way faster than sending everything to a different column. Had no idea you could do this in python. I was able to compare thousands of rows of data in less than 30 seconds. Thanks Again!!
Posts: 6,826
Threads: 20
Joined: Feb 2020
Thousands of rows in 30 seconds is really slow. I modified my code to process 100,000 products and it did that in 0.1 seconds. So I doubled the number of suppliers, and that only increased the time about about 10% (0.11 seconds). Next I tripled the length of the description, and that doubled the time (0.22 seconds).
What could be making your program run so slow?
Posts: 6,826
Threads: 20
Joined: Feb 2020
I was getting a warning that I ignored in my code. It doesn't cause any problems in my example, but it has potential for causing very confusing behaviors.
In my example I did this to make a new dataframe for suppliers:
1 |
supplier_table = product_table[[ "Description" ]]
|
This does not create a new dataframe. It creates a slice of the product_table dataframe. What I should have done is make a copy of that slice so that supplier_table and product_table are independent.
1 |
supplier_table = product_table[[ "Description" ]].copy()
|
|