Pyspark dataframe - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Pyspark dataframe (/thread-39859.html)
Pyspark dataframe - siddhi1919 - Apr-24-2023

We have a requirement where we need to extract a value from a column of a dataframe and then match this extracted value against another column of the same dataframe.

To visualize this, we have a dataframe:

Col1 Col2 Col3 Col4
1    A    101  arn:aws:savingsplans::104:savingsplan/f001
2    B    102
3    C    103  arn:aws:savingsplans::101:savingsplan/f002
4    D    104

Here we have to pick Col4, extract a value (for example 104), and match it against the entire data set of Col3.


RE: Pyspark dataframe - deanhystad - Apr-24-2023

What have you tried? It seems the solution is very straightforward, so maybe I don't understand what you are asking. These are a couple of different ways to get a value in a dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "Col1": range(1, 5),
        "Col2": list("ABCD"),
        "Col3": range(101, 105),
        "Col4": ("stuff", None, "stuff", None),
    }
)
print(df)

# Using the row index values: 0, 1, 2, 3
print('\ndf["Col3"][3]', df["Col3"][3], sep="\n")

# Reindexing to use Col2. Get row "D", column "Col3"
temp = df.set_index("Col2")
print('\ntemp["Col3"]["D"]', temp["Col3"]["D"], sep="\n")

# As a matrix
print("\ndf.values[3][2]", df.values[3][2], sep="\n")

Output:

   Col1 Col2  Col3   Col4
0     1    A   101  stuff
1     2    B   102   None
2     3    C   103  stuff
3     4    D   104   None

df["Col3"][3]
104

temp["Col3"]["D"]
104

df.values[3][2]
104

The first two are the same. The only difference is that in the second I changed the row index to use the values in "Col2". That lets me get the value from Col3 in the row that has "D" in Col2. The third one uses DataFrame.values to get the values as a 2D numpy array, then uses integer array indexing to get the value.


RE: Pyspark dataframe - siddhi1919 - Apr-25-2023

(Apr-24-2023, 09:17 PM)deanhystad Wrote: What have you tried? It seems the solution is very straightforward, so maybe I don't understand what you are asking.

We are looking for a solution in PySpark where we can compare/match one Col4 value against the Col3 values of the entire table.


RE: Pyspark dataframe - snippsat - Apr-25-2023

(Apr-25-2023, 07:16 AM)siddhi1919 Wrote: We are looking for a solution in PySpark where we can compare/match one Col4 value against the Col3 values of the entire table.

Next time you post, give it a try with some code first, to show some effort, and not just post the task. Something like this:

import pandas as pd

data = {
    'Col1': [1, 2, 3, 4],
    'Col2': ['A', 'B', 'C', 'D'],
    'Col3': [101, 102, 103, 104],
    'Col4': ['arn:aws:savingsplans::104:savingsplan/f001', '',
             'arn:aws:savingsplans::101:savingsplan/f002', '']
}
df = pd.DataFrame(data)
# Use regex to extract 104 and 101 from Col4
df['Col4_extracted'] = df['Col4'].str.extract(r':(\d{3}):')
# Check if 104 appears in Col3
match = df['Col3'] == int(df['Col4_extracted'].iloc[0])
print(df['Col3'][match])

Spark provides a createDataFrame(pandas_dataframe) method to convert a pandas DataFrame to a Spark DataFrame.
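Since the question asks for PySpark specifically, here is a minimal sketch of the same idea done natively in Spark, without the pandas detour. It is an illustration only: it assumes the id to extract is the run of digits between the double colon and the following colon in the ARN, and that the set of extracted ids is small enough to collect to the driver.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "A", 101, "arn:aws:savingsplans::104:savingsplan/f001"),
    (2, "B", 102, None),
    (3, "C", 103, "arn:aws:savingsplans::101:savingsplan/f002"),
    (4, "D", 104, None),
]
df = spark.createDataFrame(data, ["Col1", "Col2", "Col3", "Col4"])

# Assumed ARN layout: the id sits between '::' and the next ':'.
# regexp_extract returns '' on no match and null on null input.
extracted = df.withColumn(
    "Col4_extracted", F.regexp_extract("Col4", r"::(\d+):", 1)
)

# Collect the distinct extracted ids (assumed to be a small set),
# then match them against Col3 of the entire table.
ids = [
    int(row["Col4_extracted"])
    for row in extracted.filter(F.col("Col4_extracted") != "")
    .select("Col4_extracted")
    .distinct()
    .collect()
]
matching_rows = df.filter(F.col("Col3").isin(ids))
matching_rows.show(truncate=False)

With the sample data this keeps the rows where Col3 is 101 or 104. For large tables, a join against the extracted ids would scale better than collecting them to a list, but the isin() version keeps the sketch short.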