Python Forum
PySpark dataframe
#1
We have a requirement where we need to extract a value from one column of a DataFrame and then match that extracted value against another column of the same DataFrame.

To visualize this, we have a DataFrame:

Col1 Col2 Col3 Col4
1    A    101  arn:aws:savingsplans::104:savingsplan/f001
2    B    102
3    C    103  arn:aws:savingsplans::101:savingsplan/f002
4    D    104

Here we have to take Col4, extract a value (for example 104), and match it against the entire Col3 column.
#2
What have you tried? It seems the solution is very straightforward, so maybe I don't understand what you are asking.

These are a couple of different ways to get a value in a DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": range(1, 5),
        "Col2": list("ABCD"),
        "Col3": range(101, 105),
        "Col4": ("stuff", None, "stuff", None),
    }
)

print(df)
# Using the row index values: 0, 1, 2, 3
print('\ndf["Col3"][3]', df["Col3"][3], sep="\n")

# Reindexing to use Col2.  Get row "D", column "Col3"
temp = df.set_index("Col2")
print('\ntemp["Col3"]["D"]', temp["Col3"]["D"], sep="\n")

# As a matrix
print("\ndf.values[3][2]", df.values[3][2], sep="\n")
Output:
   Col1 Col2  Col3   Col4
0     1    A   101  stuff
1     2    B   102   None
2     3    C   103  stuff
3     4    D   104   None

df["Col3"][3]
104

temp["Col3"]["D"]
104

df.values[3][2]
104
The first two are the same. The only difference is that in the second I changed the row index to use the values in "Col2". That lets me get the value from Col3 in the row that has "D" in Col2.

The third one uses DataFrame.values to get the values as a 2D numpy array. Then I use integer array indexing to get the value.
#3
We are looking for a solution in PySpark where we can compare/match a value extracted from Col4 against the entire Col3 column of the table.





(Apr-24-2023, 09:17 PM)deanhystad Wrote: What have you tried? It seems the solution is very straightforward, so maybe I don't understand what you are asking. [...]
#4
(Apr-25-2023, 07:16 AM)siddhi1919 Wrote: We are looking for a solution in pyspark where we can compare/match the one col4 value with entire table col3 value.
Next time you post, give it a try with some code first, to show some effort, rather than just stating the task.
Something like this:
import pandas as pd

data = {
    'Col1': [1, 2, 3, 4],
    'Col2': ['A', 'B', 'C', 'D'],
    'Col3': [101, 102, 103, 104],
    'Col4': ['arn:aws:savingsplans::104:savingsplan/f001', '', 'arn:aws:savingsplans::101:savingsplan/f002', '']
}
df = pd.DataFrame(data)

# Use regex to extract 104 and 101 from Col4
df['Col4_extracted'] = df['Col4'].str.extract(r':(\d{3}):')
# Check which Col3 values equal the id extracted from the first row (104)
match = df['Col3'] == int(df['Col4_extracted'].iloc[0])
print(df['Col3'][match])
Output:
3    104
Name: Col3, dtype: int64
Spark provides a createDataFrame(pandas_dataframe) method to convert a pandas DataFrame into a Spark DataFrame.
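The snippet above only compares the id extracted from the first row. To match every id found in Col4 against the whole of Col3, an isin check works; here is a minimal pandas sketch of that idea (in PySpark the equivalent would use regexp_extract from pyspark.sql.functions followed by a join or isin filter):

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 2, 3, 4],
    "Col2": ["A", "B", "C", "D"],
    "Col3": [101, 102, 103, 104],
    "Col4": ["arn:aws:savingsplans::104:savingsplan/f001", "",
             "arn:aws:savingsplans::101:savingsplan/f002", ""],
})

# Extract the account id that sits between "::" and ":" in the ARN, e.g. "104".
# Rows with no match (empty Col4) produce NaN and are dropped.
extracted = df["Col4"].str.extract(r"::(\d+):")[0].dropna().astype(int)

# Keep every row whose Col3 value appears among the extracted ids
matches = df[df["Col3"].isin(extracted)]
print(matches)
```

This returns the rows with Col3 equal to 101 and 104, i.e. all matches at once rather than one at a time.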