Python Forum
Pyspark dataframe - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Pyspark dataframe (/thread-39859.html)



Pyspark dataframe - siddhi1919 - Apr-24-2023

We have a requirement where we need to extract a value from a column of a dataframe and then match this extracted value against another column of the same dataframe.

To visualize this we have a data frame

Col1  Col2  Col3  Col4
1     A     101   arn:aws:savingsplans::104:savingsplan/f001
2     B     102
3     C     103   arn:aws:savingsplans::101:savingsplan/f002
4     D     104

Here we have to pick Col4, extract a value (for example 104), and match it against the entire data set of Col3.


RE: Pyspark dataframe - deanhystad - Apr-24-2023

What have you tried? It seems the solution is very straightforward, so maybe I don't understand what you are asking.

These are a couple of different ways to get a value in a dataframe
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": range(1, 5),
        "Col2": list("ABCD"),
        "Col3": range(101, 105),
        "Col4": ("stuff", None, "stuff", None),
    }
)

print(df)
# Using the row index values: 0, 1, 2, 3
print('\ndf["Col3"][3]', df["Col3"][3], sep="\n")

# Reindexing to use Col2.  Get row "D", column "Col3"
temp = df.set_index("Col2")
print('\ntemp["Col3"]["D"]', temp["Col3"]["D"], sep="\n")

# As a matrix
print("\ndf.values[3][2]", df.values[3][2], sep="\n")
   Col1 Col2  Col3   Col4
0     1    A   101  stuff
1     2    B   102   None
2     3    C   103  stuff
3     4    D   104   None

df["Col3"][3]
104

temp["Col3"]["D"]
104

df.values[3][2]
104
The first two are the same. The only difference is that in the second I changed the row index to use the values in "Col2". That lets me get the value from Col3 that has "D" in Col2.

The third one uses DataFrame.values to get the values as a 2D numpy array. Then I use integer array indexing to get the value.
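A boolean mask with .loc is another common way to do the same lookup without reindexing or dropping to a numpy array; a minimal sketch using the same df as above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": range(1, 5),
        "Col2": list("ABCD"),
        "Col3": range(101, 105),
        "Col4": ("stuff", None, "stuff", None),
    }
)

# Boolean mask: select the rows where Col2 == "D", then take Col3
value = df.loc[df["Col2"] == "D", "Col3"].iloc[0]
print(value)  # 104
```

This avoids the chained indexing of df["Col3"][3] and generalizes to masks that match more than one row.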


RE: Pyspark dataframe - siddhi1919 - Apr-25-2023

We are looking for a solution in PySpark where we can compare/match one Col4 value against the entire table's Col3 values.







RE: Pyspark dataframe - snippsat - Apr-25-2023

(Apr-25-2023, 07:16 AM)siddhi1919 Wrote: We are looking for a solution in pyspark where we can compare/match the one col4 value with entire table col3 value.
Next time you post, give it a try with some code first, to show some effort, rather than just posting the task.
Something like this:
import pandas as pd

data = {
    'Col1': [1, 2, 3, 4],
    'Col2': ['A', 'B', 'C', 'D'],
    'Col3': [101, 102, 103, 104],
    'Col4': ['arn:aws:savingsplans::104:savingsplan/f001', '', 'arn:aws:savingsplans::101:savingsplan/f002', '']
}
df = pd.DataFrame(data)

# Use a regex to extract the account ids (104 and 101) from Col4
df['Col4_extracted'] = df['Col4'].str.extract(r':(\d{3}):')
# Check whether the first extracted value (104) appears anywhere in Col3
match = df['Col3'] == int(df['Col4_extracted'].iloc[0])
print(df['Col3'][match])
Output:
3    104
Name: Col3, dtype: int64
Spark provides a createDataFrame(pandas_dataframe) method to convert pandas to Spark DataFrame.
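To match every extracted Col4 id against the entire Col3 column (not just the first one), Series.isin does the whole comparison at once; a minimal pandas sketch with the same data:

```python
import pandas as pd

data = {
    "Col1": [1, 2, 3, 4],
    "Col2": ["A", "B", "C", "D"],
    "Col3": [101, 102, 103, 104],
    "Col4": [
        "arn:aws:savingsplans::104:savingsplan/f001",
        "",
        "arn:aws:savingsplans::101:savingsplan/f002",
        "",
    ],
}
df = pd.DataFrame(data)

# Extract every account id from Col4 (rows without an ARN give NaN)
extracted = df["Col4"].str.extract(r":(\d{3}):")[0].dropna().astype(int)

# Keep the Col3 rows whose value appears among the extracted ids
matches = df[df["Col3"].isin(extracted)]
print(matches)  # rows with Col3 == 101 and Col3 == 104
```

In PySpark the same idea would use pyspark.sql.functions.regexp_extract on Col4 followed by a join (or isin) against Col3.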