Python Forum
PySpark dataframe
#1
We have a requirement where we need to extract a value from one column of a DataFrame and then match that extracted value against another column of the same DataFrame.

To visualize this, we have a DataFrame:

Col1 Col2 Col3 Col4
1    A    101  arn:aws:savingsplans::104:savingsplan/f001
2    B    102
3    C    103  arn:aws:savingsplans::101:savingsplan/f002
4    D    104

Here we have to take Col4, extract a value (for example 104), and match it against the entire Col3 column.
#2
What have you tried? It seems the solution is very straightforward, so maybe I don't understand what you are asking.

These are a couple of different ways to get a value in a DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": range(1, 5),
        "Col2": list("ABCD"),
        "Col3": range(101, 105),
        "Col4": ("stuff", None, "stuff", None),
    }
)

print(df)
# Using the row index values: 0, 1, 2, 3
print('\ndf["Col3"][3]', df["Col3"][3], sep="\n")

# Reindexing to use Col2.  Get row "D", column "Col3"
temp = df.set_index("Col2")
print('\ntemp["Col3"]["D"]', temp["Col3"]["D"], sep="\n")

# As a matrix
print("\ndf.values[3][2]", df.values[3][2], sep="\n")
Output:
   Col1 Col2  Col3   Col4
0     1    A   101  stuff
1     2    B   102   None
2     3    C   103  stuff
3     4    D   104   None

df["Col3"][3]
104

temp["Col3"]["D"]
104

df.values[3][2]
104
The first two are the same. The only difference is that in the second I changed the row index to use the values in "Col2". That lets me get the value from Col3 in the row that has "D" in Col2.

The third one uses DataFrame.values to get the values as a 2D numpy array. Then I use integer array indexing to get the value.
#3
We are looking for a solution in PySpark where we can compare/match a value extracted from Col4 against the entire Col3 column of the table.





(Apr-24-2023, 09:17 PM)deanhystad Wrote: What have you tried? It seems the solution is very straightforward, so maybe I don't understand what you are asking. [...]
#4
(Apr-25-2023, 07:16 AM)siddhi1919 Wrote: We are looking for a solution in pyspark where we can compare/match the one col4 value with entire table col3 value.
Next time you post, give it a try with some code first, to show some effort, rather than just stating the task.
Something like this:
import pandas as pd

data = {
    'Col1': [1, 2, 3, 4],
    'Col2': ['A', 'B', 'C', 'D'],
    'Col3': [101, 102, 103, 104],
    'Col4': ['arn:aws:savingsplans::104:savingsplan/f001', '', 'arn:aws:savingsplans::101:savingsplan/f002', '']
}
df = pd.DataFrame(data)

# Use regex to extract 104 and 101 from Col4
df['Col4_extracted'] = df['Col4'].str.extract(r':(\d{3}):')
# Check which Col3 values equal the id extracted from the first row (104)
match = df['Col3'] == int(df['Col4_extracted'].iloc[0])
print(df['Col3'][match])
Output:
3    104
Name: Col3, dtype: int64
Spark provides a createDataFrame(pandas_dataframe) method to convert a pandas DataFrame into a Spark DataFrame.
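The snippet above only compares the id extracted from the first row. To match every id found in Col4 against the whole of Col3, an isin check works; here is a minimal pandas sketch of that idea (in PySpark the equivalent would use regexp_extract from pyspark.sql.functions followed by a join or isin filter):

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 2, 3, 4],
    "Col2": ["A", "B", "C", "D"],
    "Col3": [101, 102, 103, 104],
    "Col4": ["arn:aws:savingsplans::104:savingsplan/f001", "",
             "arn:aws:savingsplans::101:savingsplan/f002", ""],
})

# Extract the account id that sits between "::" and ":" in the ARN, e.g. "104".
# Rows with no match (empty Col4) produce NaN and are dropped.
extracted = df["Col4"].str.extract(r"::(\d+):")[0].dropna().astype(int)

# Keep every row whose Col3 value appears among the extracted ids
matches = df[df["Col3"].isin(extracted)]
print(matches)
```

This returns the rows with Col3 equal to 101 and 104, i.e. all matches at once rather than one at a time.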