Jan-14-2022, 08:59 PM
Hello Community,
I have coded the following logic in SQL:
Join very_large_dataframe to small_product_dimension_dataframe on column [B]
Only include records from small_product_dimension_dataframe where O is greater than 10
Keep only column [P]
SELECT
    small_product_dimension_dataframe.P
FROM dbo.small_product_dimension_dataframe
INNER JOIN dbo.very_large_dataframe
    ON small_product_dimension_dataframe.B = very_large_dataframe.B
WHERE small_product_dimension_dataframe.O > 10
I would like help with the equivalent code in PySpark.
I have made a start with the following:
df = very_large_dataframe.join(
    small_product_dimension_dataframe,
    very_large_dataframe.B == small_product_dimension_dataframe.B
)

I would like help amending the PySpark to keep only column P and to apply the filter small_product_dimension_dataframe.O > 10.
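From reading the DataFrame API docs, I believe the missing pieces are .where() and .select(); my guess at the full equivalent is below. The broadcast() hint is my own addition, on the assumption that the product dimension table is small enough to fit in executor memory:

from pyspark.sql.functions import broadcast

# Inner join on B; broadcast the small dimension table so the
# very large table does not need to be shuffled.
df = (
    very_large_dataframe
    .join(
        broadcast(small_product_dimension_dataframe),
        very_large_dataframe.B == small_product_dimension_dataframe.B,
        "inner",
    )
    # WHERE small_product_dimension_dataframe.O > 10
    .where(small_product_dimension_dataframe.O > 10)
    # Keep only column P
    .select(small_product_dimension_dataframe.P)
)

Is that correct, or would it be better to apply the O > 10 filter to the dimension table before the join?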