Feb-17-2022, 12:18 PM
Hi guys!
I created a Spark DataFrame:
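from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# All four columns are declared non-nullable
schema = StructType([
    StructField("escolaridade", StringType(), False),
    StructField("estado_civil", StringType(), False),
    StructField("salario", DoubleType(), False),
    StructField("total_acessos", IntegerType(), False)
])
df = spark.createDataFrame(pd_df, schema)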
where pd_df is a pandas DataFrame.
I then wrap the function below in a UDF and apply it to the escolaridade column:
def v_col_escola(s):
    if s == 'Basico':
        return 0.0
    elif s == 'Graduacao':
        return 1.0
    else:
        return -1.0

rot = UserDefinedFunction(v_col_escola, DoubleType())
ldata = df.select(rot(col('escolaridade')).alias('escolaridade'), col('estado_civil')).where('escolaridade >= 0')
When I try to read the new DataFrame (ldata), I get the following error:
ldata.take(1)
Py4JJavaError Traceback (most recent call last)
<ipython-input-102-83ca7fdb585c> in <module>
----> 1 labeledData.take(1)
~\anaconda3\envs\curso_pandas\lib\site-packages\pyspark\sql\dataframe.py in take(self, num)
502 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
503 """
--> 504 return self.limit(num).collect()
505
506 @since(1.3)
~\anaconda3\envs\curso_pandas\lib\site-packages\pyspark\sql\dataframe.py in collect(self)
464 """
465 with SCCallSiteSync(self._sc) as css:
--> 466 sock_info = self._jdf.collectToPython()
467 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
Can anyone help with this? I am using Jupyter in Anaconda 2.1.1...
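For what it's worth, the mapping function can be checked on its own in plain Python, without Spark, which may help narrow down whether the problem is in the function or in how Spark runs it. This is only a minimal sketch: the sample values below are made up for illustration and are not taken from the real data.

```python
def v_col_escola(s):
    """Map an education level to a numeric label; anything else becomes -1.0."""
    if s == 'Basico':
        return 0.0
    elif s == 'Graduacao':
        return 1.0
    else:
        return -1.0

# Hypothetical sample inputs (assumption: not from the real pd_df)
samples = ['Basico', 'Graduacao', 'Mestrado', None]
labels = [v_col_escola(s) for s in samples]
print(labels)
```

If this runs cleanly, the function logic itself is probably not what raises the Py4JJavaError.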