Jul-11-2019, 12:41 AM
Hello. Any ideas what's going on here? Full disclosure: I'm relatively new to Python/Spark programming.
I'm using the following code in Spark to find the fourgrams common to two Spark dataframes, df_grouped_s and df_grouped_c:
1) dfIntersectC_S = df_grouped_c.select('unique_fourgrams_grouped_c').intersect(df_grouped_s.select('unique_fourgrams_grouped_s'))
2) Next, I attempt to write the dfIntersectC_S dataframe out to a CSV file:
dfIntersectC_S.write.csv('/collab/crisk/nlpta_poc/workspace/fourgrams_caap_sar.csv')
I receive an error message:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o22720.csv.
: java.lang.UnsupportedOperationException: CSV data source does not support array<array<string>> data type.
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.org$apache$spark$sql$execution$datasources$csv$CSVFileFormat$$verifyType$1(CSVFileFormat.scala:233)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$verifySchema$1.apply(CSVFileFormat.scala:237)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$verifySchema$1.apply(CSVFileFormat.scala:237)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:96)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.verifySchema(CSVFileFormat.scala:237)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.prepareWrite(CSVFileFormat.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
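For reference, the trace points at the cause: the intersected column is still a nested array<array<string>>, and Spark's CSV writer only accepts flat, atomic column types. One workaround (a sketch, assuming the column is named unique_fourgrams_grouped_c as in the post) is to flatten each row's nested array into a single delimited string before writing; the flattening logic itself is plain Python and could be wrapped in a Spark UDF:

```python
# A CSV cell must be a single scalar value, so the nested
# array<array<string>> column has to be flattened to a string first.

def flatten_fourgrams(nested):
    """Join each fourgram's tokens with spaces, and the fourgrams with '|'."""
    return "|".join(" ".join(gram) for gram in nested)

# Hypothetical Spark usage (column/dataframe names assumed from the post):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   flatten_udf = udf(flatten_fourgrams, StringType())
#   dfIntersectC_S = dfIntersectC_S.withColumn(
#       'unique_fourgrams_grouped_c',
#       flatten_udf('unique_fourgrams_grouped_c'))
#   dfIntersectC_S.write.csv('/collab/crisk/nlpta_poc/workspace/fourgrams_caap_sar.csv')

print(flatten_fourgrams([["a", "b", "c", "d"], ["e", "f", "g", "h"]]))
# → a b c d|e f g h
```

Alternatively, if CSV isn't a hard requirement, a format that supports nested types avoids the problem entirely, e.g. dfIntersectC_S.write.parquet(...) or dfIntersectC_S.write.json(...).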