Python Forum
pyspark sql unable to recognize SQL query command
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
pyspark sql unable to recognize SQL query command
#1
down vote
favorite

Community,
I have written the following pyspark.sql query. However, pyspark doesn't appear to recognize the SQL query 'TOP 20 PERCENT'.

#%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_Employee.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_Employee')
myresults = spark.sql("""SELECT TOP 20 PERCENT
  NationalIDNumber
 ,JobTitle
 ,BirthDate
FROM HumanResources_Employee""")
myresults.show()
As you can see from the results below, pyspark isn't able to recognize the number '20'
Can you let me know if I have to reformat the number '20'

Error:
Append ResultsClear Results --------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) ~/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw) 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: ~/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: Py4JJavaError: An error occurred while calling o19.sql. : org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '20' expecting {<EOF>, '(', ',', '.', '[', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 11) == SQL == SELECT TOP 20 PERCENT -----------^^^ NationalIDNumber ,JobTitle ,BirthDate FROM HumanResources_Employee at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748) During handling of the above exception, another exception occurred: ParseException Traceback (most recent call last) <ipython-input-74-b6642cf94ff4> in <module>() 10 ,JobTitle 11 ,BirthDate ---> 12 FROM HumanResources_Employee""") 13 myresults.show() ~/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/session.py in sql(self, sqlQuery) 539 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')] 540 """ --> 541 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) 542 543 @since(2.0) ~/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: ~/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw) 71 raise AnalysisException(s.split(': ', 1)[1], stackTrace) 72 if s.startswith('org.apache.spark.sql.catalyst.parser.ParseException: '): ---> 73 raise ParseException(s.split(': ', 1)[1], stackTrace) 74 if s.startswith('org.apache.spark.sql.streaming.StreamingQueryException: '): 75 raise StreamingQueryException(s.split(': ', 1)[1], stackTrace)
As you can see from above the problem appear to be with the number 20.
Any thoughts will be greatly appreciated.

Carlton

Hello Community,

Can someone let me know how to get email notifications when someone has replied to my question?

For some reason email notifications aren't coming through
Reply
#2
Crosspost of https://stackoverflow.com/questions/5159...ry-command
Reply
#3
In response to your question - I would guess that you should be using LIMIT instead of TOP.
Reply
#4
Hi micseydel,

Thanks for reaching out.

I tried LIMIT instead of TOP, but it didn't work.

I got the following error:

Py4JJavaError: An error occurred while calling o19.sql.
: org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input '20' expecting {<EOF>, '(', ',', '.', '[', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 13)

== SQL ==
SELECT LIMIT 20 PERCENT
-------------^^^
  NationalIDNumber
 ,JobTitle
 ,BirthDate
FROM HumanResources_Employee
Any further suggestions will be very much welcomed.

Hello community,

I don't think the problem is with using either TOP or LIMIT. I think the problem is the number 20. But I don't know how to reformat a number?

Any thoughts greatly appreciated.

Cheers
Reply
#5
Apologies, I could have been more clear. The LIMIT clause is different from TOP, it comes at the end rather than the beginning. If I'm right, you should be reading up on your SQL dialect rather than anything specific to PySpark. That might make your testing a lot easier as well.
Reply
#6
micseydel,

Thanks for sticking with me on this issue.

I have made the code as easy as possible. The code now looks like this:

df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_Employee.csv',inferSchema=True,header=True)
df.registerTempTable("HumanResources_Employee")
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_Employee.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_Employee')
myresults = spark.sql("""SELECT TOP 20 * FROM HumanResources_Employee""")
myresults.show()
I do believe the TOP keyword should work with pyspark.sql

However, I'm getting the error:

Py4JJavaError: An error occurred while calling o19.sql.
: org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input '20' expecting {<EOF>, '(', ',', '.', '[', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 11)

== SQL ==
SELECT TOP 20 * FROM HumanResources_Employee
-----------^^^
Any thoughts most welcomed

micseydel,

I re-read your post and I think I understand what you're saying. I changed the code to;

myresults = spark.sql("SELECT * FROM HumanResources_Employee LIMIT 20")

And it worked. Can you show me how to show 20 percent, instead of just 20 rows?
Reply
#7
(Jul-31-2018, 08:20 AM)cpatte7372 Wrote: And it worked. Can you show me how to show 20 percent, instead of just 20 rows?
This isn't really a Python question, but rather a SQL question for your specific dialect. Different SQLs have different ways of doing things. You should be able to find an answer to this by Googling, but if not, you need to provide your SQL version here or we'd have to guess at the solution.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  python cant recognize PIP siubikYT 2 704 Jul-19-2023, 06:30 PM
Last Post: Lahearle
  PySpark Coding Challenge cpatte7372 4 5,959 Jun-25-2023, 12:56 PM
Last Post: prajwal_0078
  Pyspark dataframe siddhi1919 3 1,183 Apr-25-2023, 12:39 PM
Last Post: snippsat
  pyspark help lokesh 0 736 Jan-03-2023, 04:34 PM
Last Post: lokesh
  Is it possible to make a program recognize how many clicks it has had on the monitor? jao 0 1,126 Feb-25-2022, 06:31 PM
Last Post: jao
  How to iterate Groupby in Python/PySpark DrData82 2 2,705 Feb-05-2022, 09:59 PM
Last Post: DrData82
  PySpark Equivalent Code cpatte7372 0 1,227 Jan-14-2022, 08:59 PM
Last Post: cpatte7372
  My program won't recognize the filename. braingoblins 1 1,095 Jan-07-2022, 06:18 PM
Last Post: deanhystad
  Pyspark - my code works but I want to make it better Kevin 1 1,745 Dec-01-2021, 05:04 AM
Last Post: Kevin
  pyspark parallel write operation not working aliyesami 1 1,651 Oct-16-2021, 05:18 PM
Last Post: aliyesami

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020