pyspark sql unable to recognize SQL query command - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: pyspark sql unable to recognize SQL query command (/thread-11888.html) |
pyspark sql unable to recognize SQL query command - cpatte7372 - Jul-30-2018 down vote favorite Community, I have written the following pyspark.sql query. However, pyspark doesn't appear to recognize the SQL query 'TOP 20 PERCENT'. #%% import findspark findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7') from pyspark.sql import SparkSession spark = SparkSession.builder.appName('ops').getOrCreate() df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_Employee.csv',inferSchema=True,header=True) df.createOrReplaceTempView('HumanResources_Employee') myresults = spark.sql("""SELECT TOP 20 PERCENT NationalIDNumber ,JobTitle ,BirthDate FROM HumanResources_Employee""") myresults.show()As you can see from the results below, pyspark isn't able to recognize the number '20' Can you let me know if I have to reformat the number '20' As you can see from above the problem appear to be with the number 20. Any thoughts will be greatly appreciated. Carlton Hello Community, Can someone let me know how to get email notifications when someone has replied to my question? For some reason email notifications aren't coming through RE: pyspark sql unable to recognize SQL query command - micseydel - Jul-30-2018 Crosspost of https://stackoverflow.com/questions/51593576/pyspark-sql-unable-to-recognize-sql-query-command RE: pyspark sql unable to recognize SQL query command - micseydel - Jul-30-2018 In response to your question - I would guess that you should be using LIMIT instead of TOP .
RE: pyspark sql unable to recognize SQL query command - cpatte7372 - Jul-30-2018 Hi micseydel, Thanks for reaching out. I tried LIMIT instead of TOP, but it didn't work. I got the following error: Py4JJavaError: An error occurred while calling o19.sql. : org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '20' expecting {<EOF>, '(', ',', '.', '[', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 13) == SQL == SELECT LIMIT 20 PERCENT -------------^^^ NationalIDNumber ,JobTitle ,BirthDate FROM HumanResources_EmployeeAny further suggestions will be very much welcomed. Hello community, I don't think the problem is with using either TOP or LIMIT. I think the problem is the number 20. But I don't know how to reformat a number? Any thoughts greatly appreciated. Cheers RE: pyspark sql unable to recognize SQL query command - micseydel - Jul-30-2018 Apologies, I could have been more clear. The LIMIT clause is different from TOP, it comes at the end rather than the beginning. If I'm right, you should be reading up on your SQL dialect rather than anything specific to PySpark. That might make your testing a lot easier as well. RE: pyspark sql unable to recognize SQL query command - cpatte7372 - Jul-31-2018 micseydel, Thanks for sticking with me on this issue. I have made the code as easy as possible. The code now looks like this: df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_Employee.csv',inferSchema=True,header=True) df.registerTempTable("HumanResources_Employee") df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_Employee.csv',inferSchema=True,header=True) df.createOrReplaceTempView('HumanResources_Employee') myresults = spark.sql("""SELECT TOP 20 * FROM HumanResources_Employee""") myresults.show()I do believe the TOP keyword should work with pyspark.sql However, I'm getting the error: Py4JJavaError: An error occurred while calling o19.sql. : org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '20' expecting {<EOF>, '(', ',', '.', '[', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 11) == SQL == SELECT TOP 20 * FROM HumanResources_Employee -----------^^^Any thoughts most welcomed micseydel, I re-read your post and I think I understand what you're saying. I changed the code to; myresults = spark.sql("SELECT * FROM HumanResources_Employee LIMIT 20") And it worked. Can you show me how to show 20 percent, instead of just 20 rows? RE: pyspark sql unable to recognize SQL query command - micseydel - Jul-31-2018 (Jul-31-2018, 08:20 AM)cpatte7372 Wrote: And it worked. Can you show me how to show 20 percent, instead of just 20 rows?This isn't really a Python question, but rather a SQL question for your specific dialect. Different SQLs have different ways of doing things. You should be able to find an answer to this by Googling, but if not, you need to provide your SQL version here or we'd have to guess at the solution. |