Thanks for the reply. I thought that map iterates over the RDD and applies the function to each element, so I assumed the lambda simply replaces the named function. I am new to both PySpark and lambdas, so I may need to do a bit more reading before I fully understand how to replace the function with a lambda.
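For what it's worth, here is a minimal sketch of that equivalence using Python's built-in map (the names double and nums are just illustrative; the same idea applies to RDD.map in PySpark):

```python
def double(x):
    # a named function...
    return x * 2

nums = [1, 2, 3]

with_func = list(map(double, nums))             # map with the named function
with_lambda = list(map(lambda x: x * 2, nums))  # ...replaced by a lambda

print(with_func)    # [2, 4, 6]
print(with_lambda)  # [2, 4, 6]
```

Both calls produce the same result; the lambda is just an inline, anonymous version of the named function.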
the following output occurs with using map(reGrpLst):
[('file:/home/big_data/code/files/mansfield_park.txt', [23500, 17735, 9735, 16784, 14154, 16389, 12905, 27261, 7562, 17959]), ('file:/home/big_data/code/files/kjv.txt', [106189, 109173, 71421, 88498, 69612, 96175, 53168, 167502, 51475, 77898]), ('file:/home/big_data/code/files/hamlet.txt', [3941, 3298, 2460, 3922, 3227, 3581, 2671, 4767, 1974, 3211])]

whereas this is the error generated with the comprehensions:
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-7-0c76eff16ea6> in <module>()
     46 f_wcL2_RDD = f_wcL_RDD.reduceByKey(add) #<<< create [(w,c), ... ,(w,c)] lists per file
     47 f_wVec_RDD = f_wcL2_RDD.map(lambda f_wc: (f_wc[0],hashing_vectorizer(f_wc[1],N)))
---> 48 print(f_wVec_RDD.top(3))

Apologies for not posting the error to begin with, but as you can see the error does not occur on the map(lambda) line itself, but rather on the print statement that follows.
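That location makes sense if Spark's laziness is the reason: transformations like map() don't run anything when they are defined, so a bug inside the mapped function only surfaces when an action such as top() forces evaluation. A plain-Python analogy using a generator (bad_transform is a hypothetical name, not from the original job) shows the same deferred-error behaviour:

```python
def bad_transform(items):
    # lazily yields a value per item; subscripting an int will fail
    for x in items:
        yield x["missing_key"]

lazy = bad_transform([1, 2, 3])   # no error here -- nothing has executed yet

try:
    list(lazy)                    # consuming the generator triggers the error
except TypeError as e:
    print("error raised only at consumption:", e)
```

So in the Spark job, the real fault is likely inside the lambda (or hashing_vectorizer), even though the traceback points at the print(f_wVec_RDD.top(3)) line where evaluation actually happens.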