Python Forum
map without a function error
#1
Hi

I am using pyspark, and whenever I attempt to use a comprehension it generates an error. At first I thought the issue was pyspark specific, but I think I am not coding the comprehension correctly.

The following works as expected in pyspark:
def reGrpLst(fw_c):
    fw, c = fw_c
    f, w = fw
    return (f, [(w, c)])

f_wcL_RDD = fw_c_RDD.map(reGrpLst)
but when I attempt to rewrite using a comprehension as either
f_wcL_RDD = fw_c_RDD.map(lambda fw_c: [ (fw[0],  (fw[1] ,fw_c[1]) ) for fw in fw_c[0] ] )
or
f_wcL_RDD = fw_c_RDD.map(lambda fw_c: [ (f,  (c ,fw_c[1]) ) for f,c in fw_c[0] ] )
a pyspark error is generated. Are the comprehensions, in both of the above cases, incorrect?


Any suggestions would be helpful.
Thanks
#2
The named function you're using doesn't have a comprehension in it; the lambdas you've written do. You seem to expect them to be equivalent, but they're not. This looks like a regular Python issue rather than anything specific to pyspark. It would have helped if you had provided the error you got, but here's a demonstration that your lambdas behave differently from your function:
>>> def reGrpLst(fw_c):
   fw,c = fw_c
   f,w = fw
   return (f,[(w,c)])

>>> reGrpLst([[1, 2], 3])
(1, [(2, 3)])
>>> 
>>> (lambda fw_c: [ (fw[0],  (fw[1] ,fw_c[1]) ) for fw in fw_c[0] ])([[1, 2], 3])

Traceback (most recent call last):
 File "<pyshell#4>", line 1, in <module>
   (lambda fw_c: [ (fw[0],  (fw[1] ,fw_c[1]) ) for fw in fw_c[0] ])([[1, 2], 3])
 File "<pyshell#4>", line 1, in <lambda>
   (lambda fw_c: [ (fw[0],  (fw[1] ,fw_c[1]) ) for fw in fw_c[0] ])([[1, 2], 3])
TypeError: 'int' object has no attribute '__getitem__'
>>> 
>>> (lambda fw_c: [ (f,  (c ,fw_c[1]) ) for f,c in fw_c[0] ] )([[1, 2], 3])

Traceback (most recent call last):
 File "<pyshell#8>", line 1, in <module>
   (lambda fw_c: [ (f,  (c ,fw_c[1]) ) for f,c in fw_c[0] ] )([[1, 2], 3])
 File "<pyshell#8>", line 1, in <lambda>
   (lambda fw_c: [ (f,  (c ,fw_c[1]) ) for f,c in fw_c[0] ] )([[1, 2], 3])
TypeError: 'int' object is not iterable
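For reference, here's one way (plain Python, not tested against your RDD) to write a lambda that does match reGrpLst: index into the nested tuple instead of looping over it with a comprehension.

```python
# Same logic as reGrpLst, but as a lambda: index into the nested
# tuple instead of iterating over it with a comprehension.
reGrpLst_lambda = lambda fw_c: (fw_c[0][0], [(fw_c[0][1], fw_c[1])])

print(reGrpLst_lambda([[1, 2], 3]))  # (1, [(2, 3)])
```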
#3
thanks for the reply. I thought that "map" iterates over the RDD and applies the function to each element, so I assumed the lambda could replace the function. I am new to both pyspark and lambdas, so perhaps I need to do a bit more reading before I fully understand how to replace a function with a lambda.

the following output occurs with using map(reGrpLst):
[('file:/home/big_data/code/files/mansfield_park.txt', [23500, 17735, 9735, 16784, 14154, 16389, 12905, 27261, 7562, 17959]), ('file:/home/big_data/code/files/kjv.txt', [106189, 109173, 71421, 88498, 69612, 96175, 53168, 167502, 51475, 77898]), ('file:/home/big_data/code/files/hamlet.txt', [3941, 3298, 2460, 3922, 3227, 3581, 2671, 4767, 1974, 3211])]
whereas this is the error generated with the comprehensions:
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-7-0c76eff16ea6> in <module>()
     46 f_wcL2_RDD = f_wcL_RDD.reduceByKey(add) #<<< create [(w,c), ... ,(w,c)] lists per file
     47 f_wVec_RDD = f_wcL2_RDD.map(lambda f_wc: (f_wc[0],hashing_vectorizer(f_wc[1],N)))
---> 48 print(f_wVec_RDD.top(3))
Apologies for not posting the error to begin with, but as you can see, the error does not occur on the map(lambda) itself, but rather on the print statement that follows.

thanks for replying.
#4
(Feb-18-2017, 07:43 PM)bluefrog Wrote: I thought that a "map" iterates over the RDD and the function is applied to each element.
Yes. More generally, map() applies a function to all elements of an iterable.
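For example, with the built-in map() (in Python 3, where it returns a lazy iterator you can wrap in list()):

```python
# map() applies the given function to every element of the iterable.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]
```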

(Feb-18-2017, 07:43 PM)bluefrog Wrote: hence, I assumed the lambda replaces the function
If the lambdas evaluated to the same thing as the function, then they could replace it. But they don't. Here's another refactoring:
import traceback

def reGrpLst(fw_c): 
    fw,c = fw_c
    f,w = fw
    return (f,[(w,c)])

def reGrpLst_firstLambda(fw_c):
    return [(fw[0], (fw[1], fw_c[1])) for fw in fw_c[0]]

def reGrpLst_secondLambda(fw_c):
    return [(f, (c, fw_c[1])) for f, c in fw_c[0]]

INPUT_LIST = [((1, 2), 3)]

functions = (reGrpLst, reGrpLst_firstLambda, reGrpLst_secondLambda)
for f in functions:
    print "Trying", f
    try:
        print map(f, INPUT_LIST)
    except:
        traceback.print_exc()
    print
Output:
Trying <function reGrpLst at 0x7fa960914578>
[(1, [(2, 3)])]

Trying <function reGrpLst_firstLambda at 0x7fa9609145f0>
Traceback (most recent call last):
  File "testit.py", line 20, in <module>
    print map(f, INPUT_LIST)
  File "testit.py", line 9, in reGrpLst_firstLambda
    return [(fw[0], (fw[1], fw_c[1])) for fw in fw_c[0]]
TypeError: 'int' object has no attribute '__getitem__'

Trying <function reGrpLst_secondLambda at 0x7fa960914668>
Traceback (most recent call last):
  File "testit.py", line 20, in <module>
    print map(f, INPUT_LIST)
  File "testit.py", line 12, in reGrpLst_secondLambda
    return [(f, (c, fw_c[1])) for f, c in fw_c[0]]
TypeError: 'int' object is not iterable
This has nothing to do with pyspark. I used Python's built-in map() here instead of the pyspark one. Don't worry about pyspark until you've figured this out with regular Python, since pyspark is a complicating factor.

A lambda is generally just like a regular function. Above, I turned your lambdas into full functions. Can you see that the full functions aren't all the same?
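To make "a lambda is just like a regular function" concrete, here is a small sketch (plain Python, made-up sample data) showing a lambda and a def that are interchangeable:

```python
# A lambda and an equivalent def: both are ordinary function objects.
regroup_lambda = lambda fw_c: (fw_c[0][0], [(fw_c[0][1], fw_c[1])])

def regroup_def(fw_c):
    fw, c = fw_c
    f, w = fw
    return (f, [(w, c)])

sample = (('file.txt', 'word'), 7)
print(regroup_lambda(sample) == regroup_def(sample))  # True
```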

Comprehensions are an intermediate-level Python feature. They're essentially syntactic sugar for maps. If you do a map with a lambda that itself does a comprehension, you get a nested mapping, i.e. a nested loop. If you keep struggling with this, I highly recommend just not using a comprehension. They're not strictly necessary. Stick to simpler code, and tackle comprehensions again later on.
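As a sketch of that sugar relationship (plain Python 3, made-up sample data), a comprehension over an iterable and a map over the same iterable produce the same result:

```python
data = [(('f1', 'w1'), 3), (('f2', 'w2'), 5)]

# A comprehension over the whole iterable...
via_comp = [(f, [(w, c)]) for (f, w), c in data]

# ...is equivalent to mapping a function over each element.
via_map = list(map(lambda fw_c: (fw_c[0][0], [(fw_c[0][1], fw_c[1])]), data))

print(via_comp == via_map)  # True
```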

