Data Engineering

take() operation throwing index out of range error

shelly
New Contributor

x = [1, 2, 3, 4, 5, 6, 7]
rdd = sc.parallelize(x)
print(rdd.take(2))

Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range


pvignesh92
Honored Contributor

Hi @Shelly Bhardwaj, this should work. Can you restart your Jupyter kernel, run the code again, and check?

Anonymous
Not applicable

@Shelly Bhardwaj​ :

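The error IndexError: tuple index out of range is raised while PySpark's bundled cloudpickle library serializes the function that take() ships to the workers, not while it serializes the data, so the size of the RDD is not the cause. The traceback ends in _extract_code_globals, which inspects the function's bytecode. This failure is typically seen when the Python interpreter is newer than the installed PySpark release supports, for example Python 3.11 with a PySpark release older than 3.4. The caret markers in the traceback are printed by Python 3.11 and later, which is consistent with that explanation.

Changing spark.serializer does not help here: that setting selects the JVM-side serializer (for example org.apache.spark.serializer.KryoSerializer), and there is no org.apache.spark.serializer.PickleSerializer class. Instead, upgrade PySpark to a release that supports your Python version, or run PySpark on an older Python. Once the versions are compatible, the original code should need no special configuration:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("myApp")
sc = SparkContext(conf=conf)

x = [1, 2, 3, 4, 5, 6, 7]
rdd = sc.parallelize(x)
print(rdd.take(2))

Filtering or repartitioning the data will not change the outcome, because the failure happens before any data is processed. If you cannot upgrade PySpark, you can point it at an older interpreter by setting the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables before starting the session.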
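Because the traceback above fails inside cloudpickle while it inspects function bytecode, it may be worth checking which Python and PySpark versions are in use before changing any configuration. The following is only a rough sketch: the helper names parse_version and likely_incompatible are made up for illustration, and the threshold (Python 3.11 needing PySpark 3.4 or later) is an assumption that should be verified against the Spark release notes for your setup.

```python
import sys

# Assumption (verify against the Spark release notes): Python 3.11 support
# was added in PySpark 3.4.
MIN_PYSPARK_FOR_PY311 = (3, 4)


def parse_version(text):
    """Turn a version string such as '3.3.2' into the tuple (3, 3)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def likely_incompatible(python_version, pyspark_version):
    """Return True when Python is 3.11+ but PySpark is older than 3.4."""
    too_new_python = tuple(python_version) >= (3, 11)
    too_old_pyspark = parse_version(pyspark_version) < MIN_PYSPARK_FOR_PY311
    return too_new_python and too_old_pyspark


if __name__ == "__main__":
    try:
        import pyspark
    except ImportError:
        print("pyspark is not installed in this environment")
    else:
        print("Python :", sys.version.split()[0])
        print("PySpark:", pyspark.__version__)
        if likely_incompatible(sys.version_info[:2], pyspark.__version__):
            print("Likely mismatch: upgrade PySpark or use an older Python")
```

Running this in the same notebook or shell that produced the error shows the versions the driver actually uses. If it reports a likely mismatch, aligning the versions is a more promising first step than changing serializer settings.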
