Databricks Community

shelly · 03-28-2023

x=[1,2,3,4,5,6,7]

rdd = sc.parallelize(x)

print(rdd.take(2))

Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/spark/python/pyspark/cloudpickle/cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range

pvignesh92 · 03-29-2023

Hi @Shelly Bhardwaj, this should work. Can you restart your Jupyter terminal, run it again, and check?

Anonymous · 04-01-2023

@Shelly Bhardwaj:

The error IndexError: tuple index out of range is raised inside PySpark's bundled cloudpickle while it pickles the Python function that take() sends to the workers (note _function_reduce and _extract_code_globals in the traceback). It is not caused by the size of the data; seven integers are far too few to hit any limit. The ^^^^ markers in your traceback are only printed by Python 3.11 and later, and PySpark did not support Python 3.11 until version 3.4.0. The cloudpickle copy shipped with earlier releases misreads Python 3.11 bytecode, which is the most likely cause here. Upgrading PySpark to 3.4.0 or later should fix it.

Changing spark.serializer will not help. That setting selects the JVM-side serializer (org.apache.spark.serializer.JavaSerializer or org.apache.spark.serializer.KryoSerializer), and there is no org.apache.spark.serializer.PickleSerializer class. After upgrading, a plain SparkContext is enough:

from pyspark import SparkConf, SparkContext

# Needs PySpark 3.4.0 or later when running on Python 3.11
conf = SparkConf().setAppName("myApp")
sc = SparkContext(conf=conf)

x = [1, 2, 3, 4, 5, 6, 7]
rdd = sc.parallelize(x)
print(rdd.take(2))  # [1, 2]

Alternatively, if you cannot upgrade PySpark, run the job with Python 3.10 or earlier.
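Since the ^^^^ markers in the traceback point to Python 3.11 or later, and PySpark added Python 3.11 support in 3.4.0, it may be worth confirming which versions the notebook is actually using. Here is a minimal sketch of such a check. The helper names parse_major_minor, is_affected and check_environment are made up for this example and are not part of PySpark, and the 3.11 / 3.4 thresholds are the assumption being tested:

```python
import sys


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '3.3.2'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_affected(python_version, pyspark_version):
    """True for Python 3.11+ combined with a PySpark release older than 3.4.

    Assumption: these are the versions where the bundled cloudpickle fails.
    """
    runs_py311_or_later = tuple(python_version[:2]) >= (3, 11)
    pyspark_before_34 = parse_major_minor(pyspark_version) < (3, 4)
    return runs_py311_or_later and pyspark_before_34


def check_environment():
    """Describe the current interpreter and PySpark combination."""
    try:
        import pyspark  # imported lazily so the helpers work without PySpark
    except ImportError:
        return "PySpark is not importable from this interpreter"
    major, minor = sys.version_info[:2]
    affected = is_affected((major, minor), pyspark.__version__)
    status = "affected" if affected else "not affected"
    return f"Python {major}.{minor}, PySpark {pyspark.__version__}: {status}"


print(check_environment())
```

Run it in the same kernel that produced the traceback, since a notebook can use a different interpreter than the one on your shell PATH.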

Anonymous · 04-03-2023

Hi @Shelly Bhardwaj

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!