05-24-2024 08:27 AM
Hello,
I have a quick question. If my source code calls PySpark's collect() or any RDD-related method, pytest on my local PC reports the error below. My local machine doesn't have any specific PySpark configuration; I'm using the findspark package. If you know the solution, it would be greatly appreciated. Thanks.
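For context, `create_spark_obj` essentially boils down to the following (a minimal sketch of my setup; the sample action at the end is just for illustration):

```python
import findspark
findspark.init()  # locate the pyspark installation inside the venv

from pyspark.sql import SparkSession

# Same builder call as supplementary_utils.create_spark_obj() in the trace below
spark = SparkSession.builder.appName("ABC").getOrCreate()

# Any RDD-backed action like this triggers the error
spark.range(5).rdd.collect()
```

The pytest traceback: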
src\tests\test_calculate_psi_for_each_column.py:14: in <module>
spark = spu.create_spark_obj()
src\prototype\supplementary_utils.py:959: in create_spark_obj
spark = SparkSession.builder.appName("ABC").getOrCreate()
venv\lib\site-packages\pyspark\sql\session.py:269: in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
venv\lib\site-packages\pyspark\context.py:483: in getOrCreate
SparkContext(conf=conf or SparkConf())
venv\lib\site-packages\pyspark\context.py:197: in __init__
self._do_init(
venv\lib\site-packages\pyspark\context.py:282: in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
venv\lib\site-packages\pyspark\context.py:402: in _initialize_context
return self._jvm.JavaSparkContext(jconf)
venv\lib\site-packages\py4j\java_gateway.py:1585: in __call__
return_value = get_return_value(
venv\lib\site-packages\py4j\protocol.py:326: in get_return_value
raise Py4JJavaError(
E py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
E : java.lang.ExceptionInInitializerError
E at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:56)
E at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264)
E at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254)
E at org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273)
E at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
E at scala.Option.getOrElse(Option.scala:189)
E at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:273)
E at org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58)
E at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207)
E at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320)
E at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194)
E at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279)
E at org.apache.spark.SparkContext.<init>(SparkContext.scala:464)
E at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
E at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
E at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
E at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E at py4j.Gateway.invoke(Gateway.java:238)
E at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
E at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
E at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
E at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
E at java.base/java.lang.Thread.run(Thread.java:1570)
E Caused by: java.lang.IllegalStateException: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int)
E at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:113)
E ... 25 more
E Caused by: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int)
E at java.base/java.lang.Class.getConstructor0(Class.java:3784)
E at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2955)
E at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:71)
E ... 25 more
05-24-2024 10:46 AM
Hi,
The error message and stack trace don't seem to suggest that this is a pytest failure per se. If that's true, can you please try to replicate what's being invoked from `src\tests\test_calculate_psi_for_each_column.py` in a plain Python REPL?
And separately, have you confirmed that the pyspark installation itself was successful? Depending on how you installed pyspark, are you able to start pyspark on its own (independent of `findspark`), e.g. via `$SPARK_HOME/bin/pyspark`?
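For example, something like this in a bare REPL inside the same venv (no pytest, no findspark) would show whether session creation itself is what fails; the names are taken from your trace:

```python
from pyspark.sql import SparkSession

# Same call that supplementary_utils.create_spark_obj() makes, per the stack trace
spark = SparkSession.builder.appName("ABC").getOrCreate()
print(spark.version)
print(spark.range(5).collect())
```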
Thanks.
05-27-2024 06:43 PM
When I run pyspark directly, it fails:
C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\Lib\site-packages\pyspark\bin\pyspark
Python 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
24/05/27 20:40:38 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/27 20:40:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/05/27 20:40:40 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext should be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
py4j.ClientServerConnection.run(ClientServerConnection.java:106)
java.base/java.lang.Thread.run(Thread.java:1570)
C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\Lib\site-packages\pyspark\bin\..\python\pyspark\shell.py:44: UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\Lib\site-packages\pyspark\bin\..\python\pyspark\shell.py", line 39, in <module>
spark = SparkSession._create_shell_session()
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\lib\site-packages\pyspark\sql\session.py", line 677, in _create_shell_session
return SparkSession._getActiveSessionOrCreate()
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\lib\site-packages\pyspark\sql\session.py", line 693, in _getActiveSessionOrCreate
spark = builder.getOrCreate()
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\lib\site-packages\pyspark\sql\session.py", line 269, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\lib\site-packages\pyspark\context.py", line 483, in getOrCreate SparkContext(conf=conf or SparkConf())
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\lib\site-packages\pyspark\context.py", line 197, in __init__
self._do_init(
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\lib\site-packages\pyspark\context.py", line 282, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\lib\site-packages\pyspark\context.py", line 402, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\Lib\site-packages\pyspark\python\lib\py4j-0.10.9.5-src.zip\py4j\java_gateway.py", line 1585, in __call__
return_value = get_return_value(
File "C:\Users\HXI1\hx_scripts\core-globaldata-dpm-mlops-pipeline-1\venv\Lib\site-packages\pyspark\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.unsafe.array.ByteArrayMethods
at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264)
at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254)
at org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:273)
at org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:464)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.ExceptionInInitializerError [in thread "Thread-2"]
at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:56)
... 24 more
ERROR: The process with PID 33248 (child process of PID 43512) could not be terminated.
Reason: Access is denied.
SUCCESS: The process with PID 43512 (child process of PID 14852) has been terminated.
SUCCESS: The process with PID 14852 (child process of PID 42180) has been terminated.
05-27-2024 09:07 PM
I don't personally have any experience running Spark on Windows. Can you please review the wiki article referenced in the WARN message to see if it helps you complete the installation successfully? Or alternatively, could you consider running the tests on Databricks if you continue having issues with the Windows setup?
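Separately, the `java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int)` in your first trace usually indicates a JVM newer than your Spark build supports (that internal constructor changed in recent JDKs), so comparing `java -version` against the Java versions your PySpark release supports may also be worth doing.

If you still want to pursue the Windows route, my understanding from that wiki article (I haven't verified this myself) is that you need a winutils.exe matching your Hadoop build, with `HADOOP_HOME` set before the JVM starts. A rough sketch with hypothetical paths:

```python
import os

# Hypothetical location; winutils.exe must sit in %HADOOP_HOME%\bin
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = r"C:\hadoop\bin" + os.pathsep + os.environ["PATH"]

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ABC").getOrCreate()
```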
Thanks.
05-30-2024 10:00 AM
Thank you very much, brockb. I will probably try it on Databricks. Thanks.