<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic from pyspark.ml.stat import KolmogorovSmirnovTest is not working on Serverless compute. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133319#M49798</link>
    <pubDate>Tue, 30 Sep 2025 07:13:42 GMT</pubDate>
    <dc:creator>parthesh24</dc:creator>
    <dc:date>2025-09-30T07:13:42Z</dc:date>
    <item>
      <title>from pyspark.ml.stat import KolmogorovSmirnovTest is not working on Serverless compute.</title>
      <link>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133319#M49798</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I am trying to run a Kolmogorov–Smirnov (KS) test on a Spark DataFrame column in Databricks using the built-in pyspark.ml.stat.KolmogorovSmirnovTest. The goal is to apply the KS test directly on Spark DataFrame data without converting it into Pandas or NumPy.&lt;/P&gt;&lt;P&gt;Here’s the snippet I’m using:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.ml.stat import KolmogorovSmirnovTest

result = KolmogorovSmirnovTest.test(df, "value", "norm", 0.0, 1.0).collect()[0]
print(result.statistic, result.pValue)&lt;/LI-CODE&gt;&lt;P&gt;And the error I get is:&lt;/P&gt;&lt;PRE&gt;AssertionError:
File &amp;lt;command-8698289323550994&amp;gt;, line 26
     23 else:
     24     return df.sparkSession.createDataFrame([], "column STRING, statistic DOUBLE, pValue DOUBLE")
---&amp;gt; 26 out = summarize_normality_numeric_columns_test(df=spark_df, table_name="sample")
     27 display(out)

File /databricks/python/lib/python3.12/site-packages/pyspark/ml/stat.py:249, in KolmogorovSmirnovTest.test(dataset, sampleCol, distName, *params)
    246 from pyspark.core.context import SparkContext
    248 sc = SparkContext._active_spark_context
--&amp;gt; 249 assert sc is not None
    251 javaTestObj = getattr(_jvm(), "org.apache.spark.ml.stat.KolmogorovSmirnovTest")
    252 dataset = _py2java(sc, dataset)&lt;/PRE&gt;&lt;P&gt;It seems like the KolmogorovSmirnovTest module isn’t supported in Serverless compute, or it behaves differently compared to standard clusters.&lt;/P&gt;&lt;P&gt;Has anyone faced this issue? Is KS test currently unsupported in Databricks Serverless, or is there a workaround?&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Tue, 30 Sep 2025 07:13:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133319#M49798</guid>
      <dc:creator>parthesh24</dc:creator>
      <dc:date>2025-09-30T07:13:42Z</dc:date>
    </item>
    <item>
      <title>Re: from pyspark.ml.stat import KolmogorovSmirnovTest is not working on Serverless compute.</title>
      <link>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133326#M49800</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/187879"&gt;@parthesh24&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;It looks more like the&amp;nbsp;&lt;SPAN&gt;KolmogorovSmirnovTest module is trying to access SparkContext under the hood, which is not supported on serverless compute.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1759219675962.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20303i5F9987CFFD5A600E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1759219675962.png" alt="szymon_dybczak_0-1759219675962.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;You can check it yourself by trying to use sparkContext on serverless compute &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 30 Sep 2025 08:08:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133326#M49800</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-30T08:08:56Z</dc:date>
    </item>
    <item>
      <title>Re: from pyspark.ml.stat import KolmogorovSmirnovTest is not working on Serverless compute.</title>
      <link>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133328#M49801</link>
      <description>&lt;P&gt;So, is there any way I can perform&amp;nbsp;&lt;SPAN&gt;KolmogorovSmirnovTest in serverless compute?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 30 Sep 2025 08:14:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133328#M49801</guid>
      <dc:creator>parthesh24</dc:creator>
      <dc:date>2025-09-30T08:14:39Z</dc:date>
    </item>
    <item>
      <title>Re: from pyspark.ml.stat import KolmogorovSmirnovTest is not working on Serverless compute.</title>
      <link>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133330#M49802</link>
      <description>&lt;P&gt;If we're talking about this&amp;nbsp;&lt;SPAN&gt;KolmogorovSmirnovTest&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN&gt;from this particular module (pyspark.ml.stat), then no. The reason is explained in the answer above.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;When you look at the source code, you can clearly see sparkContext being used, so if you want to use it you have to switch from serverless to classic compute:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;class KolmogorovSmirnovTest:
    """
    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous
    distribution.

    By comparing the largest difference between the empirical cumulative
    distribution of the sample data and the theoretical distribution we can provide a test for the
    null hypothesis that the sample data comes from that theoretical distribution.

    .. versionadded:: 2.4.0

    """

    @staticmethod
    def test(dataset: DataFrame, sampleCol: str, distName: str, *params: float) -&amp;gt; DataFrame:
        """
        Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution
        equality. Currently supports the normal distribution, taking as parameters the mean and
        standard deviation.

        .. versionadded:: 2.4.0

        Parameters
        ----------
        dataset : :py:class:`pyspark.sql.DataFrame`
            a Dataset or a DataFrame containing the sample of data to test.
        sampleCol : str
            Name of sample column in dataset, of any numerical type.
        distName : str
            a `string` name for a theoretical distribution; currently only "norm" is supported.
        params : float
            a list of `float` values specifying the parameters to be used for the theoretical
            distribution. For the "norm" distribution, the parameters are the mean and standard deviation.

        Returns
        -------
        A DataFrame that contains the Kolmogorov-Smirnov test result for the input sampled data.
        This DataFrame will contain a single Row with the following fields:

        - `pValue: Double`
        - `statistic: Double`

        Examples
        --------
        &amp;gt;&amp;gt;&amp;gt; from pyspark.ml.stat import KolmogorovSmirnovTest
        &amp;gt;&amp;gt;&amp;gt; dataset = [[-1.0], [0.0], [1.0]]
        &amp;gt;&amp;gt;&amp;gt; dataset = spark.createDataFrame(dataset, ['sample'])
        &amp;gt;&amp;gt;&amp;gt; ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).first()
        &amp;gt;&amp;gt;&amp;gt; round(ksResult.pValue, 3)
        1.0
        &amp;gt;&amp;gt;&amp;gt; round(ksResult.statistic, 3)
        0.175
        &amp;gt;&amp;gt;&amp;gt; dataset = [[2.0], [3.0], [4.0]]
        &amp;gt;&amp;gt;&amp;gt; dataset = spark.createDataFrame(dataset, ['sample'])
        &amp;gt;&amp;gt;&amp;gt; ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).first()
        &amp;gt;&amp;gt;&amp;gt; round(ksResult.pValue, 3)
        1.0
        &amp;gt;&amp;gt;&amp;gt; round(ksResult.statistic, 3)
        0.175
        """
        if is_remote():
            return invoke_helper_relation(
                "kolmogorovSmirnovTest",
                dataset,
                sampleCol,
                distName,
                ([float(p) for p in params], ArrayType(DoubleType())),
            )

        else:
            from pyspark.core.context import SparkContext

            sc = SparkContext._active_spark_context
            assert sc is not None

            javaTestObj = getattr(_jvm(), "org.apache.spark.ml.stat.KolmogorovSmirnovTest")
            dataset = _py2java(sc, dataset)
            params = [float(param) for param in params]  # type: ignore[assignment]
            return _java2py(
                sc,
                javaTestObj.test(
                    dataset,
                    sampleCol,
                    distName,
                    _jvm().PythonUtils.toSeq(params),
                ),
            )&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So you can try to search for a different implementation of KolmogorovSmirnovTest that doesn't use sparkContext. Try searching GitHub and maybe you'll find something that works for you. Maybe something like the one below:&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/Davi-Schumacher/KS-2Samp-PySparkSQL/blob/master/ks_2samp_sparksql.py" target="_blank"&gt;KS-2Samp-PySparkSQL/ks_2samp_sparksql.py at master · Davi-Schumacher/KS-2Samp-PySparkSQL · GitHub&lt;/A&gt;&lt;/P&gt;&lt;P&gt;And keep in mind that serverless has other limitations as well:&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/limitations" target="_blank"&gt;Serverless compute limitations - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 30 Sep 2025 08:50:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-pyspark-ml-stat-import-kolmogorovsmirnovtest-is-not-working/m-p/133330#M49802</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-30T08:50:01Z</dc:date>
    </item>
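To illustrate what "a different implementation that doesn't use sparkContext" could look like, here is a minimal sketch of the one-sample, two-sided KS statistic against a normal distribution, written with the Python standard library only. It is not the pyspark.ml.stat implementation and not the linked GitHub project: the helper names `normal_cdf` and `ks_statistic_normal` are hypothetical, the second parameter is assumed to be the standard deviation, and the p-value is left out. It also assumes the column values have already been brought to the driver (for example with an ordinary `collect()` on the selected column), which trades away the "stay in Spark" goal from the original question and is only reasonable for samples that fit in memory.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic_normal(values, mean=0.0, std=1.0):
    """One-sample, two-sided KS statistic against N(mean, std**2).

    Hypothetical helper for illustration; it returns only the statistic D,
    not a p-value.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mean, std)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so the largest
        # gap to the theoretical CDF can occur on either side of the jump.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Assumption: on serverless the values would come from the DataFrame, e.g.
# values = [row["value"] for row in df.select("value").collect()]
sample = [-1.0, 0.0, 1.0]
print(round(ks_statistic_normal(sample, 0.0, 1.0), 3))  # 0.175
```

For the three-point sample above the result agrees with the statistic shown in the docstring example quoted in the reply (0.175), which is a quick sanity check that the sketch computes the same quantity.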
  </channel>
</rss>

