<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Difference between Databricks and local pyspark split. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5876#M2149</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;In Spark 3.0 and later versions, the default behavior of the split() function with an empty delimiter is to include an empty string at the end of the resulting array, which is why your local install shows a size of 4.&lt;/P&gt;</description>
    <pubDate>Fri, 14 Apr 2023 07:22:51 GMT</pubDate>
    <dc:creator>JAHNAVI</dc:creator>
    <dc:date>2023-04-14T07:22:51Z</dc:date>
    <item>
      <title>Difference between Databricks and local pyspark split.</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5875#M2148</link>
      <description>&lt;P&gt;I have noticed inconsistent behavior when calling the 'split' function on Databricks and on my local installation.&lt;/P&gt;&lt;P&gt;Running this in a Databricks notebook:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.sql("SELECT split('abc', ''), size(split('abc',''))").show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;gives the following result:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/358i6194CDABBDB77312/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;So the string is split into 3 parts.&lt;/P&gt;&lt;P&gt;If I run the same query on my local install, I get 4 parts.&lt;/P&gt;&lt;P&gt;Locally I'm running PySpark 3.2.1; on Databricks I've tried multiple versions, but they all give the same result.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 07:13:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5875#M2148</guid>
      <dc:creator>Merchiv</dc:creator>
      <dc:date>2023-04-14T07:13:55Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between Databricks and local pyspark split.</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5876#M2149</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;In Spark 3.0 and later versions, the default behavior of the split() function with an empty delimiter is to include an empty string at the end of the resulting array, which is why your local install shows a size of 4.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 07:22:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5876#M2149</guid>
      <dc:creator>JAHNAVI</dc:creator>
      <dc:date>2023-04-14T07:22:51Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between Databricks and local pyspark split.</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5877#M2150</link>
      <description>&lt;P&gt;@Ivo Merchiers:&lt;/P&gt;&lt;P&gt;The behavior you are seeing is likely due to a difference in the underlying Apache Spark version between your local installation and Databricks. split() is one of Spark's SQL functions, and its implementation can differ between Spark versions. You mentioned that you are using PySpark 3.2.1 locally. To confirm which version is installed, run the following in your PySpark shell:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark
print(pyspark.__version__)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;You can then check the SQL functions documentation for that Spark version to see how split() is expected to behave. On Databricks, you can check the Spark version by running:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.version&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;If split() returns different results locally and on Databricks, you may need to adjust your code to handle both behaviors or use the same Spark version in both environments.&lt;/P&gt;</description>
      <pubDate>Sun, 16 Apr 2023 07:26:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5877#M2150</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-04-16T07:26:44Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between Databricks and local pyspark split.</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5878#M2151</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;My Databricks cluster runs Spark 3.3, but it still gives a length of 3.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/361iB7F4D651A9605B26/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Is there something different about the Databricks implementation of PySpark, or should it follow the same standard?&lt;/P&gt;</description>
      <pubDate>Mon, 17 Apr 2023 08:34:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5878#M2151</guid>
      <dc:creator>Merchiv</dc:creator>
      <dc:date>2023-04-17T08:34:43Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between Databricks and local pyspark split.</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5879#M2152</link>
      <description>&lt;P&gt;Thank you for the suggestion, but even with the same Spark version there seems to be a difference between what happens locally and what happens on a Databricks cluster.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Apr 2023 08:36:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-databricks-and-local-pyspark-split/m-p/5879#M2152</guid>
      <dc:creator>Merchiv</dc:creator>
      <dc:date>2023-04-17T08:36:20Z</dc:date>
    </item>
  </channel>
</rss>

