<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic What's the best way to get from Python dict &amp;gt; JSON &amp;gt; PySpark and apply as a mapping to a dataframe? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-get-from-python-dict-gt-json-gt-pyspark/m-p/114325#M44786</link>
    <description>&lt;P&gt;I'm migrating code from Python Linux to Databricks PySpark. I have many mappings like this:&amp;nbsp;&lt;/P&gt;&lt;P&gt;{&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "main": {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "honda": 1.0,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "toyota": 2.9,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "BMW": 5.77,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "Fiat": 4.5,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; },&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;I exported using json.dump, saved to s3 and was able to import with spark.read.json, but that puts it into a dataframe with nested objects, like this. I can access the objects with select but don't know how to apply it as a mapping to another dataframe.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="397973_0-1743620626332.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15779i2D1965831C57A558/image-size/medium?v=v2&amp;amp;px=400" role="button" title="397973_0-1743620626332.png" alt="397973_0-1743620626332.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Another approach is to save as JSON Lines and read in as a dictionary? From what I understand PySpark prefers to have JSON lines rather than usual JSON, like this:&amp;nbsp;&lt;/P&gt;&lt;P&gt;{"main":{"honda":1,"toyota":2,"BMW":5,"Fiat":4}}&lt;/P&gt;&lt;P&gt;Ok so I manually created the dict and mapping. Given a dictionary car_map with the values, and a dataframe df:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map_col &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; F.&lt;/SPAN&gt;&lt;SPAN&gt;create_map&lt;/SPAN&gt;&lt;SPAN&gt;([F.&lt;/SPAN&gt;&lt;SPAN&gt;lit&lt;/SPAN&gt;&lt;SPAN&gt;(x) &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; i &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;car_map.&lt;/SPAN&gt;&lt;SPAN&gt;items&lt;/SPAN&gt;&lt;SPAN&gt;() &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; x &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; i])&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df.&lt;/SPAN&gt;&lt;SPAN&gt;withColumn&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;'x'&lt;/SPAN&gt;&lt;SPAN&gt;, map_col[F.&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;'car'&lt;/SPAN&gt;&lt;SPAN&gt;)]).&lt;/SPAN&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;So then how do I create the JSON lines in Python? It would be a list of dicts but I need to label them. That would make it a JSON rather than JSON lines?&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;What approach do you think is better, or is there another?&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 02 Apr 2025 19:13:24 GMT</pubDate>
    <dc:creator>397973</dc:creator>
    <dc:date>2025-04-02T19:13:24Z</dc:date>
    <item>
      <title>What's the best way to get from Python dict &gt; JSON &gt; PySpark and apply as a mapping to a dataframe?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-get-from-python-dict-gt-json-gt-pyspark/m-p/114325#M44786</link>
      <description>&lt;P&gt;I'm migrating code from Python Linux to Databricks PySpark. I have many mappings like this:&amp;nbsp;&lt;/P&gt;&lt;P&gt;{&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "main": {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "honda": 1.0,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "toyota": 2.9,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "BMW": 5.77,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "Fiat": 4.5,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; },&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;I exported using json.dump, saved to s3 and was able to import with spark.read.json, but that puts it into a dataframe with nested objects, like this. I can access the objects with select but don't know how to apply it as a mapping to another dataframe.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="397973_0-1743620626332.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15779i2D1965831C57A558/image-size/medium?v=v2&amp;amp;px=400" role="button" title="397973_0-1743620626332.png" alt="397973_0-1743620626332.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Another approach is to save as JSON Lines and read in as a dictionary? From what I understand PySpark prefers to have JSON lines rather than usual JSON, like this:&amp;nbsp;&lt;/P&gt;&lt;P&gt;{"main":{"honda":1,"toyota":2,"BMW":5,"Fiat":4}}&lt;/P&gt;&lt;P&gt;Ok so I manually created the dict and mapping. Given a dictionary car_map with the values, and a dataframe df:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map_col &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; F.&lt;/SPAN&gt;&lt;SPAN&gt;create_map&lt;/SPAN&gt;&lt;SPAN&gt;([F.&lt;/SPAN&gt;&lt;SPAN&gt;lit&lt;/SPAN&gt;&lt;SPAN&gt;(x) &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; i &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;car_map.&lt;/SPAN&gt;&lt;SPAN&gt;items&lt;/SPAN&gt;&lt;SPAN&gt;() &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; x &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; i])&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df.&lt;/SPAN&gt;&lt;SPAN&gt;withColumn&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;'x'&lt;/SPAN&gt;&lt;SPAN&gt;, map_col[F.&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;'car'&lt;/SPAN&gt;&lt;SPAN&gt;)]).&lt;/SPAN&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;So then how do I create the JSON lines in Python? It would be a list of dicts but I need to label them. That would make it a JSON rather than JSON lines?&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;What approach do you think is better, or is there another?&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 02 Apr 2025 19:13:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-get-from-python-dict-gt-json-gt-pyspark/m-p/114325#M44786</guid>
      <dc:creator>397973</dc:creator>
      <dc:date>2025-04-02T19:13:24Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to get from Python dict &gt; JSON &gt; PySpark and apply as a mapping to a d</title>
      <link>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-get-from-python-dict-gt-json-gt-pyspark/m-p/114439#M44825</link>
      <description>&lt;P&gt;For migrating your Python dictionary mappings to PySpark, you have several good options. Let's examine the approaches and identify the best solution.&lt;/P&gt;
&lt;P&gt;Using F.create_map (Your Current Approach)&lt;/P&gt;
&lt;P&gt;Your current approach using `F.create_map` is actually quite efficient:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;from pyspark.sql import functions as F&lt;/P&gt;
&lt;P&gt;# Your dictionary&lt;BR /&gt;car_map = {&lt;BR /&gt;"honda": 1.0,&lt;BR /&gt;"toyota": 2.9,&lt;BR /&gt;"BMW": 5.77,&lt;BR /&gt;"Fiat": 4.5&lt;BR /&gt;}&lt;/P&gt;
&lt;P&gt;Create a map column and apply it&lt;BR /&gt;map_col = F.create_map([F.lit(x) for i in car_map.items() for x in i])&lt;BR /&gt;df = df.withColumn('mapped_value', map_col[F.col('car')])&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;This is a clean approach that works well for smaller mappings and doesn't require any intermediate storage.&lt;/P&gt;
&lt;P&gt;Using Broadcast Join (Best for Large Mappings)&lt;/P&gt;
&lt;P&gt;For larger mappings, a broadcast join is often more efficient:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;Convert your dictionary to a DataFrame&lt;BR /&gt;map_df = spark.createDataFrame([(k, v) for k, v in car_map.items()], ["car", "value"])&lt;/P&gt;
&lt;P&gt;Broadcast join with your original DataFrame&lt;BR /&gt;df = df.join(F.broadcast(map_df), "car", "left").na.fill(0, ["value"])&lt;BR /&gt;```Using JSON Lines Approach&lt;/P&gt;
&lt;P&gt;If you prefer to store your mappings as files for reuse:&lt;/P&gt;
&lt;P&gt;1. Create JSON Lines from your Python dictionary:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;import json&lt;/P&gt;
&lt;P&gt;# For a simple flat dictionary&lt;BR /&gt;with open("car_mapping.jsonl", "w") as f:&lt;BR /&gt;for key, value in car_map.items():&lt;BR /&gt;f.write(json.dumps({"car": key, "value": value}) + "\n")&lt;/P&gt;
&lt;P&gt;# For nested dictionaries like your original example&lt;BR /&gt;nested_map = {"main": car_map}&lt;BR /&gt;with open("nested_car_mapping.jsonl", "w") as f:&lt;BR /&gt;for category, mappings in nested_map.items():&lt;BR /&gt;for key, value in mappings.items():&lt;BR /&gt;f.write(json.dumps({"category": category, "car": key, "value": value}) + "\n")&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;2. Read it in Databricks:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;map_df = spark.read.json("dbfs:/path/to/car_mapping.jsonl")&lt;BR /&gt;df = df.join(F.broadcast(map_df), "car", "left")&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;## Recommendation&lt;/P&gt;
&lt;P&gt;1. For small to medium mappings: Your current `F.create_map` approach is excellent - it's clean, efficient, and doesn't require intermediate storage.&lt;/P&gt;
&lt;P&gt;2. For large mappings (thousands of entries): Use the broadcast join approach, which scales better.&lt;/P&gt;
&lt;P&gt;3. For mappings that need to be reused across sessions: Store as JSON Lines and load with `spark.read.json()` followed by a broadcast join.&lt;/P&gt;
&lt;P&gt;The JSON Lines format (one JSON object per line) is indeed preferred in Spark over nested JSON, as it allows for parallel processing and is more efficient for distributed systems.&lt;/P&gt;
&lt;P&gt;If you need to maintain the nested structure, you can also use a UDF, but this is generally less efficient than the other approaches:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;from pyspark.sql.functions import udf&lt;BR /&gt;from pyspark.sql.types import DoubleType&lt;/P&gt;
&lt;P&gt;# Create a UDF that looks up values in your dictionary&lt;BR /&gt;@udf(returnType=DoubleType())&lt;BR /&gt;def map_car(car_name):&lt;BR /&gt;return car_map.get(car_name, 0.0)&lt;/P&gt;
&lt;P&gt;# Apply the UDF&lt;BR /&gt;df = df.withColumn("mapped_value", map_car(F.col("car")))&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;Overall, your `F.create_map` approach is already quite good for most use cases!&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2025 17:18:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-s-the-best-way-to-get-from-python-dict-gt-json-gt-pyspark/m-p/114439#M44825</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-03T17:18:47Z</dc:date>
    </item>
  </channel>
</rss>

