<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: pySpark Dataframe to DeepLearning model in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26168#M18281</link>
    <description>&lt;P&gt;Thank you, I have built a solution based on your idea.&lt;/P&gt;</description>
    <pubDate>Thu, 10 Mar 2022 08:39:28 GMT</pubDate>
    <dc:creator>imgaboy</dc:creator>
    <dc:date>2022-03-10T08:39:28Z</dc:date>
    <item>
      <title>pySpark Dataframe to DeepLearning model</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26165#M18278</link>
      <description>&lt;P&gt;I have a large time series in which many measuring stations record the same 5 measurements (temperature, humidity, etc.). I want to predict a future point in time with a time-series model, for which I pass the data from all the measuring stations to the Deep Learning model. E.g. I have 100 days of recorded data from 1000 measuring stations, each reporting 5 values. My Spark table looks something like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2047i550A328D8E681418/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;How can I efficiently transpose my data with Spark and pandas into something like this structure?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2046i30D3E6DB45F390F8/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Sample code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import random

# `spark` is the SparkSession that Databricks notebooks provide automatically.
data = []
for node in range(0, 100):
    for day in range(0, 100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),       # Temp
                     random.randrange(50, 100, 1),      # hum
                     random.randrange(1000, 1045, 1)])  # press

df = spark.createDataFrame(data, ['Node', 'day', 'Temp', 'hum', 'press'])
display(df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I don't process the data as a whole; I use a window between two dates, say 10, 20, or 30 time steps.&lt;/P&gt;&lt;P&gt;A naive solution would be to pack everything into memory and transform it there. I can do that, but I don't know how to do it efficiently: my original dataset is a Parquet file with 8 columns and 6 million rows (2000 nodes), so loading it all into memory and transposing it would give me a table of 30,000 rows (time) by 2000 * 8 columns.&lt;/P&gt;&lt;P&gt;&lt;B&gt;So my question is how to load and transform such data efficiently&lt;/B&gt;. I have Parquet data on disk that I load with Spark, and I will pass the processed data to a Deep Learning model.&lt;/P&gt;
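&lt;P&gt;For clarity, here is a minimal sketch of the target wide layout, built with a pandas pivot on a small sample (illustration only; the full dataset would not fit in memory this way):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pandas as pd

# Small sample only, to illustrate the desired shape.
pdf = df.limit(1000).toPandas()
wide = pdf.pivot(index='day', columns='Node', values=['Temp', 'hum', 'press'])
# 'wide' has one row per day and one column per (measurement, Node) pair.
print(wide.shape)&lt;/CODE&gt;&lt;/PRE&gt;</description>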
      <pubDate>Tue, 08 Mar 2022 18:11:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26165#M18278</guid>
      <dc:creator>imgaboy</dc:creator>
      <dc:date>2022-03-08T18:11:01Z</dc:date>
    </item>
    <item>
      <title>Re: pySpark Dataframe to DeepLearning model</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26166#M18279</link>
      <description>&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import first

df.groupBy("day").pivot("Node").agg(first("Temp"))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This is a classic crosstab conversion, so pivot will help. Example above.&lt;/P&gt;
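&lt;P&gt;If you need all the measurements at once, the same pattern extends to multiple aggregations; a sketch, assuming the column names from your sample code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import first

# One column per (Node, measurement) pair, e.g. 0_Temp, 0_hum, 0_press, 1_Temp, ...
wide_df = (df.groupBy("day")
             .pivot("Node")
             .agg(first("Temp").alias("Temp"),
                  first("hum").alias("hum"),
                  first("press").alias("press"))
             .orderBy("day"))&lt;/CODE&gt;&lt;/PRE&gt;</description>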
      <pubDate>Tue, 08 Mar 2022 19:02:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26166#M18279</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-08T19:02:04Z</dc:date>
    </item>
    <item>
      <title>Re: pySpark Dataframe to DeepLearning model</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26167#M18280</link>
      <description>&lt;P&gt;I am not really sure how this solves my problem. Will it solve the memory overload problem, and will the model get the right data? Besides, filtering the data by timestamps in Spark slows down processing considerably. Is there some kind of generator for Spark, like the Keras data generator?&lt;/P&gt;
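&lt;P&gt;(One commonly suggested option for this is Petastorm's Spark converter, which streams a Spark DataFrame into TensorFlow batches; a minimal sketch, assuming petastorm is installed, a Keras model named model, and an example cache path:)&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Cache directory for the materialized Parquet files (example path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

converter = make_spark_converter(wide_df)  # wide_df: the pivoted DataFrame
with converter.make_tf_dataset(batch_size=32, num_epochs=1) as dataset:
    # Depending on the model, a dataset.map(...) step may be needed
    # to split each batch into (features, labels).
    model.fit(dataset, epochs=10)&lt;/CODE&gt;&lt;/PRE&gt;</description>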
      <pubDate>Tue, 08 Mar 2022 20:21:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26167#M18280</guid>
      <dc:creator>imgaboy</dc:creator>
      <dc:date>2022-03-08T20:21:50Z</dc:date>
    </item>
    <item>
      <title>Re: pySpark Dataframe to DeepLearning model</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26168#M18281</link>
      <description>&lt;P&gt;Thank you, I have built a solution based on your idea.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 08:39:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26168#M18281</guid>
      <dc:creator>imgaboy</dc:creator>
      <dc:date>2022-03-10T08:39:28Z</dc:date>
    </item>
    <item>
      <title>Re: pySpark Dataframe to DeepLearning model</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26169#M18282</link>
      <description>&lt;P&gt;Great. Could you select my answer as the best one?&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 13:26:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-to-deeplearning-model/m-p/26169#M18282</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-10T13:26:03Z</dc:date>
    </item>
  </channel>
</rss>

