<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: I want to split a dataframe with date range 1 week, with each week data in different column. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/i-want-to-split-a-dataframe-with-date-range-1-week-with-each/m-p/29081#M20838</link>
<description>&lt;P&gt;It should just be a matter of applying the correct set of transformations:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;You can start by adding the week-of-year to each record with &lt;B&gt;pyspark.sql.functions.weekofyear(..)&lt;/B&gt; and name the new column something like &lt;B&gt;weekOfYear&lt;/B&gt;. See &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.weekofyear" alt="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.weekofyear" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.weekofyear&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;To get the day-of-week, you can use &lt;B&gt;pyspark.sql.functions.date_format(..)&lt;/B&gt; with the format "u", which yields the day number of the week (1 = Monday through 7 = Sunday), and name it something like &lt;B&gt;dayOfWeek&lt;/B&gt;. See &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.date_format" alt="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.date_format" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.date_format&lt;/A&gt; and then &lt;A href="https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html" alt="https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html" target="_blank"&gt;https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html&lt;/A&gt;.&lt;/LI&gt;&lt;LI&gt;Next would be to add one column for each day of the week (seven steps in all).&lt;UL&gt;&lt;LI&gt;The first part would be to use pyspark.sql.functions.&lt;B&gt;when(&lt;I&gt;condition&lt;/I&gt;, &lt;I&gt;value&lt;/I&gt;)&lt;/B&gt;. See &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.when" alt="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.when" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.when&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;The condition would be something like &lt;B&gt;col("dayOfWeek") == "1"&lt;/B&gt; and the value would be &lt;B&gt;columnWhatever&lt;/B&gt;.&lt;/LI&gt;&lt;LI&gt;You can then append to the &lt;B&gt;when(..)&lt;/B&gt; call an &lt;B&gt;otherwise(..)&lt;/B&gt; expression with &lt;B&gt;0&lt;/B&gt;, as in &lt;B&gt;when(condition, value).otherwise(0)&lt;/B&gt;.&lt;/LI&gt;&lt;LI&gt;Repeat that six more times, once for each remaining day of the week.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;At this stage the data will look something like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="screenshot-10.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2528i4107DE577223557F/image-size/large?v=v2&amp;amp;px=999" role="button" title="screenshot-10.png" alt="screenshot-10.png" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The real trick is flattening it. All that should be required is to group by &lt;B&gt;weekOfYear&lt;/B&gt; and then select the sum of each day column (Sun through Sat); that last call would look something like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;(df.groupBy("weekOfYear")
   .sum("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;--------------- UPDATE ------------&lt;/P&gt;&lt;P&gt;The solution involving &lt;B&gt;sum(..)&lt;/B&gt; works if you have numerical data. If you have non-numerical data, you need a slightly different technique: you can still group by &lt;B&gt;weekOfYear&lt;/B&gt;, but instead of &lt;B&gt;sum(..)&lt;/B&gt; you can use &lt;B&gt;agg(first(...), first(...), ...)&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;Thanks to Adam for the suggestion of using &lt;B&gt;sum(..)&lt;/B&gt; and to @doug for the suggestion of using &lt;B&gt;agg(first(..))&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;While we were playing with this, I put together a sample notebook that demonstrates &lt;A href="https://databricks-prod-cloudfront.cloud.databricks.com/public/13fe59d17777de29f8a2ffdf85f52925/5638528096339357/2070931/8598139271121446/latest.html" alt="https://databricks-prod-cloudfront.cloud.databricks.com/public/13fe59d17777de29f8a2ffdf85f52925/5638528096339357/2070931/8598139271121446/latest.html" target="_blank"&gt;both solutions&lt;/A&gt;.&lt;/P&gt;</description>
    <pubDate>Wed, 29 Nov 2017 00:24:00 GMT</pubDate>
    <dc:creator>User16857281974</dc:creator>
    <dc:date>2017-11-29T00:24:00Z</dc:date>
    <item>
      <title>I want to split a dataframe with date range 1 week, with each week data in different column.</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-split-a-dataframe-with-date-range-1-week-with-each/m-p/29080#M20837</link>
<description>&lt;P&gt;DF:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;Q   Date(yyyy-mm-dd)
q1  2017-10-01
q2  2017-10-03
q1  2017-10-09
q3  2017-10-06
q2  2017-10-01
q1  2017-10-13
Q1  2017-10-02
Q3  2017-10-21
Q4  2017-10-17
Q5  2017-10-20
Q4  2017-10-31
Q2  2017-10-27
Q5  2017-10-01&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Desired dataframe:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;Q   Count(week 1)  Count(week 2)  Count(week 3)  Count(week 4)  Avg(counts)    Standard deviation of the counts
Q1  2              2              0              0              As applicable  As applicable
Q2  2              0              0              1              "              "
Q3  1              0              1              0              "              "
Q4  0              0              1              1
Q5  1              0              1              0              "              "&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 14 Nov 2017 05:09:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-split-a-dataframe-with-date-range-1-week-with-each/m-p/29080#M20837</guid>
      <dc:creator>kkarthik</dc:creator>
      <dc:date>2017-11-14T05:09:37Z</dc:date>
    </item>
    <item>
      <title>Re: I want to split a dataframe with date range 1 week, with each week data in different column.</title>
      <link>https://community.databricks.com/t5/data-engineering/i-want-to-split-a-dataframe-with-date-range-1-week-with-each/m-p/29081#M20838</link>
<description>&lt;P&gt;It should just be a matter of applying the correct set of transformations:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;You can start by adding the week-of-year to each record with &lt;B&gt;pyspark.sql.functions.weekofyear(..)&lt;/B&gt; and name the new column something like &lt;B&gt;weekOfYear&lt;/B&gt;. See &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.weekofyear" alt="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.weekofyear" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.weekofyear&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;To get the day-of-week, you can use &lt;B&gt;pyspark.sql.functions.date_format(..)&lt;/B&gt; with the format "u", which yields the day number of the week (1 = Monday through 7 = Sunday), and name it something like &lt;B&gt;dayOfWeek&lt;/B&gt;. See &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.date_format" alt="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.date_format" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.date_format&lt;/A&gt; and then &lt;A href="https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html" alt="https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html" target="_blank"&gt;https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html&lt;/A&gt;.&lt;/LI&gt;&lt;LI&gt;Next would be to add one column for each day of the week (seven steps in all).&lt;UL&gt;&lt;LI&gt;The first part would be to use pyspark.sql.functions.&lt;B&gt;when(&lt;I&gt;condition&lt;/I&gt;, &lt;I&gt;value&lt;/I&gt;)&lt;/B&gt;. See &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.when" alt="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.when" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.when&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;The condition would be something like &lt;B&gt;col("dayOfWeek") == "1"&lt;/B&gt; and the value would be &lt;B&gt;columnWhatever&lt;/B&gt;.&lt;/LI&gt;&lt;LI&gt;You can then append to the &lt;B&gt;when(..)&lt;/B&gt; call an &lt;B&gt;otherwise(..)&lt;/B&gt; expression with &lt;B&gt;0&lt;/B&gt;, as in &lt;B&gt;when(condition, value).otherwise(0)&lt;/B&gt;.&lt;/LI&gt;&lt;LI&gt;Repeat that six more times, once for each remaining day of the week.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;At this stage the data will look something like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="screenshot-10.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2528i4107DE577223557F/image-size/large?v=v2&amp;amp;px=999" role="button" title="screenshot-10.png" alt="screenshot-10.png" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The real trick is flattening it. All that should be required is to group by &lt;B&gt;weekOfYear&lt;/B&gt; and then select the sum of each day column (Sun through Sat); that last call would look something like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;(df.groupBy("weekOfYear")
   .sum("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;--------------- UPDATE ------------&lt;/P&gt;&lt;P&gt;The solution involving &lt;B&gt;sum(..)&lt;/B&gt; works if you have numerical data. If you have non-numerical data, you need a slightly different technique: you can still group by &lt;B&gt;weekOfYear&lt;/B&gt;, but instead of &lt;B&gt;sum(..)&lt;/B&gt; you can use &lt;B&gt;agg(first(...), first(...), ...)&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;Thanks to Adam for the suggestion of using &lt;B&gt;sum(..)&lt;/B&gt; and to @doug for the suggestion of using &lt;B&gt;agg(first(..))&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;While we were playing with this, I put together a sample notebook that demonstrates &lt;A href="https://databricks-prod-cloudfront.cloud.databricks.com/public/13fe59d17777de29f8a2ffdf85f52925/5638528096339357/2070931/8598139271121446/latest.html" alt="https://databricks-prod-cloudfront.cloud.databricks.com/public/13fe59d17777de29f8a2ffdf85f52925/5638528096339357/2070931/8598139271121446/latest.html" target="_blank"&gt;both solutions&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 00:24:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-want-to-split-a-dataframe-with-date-range-1-week-with-each/m-p/29081#M20838</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-29T00:24:00Z</dc:date>
    </item>
  </channel>
</rss>

