<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic how does the data science workflow change in Databricks if you start with a NoSQL database (specifically a document store) instead of a more traditional/RDBMS-type source? in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-does-the-data-science-workflow-change-in-databricks-if-you/m-p/10684#M512</link>
    <description>&lt;P&gt;I'm sorry if this is a bad question. The tl;dr is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;B&gt;Are there any concrete examples of NoSQL data science workflows, specifically in Databricks, and if so, what are they?&lt;/B&gt;&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Is it always the case that our end goal is a DataFrame?&lt;/B&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;For us, we start with a bunch of Parquet files in Azure Blob Storage, construct a Hive metastore on top of that, and from there, in either PySpark or Spark SQL, it behaves like a traditional RDBMS. I think this counts as SQL, right? Or, if the data is NoSQL, is our goal to turn it into a SQL format as quickly as possible?&lt;/P&gt;&lt;P&gt;If I started with data in a document-store format in Azure Blob Storage, or connected to MongoDB, does the downstream process after reading the raw data change? I'm visualizing the current process as:&lt;/P&gt;&lt;P&gt;[raw data] -&amp;gt; [transform data] -&amp;gt; [clean/standardized data] -&amp;gt; [training/selection/deployment/anything after]&lt;/P&gt;&lt;P&gt;If this is still relevant with a document-store database, is the [clean/standardized] step always a DataFrame, or is a DataFrame just one of the possible inputs to the machine learning process? If so, how common is a DataFrame as an input compared to other formats? Any concrete example of a workflow with NoSQL would be extremely helpful.&lt;/P&gt;&lt;P&gt;For scoring, I'd imagine a document store would be the ideal input format.&lt;/P&gt;&lt;P&gt;My background is in statistics, so I've always gotten a clean table as an input, and in my job right now my conception has always been "get to a clean table and then do data science on that." I'm just wondering if that's too narrow a view of how the data can flow.&lt;/P&gt;&lt;P&gt;I've searched so many permutations of the keywords and am getting nowhere.&lt;/P&gt;</description>
    <pubDate>Wed, 25 Jan 2023 17:24:54 GMT</pubDate>
    <dc:creator>jonathan-dufaul</dc:creator>
    <dc:date>2023-01-25T17:24:54Z</dc:date>
    <item>
      <title>how does the data science workflow change in Databricks if you start with a NoSQL database (specifically a document store) instead of a more traditional/RDBMS-type source?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-does-the-data-science-workflow-change-in-databricks-if-you/m-p/10684#M512</link>
      <description>&lt;P&gt;I'm sorry if this is a bad question. The tl;dr is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;B&gt;Are there any concrete examples of NoSQL data science workflows, specifically in Databricks, and if so, what are they?&lt;/B&gt;&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Is it always the case that our end goal is a DataFrame?&lt;/B&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;For us, we start with a bunch of Parquet files in Azure Blob Storage, construct a Hive metastore on top of that, and from there, in either PySpark or Spark SQL, it behaves like a traditional RDBMS. I think this counts as SQL, right? Or, if the data is NoSQL, is our goal to turn it into a SQL format as quickly as possible?&lt;/P&gt;&lt;P&gt;If I started with data in a document-store format in Azure Blob Storage, or connected to MongoDB, does the downstream process after reading the raw data change? I'm visualizing the current process as:&lt;/P&gt;&lt;P&gt;[raw data] -&amp;gt; [transform data] -&amp;gt; [clean/standardized data] -&amp;gt; [training/selection/deployment/anything after]&lt;/P&gt;&lt;P&gt;If this is still relevant with a document-store database, is the [clean/standardized] step always a DataFrame, or is a DataFrame just one of the possible inputs to the machine learning process? If so, how common is a DataFrame as an input compared to other formats? Any concrete example of a workflow with NoSQL would be extremely helpful.&lt;/P&gt;&lt;P&gt;For scoring, I'd imagine a document store would be the ideal input format.&lt;/P&gt;&lt;P&gt;My background is in statistics, so I've always gotten a clean table as an input, and in my job right now my conception has always been "get to a clean table and then do data science on that." I'm just wondering if that's too narrow a view of how the data can flow.&lt;/P&gt;&lt;P&gt;I've searched so many permutations of the keywords and am getting nowhere.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jan 2023 17:24:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-does-the-data-science-workflow-change-in-databricks-if-you/m-p/10684#M512</guid>
      <dc:creator>jonathan-dufaul</dc:creator>
      <dc:date>2023-01-25T17:24:54Z</dc:date>
    </item>
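    <!--
    A minimal sketch of what the "document store first" path can look like on Databricks,
    assuming the MongoDB Spark Connector (10.x) is installed on the cluster; the URI,
    database, collection, and field names below are placeholders for illustration, not
    values from the post. The read step changes, but everything downstream still converges
    on a Spark DataFrame feeding the usual [transform] -> [clean] -> [train] pipeline.

    from pyspark.sql import functions as F

    # Read documents straight from MongoDB (placeholder connection details).
    raw = (
        spark.read.format("mongodb")
        .option("connection.uri", "mongodb+srv://<user>:<password>@<cluster>/")
        .option("database", "shop")
        .option("collection", "orders")
        .load()
    )

    # Or, if the documents are JSON files landed in Azure Blob Storage / ADLS:
    # raw = spark.read.json("abfss://<container>@<account>.dfs.core.windows.net/raw/orders/")

    # Nested documents arrive as struct and array columns; flatten only what the model needs.
    flat = (
        raw
        .select(
            F.col("_id").alias("order_id"),
            F.col("customer.id").alias("customer_id"),
            F.explode("items").alias("item"),
        )
        .select("order_id", "customer_id", "item.sku", "item.qty", "item.price")
    )

    # From here the workflow matches the Parquet/Hive one: persist a clean Delta table, then train on it.
    flat.write.format("delta").mode("overwrite").saveAsTable("clean.order_items")
    -->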
    <item>
      <title>Re: how does the data science workflow change in Databricks if you start with a NoSQL database (specifically a document store) instead of a more traditional/RDBMS-type source?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-does-the-data-science-workflow-change-in-databricks-if-you/m-p/10685#M513</link>
      <description>&lt;P&gt;Thanks for sharing!&lt;/P&gt;</description>
      <pubDate>Tue, 31 Jan 2023 13:18:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-does-the-data-science-workflow-change-in-databricks-if-you/m-p/10685#M513</guid>
      <dc:creator>Nhan_Nguyen</dc:creator>
      <dc:date>2023-01-31T13:18:18Z</dc:date>
    </item>
  </channel>
</rss>

