<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic What is the best way to do EDA in Databricks? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-do-eda-in-databricks/m-p/32688#M23828</link>
    <description>&lt;P&gt;Are there example notebooks to quick-start exploratory data analysis?&lt;/P&gt;</description>
    <pubDate>Mon, 20 Dec 2021 22:23:29 GMT</pubDate>
    <dc:creator>Hayley</dc:creator>
    <dc:date>2021-12-20T22:23:29Z</dc:date>
    <item>
      <title>What is the best way to do EDA in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-do-eda-in-databricks/m-p/32688#M23828</link>
      <description>&lt;P&gt;Are there example notebooks to quick-start exploratory data analysis?&lt;/P&gt;</description>
      <pubDate>Mon, 20 Dec 2021 22:23:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-do-eda-in-databricks/m-p/32688#M23828</guid>
      <dc:creator>Hayley</dc:creator>
      <dc:date>2021-12-20T22:23:29Z</dc:date>
    </item>
    <item>
      <title>Re: What is the best way to do EDA in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-do-eda-in-databricks/m-p/32689#M23829</link>
      <description>&lt;P&gt;A quick way to start exploratory data analysis is to use the EDA notebook that is created when you use &lt;A href="https://databricks.com/blog/2021/05/27/introducing-databricks-automl-a-glass-box-approach-to-automating-machine-learning-development.html" alt="https://databricks.com/blog/2021/05/27/introducing-databricks-automl-a-glass-box-approach-to-automating-machine-learning-development.html" target="_blank"&gt;Databricks AutoML&lt;/A&gt;. You can use the generated notebook as is, or as a starting point for modeling. You’ll need a cluster running Databricks Runtime 10.0 ML or above.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To get the EDA notebook:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;go to the Machine Learning persona, then select Start AutoML on the homepage&lt;/LI&gt;&lt;LI&gt;select your cluster&lt;/LI&gt;&lt;LI&gt;select the type of model you are building (Classification, Regression, Forecasting)&lt;/LI&gt;&lt;LI&gt;select your training data from a database in your metastore&lt;/LI&gt;&lt;LI&gt;select the field that is your prediction target&lt;/LI&gt;&lt;LI&gt;click Start AutoML&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When the AutoML run starts, it will autogenerate an EDA notebook based on a sample of your data. Because this runtime uses Spark 3.2.x, the Koalas library is built into PySpark as the pandas API on Spark, so pandas-style code scales better. The notebook uses the &lt;A href="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/" alt="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/" target="_blank"&gt;pandas-profiling&lt;/A&gt; library, so you can edit the library’s options for additional analysis. In addition to profiling, you get feature interactions, correlations, and missing values. Even if you plan on using a different runtime for your modeling in production, this is a handy shortcut for EDA.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Dec 2021 22:25:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-do-eda-in-databricks/m-p/32689#M23829</guid>
      <dc:creator>Hayley</dc:creator>
      <dc:date>2021-12-20T22:25:21Z</dc:date>
    </item>
  </channel>
</rss>

