<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>A handy tool called spark-column-analyser in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/a-handy-tool-called-spark-column-analyser/m-p/70147#M85</link>
    <description>&lt;P&gt;I just wanted to share a tool I built called spark-column-analyzer. It's a Python package that helps you dig into your Spark DataFrames with ease.&lt;/P&gt;&lt;P&gt;Ever spend ages figuring out what's going on in your columns? How many null values are there? How many unique entries? Built with data preparation for Generative AI in mind, it aids in data imputation and augmentation, key steps for creating realistic synthetic data.&lt;/P&gt;&lt;P&gt;Basics&lt;/P&gt;&lt;P&gt;Effortless Column Analysis: Calculates the important stats for each column, such as null counts, distinct values, and percentages. No more manual counting or head scratching!&lt;BR /&gt;Simple to Use: Just pass in your DataFrame and call the analyze_column function for instant insights.&lt;BR /&gt;Makes Data Cleaning Easier: Knowing your data's quality helps you clean it up much faster. The package shows you where missing values are hiding and how much variety each column holds.&lt;BR /&gt;Skew Detection: Identifies skewed columns.&lt;BR /&gt;Open Source and Friendly: Feel free to tinker, suggest improvements, or contribute some code yourself! We love collaboration in the Spark community.&lt;/P&gt;&lt;P&gt;Installation&lt;/P&gt;&lt;P&gt;Using pip from &lt;A href="https://pypi.org/project/spark-column-analyzer/" target="_blank" rel="noopener"&gt;https://pypi.org/project/spark-column-analyzer/&lt;/A&gt;:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;pip install spark-column-analyzer&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;You can also clone the project from GitHub:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;git clone &lt;A href="https://github.com/michTalebzadeh/spark_column_analyzer.git" target="_blank" rel="noopener"&gt;https://github.com/michTalebzadeh/spark_column_analyzer.git&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The details are in the attached README file.&lt;/P&gt;&lt;P&gt;Let me know what you think! Feedback is always welcome.&lt;/P&gt;&lt;P&gt;An example has been added to the README on GitHub.&lt;/P&gt;&lt;P&gt;Analysis for column Postcode, JSON-formatted output:&lt;/P&gt;&lt;P&gt;{&lt;BR /&gt;"Postcode": {&lt;BR /&gt;"exists": true,&lt;BR /&gt;"num_rows": 93348,&lt;BR /&gt;"data_type": "string",&lt;BR /&gt;"null_count": 21921,&lt;BR /&gt;"null_percentage": 23.48,&lt;BR /&gt;"distinct_count": 38726,&lt;BR /&gt;"distinct_percentage": 41.49&lt;BR /&gt;}&lt;BR /&gt;}&lt;/P&gt;</description>
    <pubDate>Tue, 21 May 2024 21:45:42 GMT</pubDate>
    <dc:creator>MichTalebzadeh</dc:creator>
    <dc:date>2024-05-21T21:45:42Z</dc:date>
    <item>
      <title>A handy tool called spark-column-analyser</title>
      <link>https://community.databricks.com/t5/community-articles/a-handy-tool-called-spark-column-analyser/m-p/70147#M85</link>
      <description>&lt;P&gt;I just wanted to share a tool I built called spark-column-analyzer. It's a Python package that helps you dig into your Spark DataFrames with ease.&lt;/P&gt;&lt;P&gt;Ever spend ages figuring out what's going on in your columns? How many null values are there? How many unique entries? Built with data preparation for Generative AI in mind, it aids in data imputation and augmentation, key steps for creating realistic synthetic data.&lt;/P&gt;&lt;P&gt;Basics&lt;/P&gt;&lt;P&gt;Effortless Column Analysis: Calculates the important stats for each column, such as null counts, distinct values, and percentages. No more manual counting or head scratching!&lt;BR /&gt;Simple to Use: Just pass in your DataFrame and call the analyze_column function for instant insights.&lt;BR /&gt;Makes Data Cleaning Easier: Knowing your data's quality helps you clean it up much faster. The package shows you where missing values are hiding and how much variety each column holds.&lt;BR /&gt;Skew Detection: Identifies skewed columns.&lt;BR /&gt;Open Source and Friendly: Feel free to tinker, suggest improvements, or contribute some code yourself! We love collaboration in the Spark community.&lt;/P&gt;&lt;P&gt;Installation&lt;/P&gt;&lt;P&gt;Using pip from &lt;A href="https://pypi.org/project/spark-column-analyzer/" target="_blank" rel="noopener"&gt;https://pypi.org/project/spark-column-analyzer/&lt;/A&gt;:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;pip install spark-column-analyzer&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;You can also clone the project from GitHub:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;git clone &lt;A href="https://github.com/michTalebzadeh/spark_column_analyzer.git" target="_blank" rel="noopener"&gt;https://github.com/michTalebzadeh/spark_column_analyzer.git&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The details are in the attached README file.&lt;/P&gt;&lt;P&gt;Let me know what you think! Feedback is always welcome.&lt;/P&gt;&lt;P&gt;An example has been added to the README on GitHub.&lt;/P&gt;&lt;P&gt;Analysis for column Postcode, JSON-formatted output:&lt;/P&gt;&lt;P&gt;{&lt;BR /&gt;"Postcode": {&lt;BR /&gt;"exists": true,&lt;BR /&gt;"num_rows": 93348,&lt;BR /&gt;"data_type": "string",&lt;BR /&gt;"null_count": 21921,&lt;BR /&gt;"null_percentage": 23.48,&lt;BR /&gt;"distinct_count": 38726,&lt;BR /&gt;"distinct_percentage": 41.49&lt;BR /&gt;}&lt;BR /&gt;}&lt;/P&gt;</description>
      <pubDate>Tue, 21 May 2024 21:45:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/a-handy-tool-called-spark-column-analyser/m-p/70147#M85</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-05-21T21:45:42Z</dc:date>
    </item>
    <item>
      <title>Re: A handy tool called spark-column-analyser</title>
      <link>https://community.databricks.com/t5/community-articles/a-handy-tool-called-spark-column-analyser/m-p/70152#M86</link>
      <description>&lt;P&gt;An example has been added to the README on GitHub.&lt;/P&gt;&lt;P&gt;Analysis for column Postcode, JSON-formatted output:&lt;/P&gt;&lt;P&gt;{&lt;BR /&gt;"Postcode": {&lt;BR /&gt;"exists": true,&lt;BR /&gt;"num_rows": 93348,&lt;BR /&gt;"data_type": "string",&lt;BR /&gt;"null_count": 21921,&lt;BR /&gt;"null_percentage": 23.48,&lt;BR /&gt;"distinct_count": 38726,&lt;BR /&gt;"distinct_percentage": 41.49&lt;BR /&gt;}&lt;BR /&gt;}&lt;/P&gt;</description>
      <pubDate>Tue, 21 May 2024 17:59:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/a-handy-tool-called-spark-column-analyser/m-p/70152#M86</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-05-21T17:59:04Z</dc:date>
    </item>
  </channel>
</rss>
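The Postcode example above reports null_percentage and distinct_percentage rounded to two decimal places. A minimal sketch of how those figures follow from the raw counts in the sample JSON (plain Python, independent of the package itself; the helper name and the count / num_rows * 100 rounding convention are assumptions inferred from the sample output):

```python
def column_percentages(num_rows: int, null_count: int, distinct_count: int) -> dict:
    """Derive the percentage fields shown in the sample output:
    each is count / num_rows * 100, rounded to 2 decimal places."""
    return {
        "null_percentage": round(null_count / num_rows * 100, 2),
        "distinct_percentage": round(distinct_count / num_rows * 100, 2),
    }

# Counts taken from the Postcode example in the post
stats = column_percentages(num_rows=93348, null_count=21921, distinct_count=38726)
print(stats)  # {'null_percentage': 23.48, 'distinct_percentage': 41.49}
```

A high null_percentage (here 23.48%) flags a column needing imputation, while distinct_percentage hints at cardinality, which is useful when judging skew.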

