
A handy tool called spark-column-analyzer

MichTalebzadeh
Contributor III

I just wanted to share a tool I built called spark-column-analyzer. It's a Python package that helps you dig into your Spark DataFrames with ease.

Ever spend ages figuring out what's going on in your columns? Like, how many null values are there, or how many unique entries? Built with data preparation for Generative AI in mind, it aids in data imputation and augmentation – key steps for creating realistic synthetic data.

Basics

Effortless Column Analysis: It calculates all the important stats you need for each column, like null counts, distinct values, percentages, and more. No more manual counting or head scratching!
Simple to Use: Just toss in your DataFrame and call the analyze_column function. Bam! Insights galore.
Makes Data Cleaning Easier: Knowing your data's quality helps you clean it up way faster. This package helps you figure out where the missing values are hiding and how much variety you've got in each column.
Skew Detection: It can also help you detect skewed columns.
Open Source and Friendly: Feel free to tinker, suggest improvements, or even contribute some code yourself! We love collaboration in the Spark community.
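
The post does not say how the package detects skewed columns, so the following is only a rough illustration of one common heuristic, not the package's own method. It flags a column as skewed when a single value accounts for a large share of the non-null rows. The function names and the 50% threshold are assumptions made up for this sketch, and it works on a plain Python list so it runs without a Spark session.

```python
from collections import Counter


def top_value_share(values):
    """Return the most common non-null value and its share of non-null rows (as a percentage)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None, 0.0
    value, count = Counter(non_null).most_common(1)[0]
    return value, round(100.0 * count / len(non_null), 2)


def looks_skewed(values, threshold_pct=50.0):
    """Hypothetical rule of thumb: skewed if one value covers at least threshold_pct of non-null rows."""
    _, share = top_value_share(values)
    return share >= threshold_pct


if __name__ == "__main__":
    sample = ["A"] * 8 + ["B", "C", None]
    print(top_value_share(sample))
    print(looks_skewed(sample))
```

On a real Spark DataFrame the same idea would normally be expressed as a groupBy on the column followed by a count, rather than by collecting the values to the driver.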

Installation

Install with pip from PyPI (https://pypi.org/project/spark-column-analyzer/):

pip install spark-column-analyzer

You can also clone the project from GitHub:

git clone https://github.com/michTalebzadeh/spark_column_analyzer.git

The details are in the attached README file.

Let me know what you think! Feedback is always welcome.

An example has been added to the README on GitHub:

Doing analysis for column Postcode

Json formatted output

{
  "Postcode": {
    "exists": true,
    "num_rows": 93348,
    "data_type": "string",
    "null_count": 21921,
    "null_percentage": 23.48,
    "distinct_count": 38726,
    "distinct_percentage": 41.49
  }
}
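
For readers curious how figures like these can be derived, here is a minimal sketch that uses only standard PySpark DataFrame calls. It is not the package's implementation: collect_counts and build_report are hypothetical names introduced for illustration, and the output layout simply mirrors the JSON above. One assumption worth flagging is that countDistinct ignores nulls, so the distinct count may differ by one from a tool that treats null as its own value.

```python
import json


def collect_counts(df, column):
    """Gather row, null and distinct counts for one column in a single aggregation."""
    # Imported lazily so build_report below can be used without Spark installed.
    from pyspark.sql import functions as F

    row = df.agg(
        F.count(F.lit(1)).alias("num_rows"),
        # when() without otherwise() yields null for non-matching rows, which count() skips.
        F.count(F.when(F.col(column).isNull(), 1)).alias("null_count"),
        # Assumption: nulls are not counted as a distinct value here.
        F.countDistinct(F.col(column)).alias("distinct_count"),
    ).first()
    data_type = dict(df.dtypes)[column]
    return row["num_rows"], row["null_count"], row["distinct_count"], data_type


def build_report(column, num_rows, null_count, distinct_count, data_type):
    """Turn raw counts into a dict shaped like the JSON output shown in the post."""

    def pct(part):
        return round(100.0 * part / num_rows, 2) if num_rows else 0.0

    return {
        column: {
            "exists": True,
            "num_rows": num_rows,
            "data_type": data_type,
            "null_count": null_count,
            "null_percentage": pct(null_count),
            "distinct_count": distinct_count,
            "distinct_percentage": pct(distinct_count),
        }
    }


if __name__ == "__main__":
    # Counts taken from the Postcode example in the post; no Spark session needed for this part.
    print(json.dumps(build_report("Postcode", 93348, 21921, 38726, "string"), indent=2))
    # With a real DataFrame `df`, the counts would come from:
    #   build_report("Postcode", *collect_counts(df, "Postcode"))
```

Splitting the Spark aggregation from the percentage arithmetic keeps the reporting step easy to check by hand: 21921 / 93348 and 38726 / 93348 give the 23.48 and 41.49 shown above.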

 

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

View my LinkedIn profile



https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun).