
Tracking changes in data distribution by using pyspark

sridhar0109
New Contributor

Hi All,

I'm working on a data quality dashboard. I've created a few rules, such as checking for nulls in a column, checking the data type of a column, and removing duplicates.

We follow the medallion architecture. We apply these data quality checks to the bronze table and insert only the rows that pass them.

Now I want to track the distribution of a column over time. For example, given sales data for different car models, I'd like to see how the sales of each model are distributed over a period of time.

Could you please suggest any out-of-the-box libraries for this task?

Thanks!

2 REPLIES

Anonymous
Not applicable

@Sridhar Varanasi:

Here are a few options you might want to consider:

  1. pandas: pandas is a popular Python library for data manipulation and analysis. It provides tools for data cleaning and wrangling, plus built-in functions for analyzing data over time. You can load your data into a DataFrame and then group and aggregate it to calculate the distribution of a column over time.
  2. seaborn: seaborn is a Python data visualization library built on matplotlib. It provides a high-level interface for statistical graphics. You can use it to create line charts, bar charts, and heatmaps that show how the distribution of a column changes over time.
  3. Plotly: Plotly is a data visualization library for creating interactive, web-based charts and dashboards. It offers a wide range of chart types and customization options, and the resulting visualizations are easy to share. You can use it to create interactive line charts, scatter plots, and other charts that track the distribution of a column over time.
  4. Apache Superset: Apache Superset is an open-source data visualization and exploration platform with a web-based interface for building interactive dashboards. It supports a wide range of data sources and built-in visualization types, including time-series charts, histograms, and scatter plots. You can use it to build custom dashboards that track the distribution of a column over time.

These are just a few examples of the libraries available for creating data quality dashboards and tracking the distribution of a column over time. Depending on your specific requirements and the complexity of your data, you may need to use a combination of these libraries or other tools to achieve your desired results.
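
As a rough illustration of the pandas option, here is a minimal sketch that computes each car model's share of monthly sales. The sample data, the column names (`sale_date`, `car_model`, `units_sold`), and the helper `monthly_distribution` are all hypothetical and not taken from the original post; adapt them to your own schema.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
sales = pd.DataFrame(
    {
        "sale_date": pd.to_datetime(
            [
                "2023-01-05", "2023-01-20", "2023-01-25",
                "2023-02-03", "2023-02-14", "2023-02-28",
            ]
        ),
        "car_model": ["A", "A", "B", "A", "B", "B"],
        "units_sold": [10, 20, 30, 5, 15, 30],
    }
)


def monthly_distribution(df):
    """Return each model's share of the units sold in each calendar month."""
    # Bucket every sale into its calendar month.
    month = df["sale_date"].dt.to_period("M").rename("month")
    # Total units per (month, model), pivoted so that models become columns.
    totals = (
        df.groupby([month, "car_model"])["units_sold"]
        .sum()
        .unstack(fill_value=0)
    )
    # Divide each row by its monthly total so the shares in a row sum to 1.
    return totals.div(totals.sum(axis=1), axis=0)


dist = monthly_distribution(sales)
print(dist)
```

Each row of `dist` is one month and each column is one car model, so a shift in the values from row to row indicates a change in the distribution. If the data lives in a Spark table, one possible approach is to do the aggregation in PySpark first and call `toPandas()` only on the small aggregated result.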

Anonymous
Not applicable

Hi @Sridhar Varanasi,

Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
