Databricks

sridhar0109 · ‎02-15-2023

Hi All,

I'm working on creating a data quality dashboard. I've created a few rules, such as checking for nulls in a column, checking the data type of a column, and removing duplicates.

We follow the medallion architecture and apply these data quality checks to the bronze table, inserting only the rows that pass the checks mentioned above.

Now I want to track the distribution of a column over a period of time. For example, I have sales data for different car models and want to see the distribution of sales for each car model over a period of time.

Could you please suggest any out-of-the-box libraries that can achieve this?

Thanks!

Anonymous · ‎04-09-2023

@Sridhar Varanasi:

Here are a few options you might want to consider:

pandas: pandas is a popular library for data manipulation and analysis in Python. It provides tools for data cleaning, data wrangling, and data visualization, and has a number of built-in functions for analyzing data over time. You can use pandas to load your data into a dataframe and then use its built-in functions to calculate the distribution of a column over time.
seaborn: seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics. You can use seaborn to create different types of visualizations, including line charts, bar charts, and heatmaps, to track the distribution of a column over time.
Plotly: Plotly is a powerful data visualization library for creating interactive, web-based charts and dashboards. It has a wide range of chart types and customization options, and allows you to create complex visualizations that can be easily shared with others. You can use Plotly to create interactive line charts, scatter plots, and other types of visualizations that track the distribution of a column over time.
Apache Superset: Apache Superset is an open-source data visualization and exploration platform that allows you to create interactive dashboards and visualizations using a web-based interface. It supports a wide range of data sources and provides a number of built-in visualization types, including time-series charts, histograms, and scatter plots. You can use Apache Superset to create custom dashboards that track the distribution of a column over time.

These are just a few examples of the libraries available for creating data quality dashboards and tracking the distribution of a column over time. Depending on your specific requirements and the complexity of your data, you may need to use a combination of these libraries or other tools to achieve your desired results.
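As a rough illustration of the pandas option, here is a minimal sketch of computing each car model's share of sales per month and the month-over-month change in that share. The sample data, the column names (sale_date, car_model, units_sold), and the helper monthly_share are all hypothetical; in practice the frame would come from your own table, so treat this as a starting point rather than a definitive implementation.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration only.
sales = pd.DataFrame({
    "sale_date": pd.to_datetime([
        "2023-01-05", "2023-01-20", "2023-01-25",
        "2023-02-03", "2023-02-14", "2023-02-20", "2023-02-27",
    ]),
    "car_model": ["A", "B", "A", "A", "B", "B", "B"],
    "units_sold": [10, 30, 10, 5, 25, 10, 10],
})


def monthly_share(df: pd.DataFrame) -> pd.DataFrame:
    """Return each car model's fraction of total units sold per calendar month."""
    # Bucket each sale into its calendar month.
    month = df["sale_date"].dt.to_period("M").rename("month")
    # Sum units per (month, model), then pivot models into columns.
    totals = (
        df.groupby([month, "car_model"])["units_sold"]
        .sum()
        .unstack(fill_value=0)
    )
    # Normalize each row so the shares within a month sum to 1.
    return totals.div(totals.sum(axis=1), axis=0)


shares = monthly_share(sales)

# Month-over-month change in each model's share; the first month is NaN.
share_change = shares.diff()

print(shares)
print(share_change)
```

The resulting shares frame (one row per month, one column per model) can be passed to seaborn or Plotly for a line or stacked bar chart, and share_change can be compared against a threshold of your choosing to flag sudden shifts on the dashboard.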

Anonymous · ‎04-20-2023

Hi @Sridhar Varanasi

Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Tracking changes in data distribution by using pyspark