cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Databricks Vs ADLS for reporting.

Learnit
New Contributor II
Hi everyone,
 
I'm a business analyst currently facing a decision on how best to develop a report. I need to choose between using Databricks or Azure Data Lake Storage (ADLS) as the data source on the transformed data in csv or excel file format from databricks into ADLS. My main concern is about the capability to write SQL queries, especially since I've heard that it might not be possible with ADLS. Our IT team is suggesting we use ADLS for this task.
 
Could anyone share insights on:
The pros and cons of using Databricks compared to ADLS for report generation?
 If using Databricks, does the cluster need to be up all the time for report refreshing?
 What are the best practices for setting this up?
 
Any advice or experiences shared would be greatly appreciated!
 
Thanks in Advance
1 REPLY 1

Palash01
Valued Contributor

Hey @Learnit 

I'd be glad to help, thanks for posting your concern. To offer the most effective advice, I might need some additional context about your specific situation as looks like your use case is to create reports (dahboards) using local CSV/excel files which can be done in different way as well but as you've asked the question around ADLS and Databricks I will answer it first then will mention my personal alternates to this condition. I hope this can help you choose a best solution for your situation:

  1. SQL Capabilities:

    • ADLS: Azure Data Lake Storage (ADLS) is a cloud-based data storage which is designed to store large-scale heterogenous data you can imagine ADLS as a windows file explorer folder or google drive in layman terms. As said it is a storage service it doesn't directly support SQL queries, though you can use service such as Azure Synapse Analytics which is fully managed data warehousing service that seamlessly integrates with ADLS, offering SQL querying and advanced analytics capabilities.
    • Databricks SQL: If you're already using Databricks, leverage native SQL capabilities to query data stored in ADLS using the COPY INTO command or by creating a ETL in databricks. Now if you choose this option you can also use databricks sql dashboards which can be created using simple SQL and can be refreshed based on a schedule you provide. or you can use partner connect and connect a BI tool such as power BI, tableau etc. to generate reports.
  2. Data Source Choice:

    • ADLS: Suitable for storing large, semi-structured, or unstructured data that might not be ideal for traditional databases.
    • Databricks: More suited for structured data and interactive exploration, but can also accommodate larger datasets with distributed processing. Also, you can write a easy code which can load your local CSV files to databricks delta lake then you can use that table to query in sql and create dashboard.

Recommendations:

Looking at the post it looks like this is a request where in you will be using a local CSV and you want to create a report now looking at this I can think of a couple of ways:

  • I'll mention very basic one first use excel to create a report if the datasource is small and manageable and which does not need very frequent updates or miraculous charts.
  • If this is something which needs a refresh may be once a day and have less visibility across the team you can still keep your data locally in CSVs use powerBI, Tableau, Looker, and import data directly there and then use it to create reports, to refresh report you can update the CSV sheet and refresh the dashboard. In this way you can save some bucks on the compute which will be use to read CSV and write to a table and storage costs of ADLS. And of course if need arises you can use your same report later on you'll just need to replace your data source with databricks or synapse.
  • If you think this could be something in which you will receive data in hourly / daily manner and will be displayed across the org or have high visibility then you can consider storing those CSVs in ADLS then use Azure Synapse Analytics or Databricks (delta live tables) to create a ETL process which can load your CSV files into a table based on a schedule and then you can use databricks sql or as I mentioned partner connect in data bricks or synapse to connect to any BI tool. 

Again, these are just suggestion looking at what your mentioned in our post if you need more help please include the below answers as well as your concerns based on this post here we'll be happy to help:

  • How large and intricate are the transformed data files stored in ADLS?
  • How often do you need the report to be refreshed, and how quickly do you require updates to reflect?
  • Who will use these reports? Do they need interactive dashboards or static reports?
  • What level of familiarity with SQL and data engineering and BI tools exists within your team?

Best of Luck! 

Leave a like if this helps! Kudos,
Palash

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group