
How to enable/verify cloud fetch from PowerBI

Erik
Valued Contributor II

I benchmarked the Power BI Databricks connector against the Power BI Delta Lake reader on a dataset of 2.15 million rows. The Delta Lake reader took 20 seconds, while importing through the SQL compute endpoint took ~75 seconds.

When I look at the query profile in SQL compute, I see that 50 seconds are spent in the "Columnar To Row" step. This makes me rather suspicious, since I got the impression that an updated Power BI would take advantage of "cloud fetch", which creates files containing Apache Arrow batches, a columnar format. So why the conversion to rows? Maybe it is not actually using cloud fetch? Is there any way to verify that I am actually using cloud fetch, either in the Power BI logs or in the Databricks SQL compute endpoint web interface?

[Attached screenshots: query_statistics, query_profile_tree_view]

22 REPLIES

Erik
Valued Contributor II

Hey @Kaniz Fatma​, you are usually so fast to write that the community will probably help, and otherwise you will find someone at Databricks to help. But now it's been several days. Is everything OK with you?

Kaniz_Fatma
Community Manager

Hi @Erik Parmann​ , Thank you for your concern. Everything is fine with me. Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.

Anonymous
Not applicable

@Erik Parmann​  - We are looking for someone to help you. Thank you for your continued patience.

pichlerpa
New Contributor III

Hi everyone, we are facing exactly the same problem: result fetching takes far too long when connecting remotely via RStudio and ODBC for interactive workloads. We made sure to use the latest version of the Databricks ODBC connector together with the latest Databricks Runtime with Cloud Fetch enabled. Unfortunately, this had no effect. We tried:

  • scaling up the cluster
  • tweaking the connector (RowsFetchedPerBlock, UseNativeQuery, etc.)
  • using Databricks Connect with Apache Arrow as an alternative (this failed with an untraceable OOM error when transferring (collect) a Spark DataFrame to local R)

[Attached screenshot: Databricks Connect OOM]

pichlerpa
New Contributor III

Regarding databricks-connect: we were able to fix the OOM error by increasing the memory of the local Spark driver instance, which handles the remote communication and runs in the background:

library(sparklyr)

# Give the local Spark driver more memory
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "10G"

# Locate the Spark home that ships with databricks-connect
databricks_connect_spark_home <- system("databricks-connect get-spark-home", intern = TRUE)

sc <- spark_connect(
  method = "databricks",
  spark_home = databricks_connect_spark_home,
  config = conf
)

Kaniz_Fatma
Community Manager

Hi @Erik Parmann​ , Get started with Cloud Fetch by downloading and installing the latest ODBC driver. The feature is available in Databricks SQL and in interactive Databricks clusters deployed with Databricks Runtime 8.3 or higher, on both Azure Databricks and Amazon. We incorporated the Cloud Fetch mechanism in the latest version of the Simba ODBC driver, 2.6.17, and in the forthcoming Simba JDBC driver 2.6.18.

Source

Erik
Valued Contributor II

Thanks, but I have read that as well, which is why I am looking for a way to confirm that cloud fetch is actually working.

A Databricks representative said that if we are using an updated Power BI Desktop version (I am using "2.100.1401.0 64-bit (December 2021)"), then it includes an updated version of the ODBC driver, which should use cloud fetch. Source. Can you confirm whether this is right or wrong?

This is important for us because we have many users on Power BI, and it makes a big difference whether we just need to update their Power BI installation or have to install a custom ODBC driver.

gbrueckl
Contributor II

So I am on the latest version of Power BI Desktop, and if I go to ODBC drivers and check the Simba driver, it still shows 2.06.16.1019, which does not yet support cloud fetch.

gbrueckl
Contributor II

Just an update after re-installing Power BI Desktop (the downloadable version):

If you check under C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Spark ODBC Driver

you will see that it is actually a more recent version (2.6.18.1030), which should support cloud fetch.

I had manually installed an older version of the Simba Spark driver before, and I don't know which version Power BI was using then. I have now uninstalled that one, so Power BI can only use the most recent one it ships with.

Erik
Valued Contributor II

Thanks for the tip! I ventured into the powerbi folder (inside WindowsApps), and in the subfolder "bin\ODBC Drivers\Simba Spark ODBC Driver" I found the version by running "cat SparkODBC_sb64.dll | findstr Version". It printed "ProductVersion2.6.18.1030".
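As a rough illustration, the version comparison can be sketched in Python. This is a minimal sketch, assuming the 2.6.17 minimum quoted elsewhere in this thread; the function names `parse_driver_version` and `driver_supports_cloud_fetch` are hypothetical helpers, not part of any Databricks or Simba tooling. It only checks the version number and says nothing about whether cloud fetch was actually used for a given query.

```python
import re
from typing import Tuple

# Assumption: minimum Simba ODBC driver version with Cloud Fetch,
# as quoted in this thread (2.6.17).
MIN_CLOUD_FETCH_DRIVER = (2, 6, 17)


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted version (e.g. '2.6.18.1030') from text."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no dotted version found in {text!r}")
    # int() drops leading zeros, so '2.06.16.1019' becomes (2, 6, 16, 1019).
    return tuple(int(part) for part in match.group(0).split("."))


def driver_supports_cloud_fetch(text: str) -> bool:
    """True if the version found in text is at least 2.6.17."""
    return parse_driver_version(text) >= MIN_CLOUD_FETCH_DRIVER


if __name__ == "__main__":
    for raw in ("ProductVersion2.6.18.1030", "2.06.16.1019"):
        print(raw, "->", driver_supports_cloud_fetch(raw))
```

Both version strings in the usage example are the ones reported in this thread.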

So this *should* support cloudfetch, but I still see the odd performance characteristics as described above. So my question still stands ( @Piper Wilson​ ), is there any way to *confirm* that cloud fetch has been used? This really seems like a thing one should be able to see some traces of in the Query Profile inside databricks.

Kaniz_Fatma
Community Manager

Hi @Gerhard Brueckl​ and @Erik Parmann​ , The ODBC driver version 2.6.17 and above supports Cloud Fetch, a capability that fetches query results through the cloud storage set up in your Azure Databricks deployment.

To extract query results using this format, you need Databricks Runtime 8.3 or above.

Query results are uploaded to an internal DBFS storage location as arrow-serialized files of up to 20 MB. Azure Databricks generates and returns shared access signatures to the uploaded files when the driver sends fetch requests after query completion. The ODBC driver then uses the URLs to download the results directly from DBFS.

Cloud Fetch is only used for query results larger than 1 MB. Smaller results are retrieved directly from Azure Databricks.
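The conditions described so far can be summarized in a small Python sketch. It is only an illustration of the stated rules (driver 2.6.17 or above, Databricks Runtime 8.3 or above, result larger than 1 MB, files of up to 20 MB). The function names are hypothetical, and treating 1 MB as 2^20 bytes is an assumption. It does not query Databricks and cannot confirm what actually happened for a specific query.

```python
import math
from typing import Tuple

# Thresholds taken from the description above.
MIN_DRIVER = (2, 6, 17)            # Simba ODBC driver
MIN_RUNTIME = (8, 3)               # Databricks Runtime
MIN_RESULT_BYTES = 1 * 1024 ** 2   # assumption: 1 MB = 2**20 bytes
MAX_FILE_BYTES = 20 * 1024 ** 2    # result files are "up to 20 MB"


def cloud_fetch_expected(driver: Tuple[int, ...],
                         runtime: Tuple[int, ...],
                         result_bytes: int) -> bool:
    """True if, by the stated rules, Cloud Fetch should apply."""
    return (driver >= MIN_DRIVER
            and runtime >= MIN_RUNTIME
            and result_bytes > MIN_RESULT_BYTES)


def min_result_files(result_bytes: int) -> int:
    """Lower bound on the number of result files for a given result size."""
    return math.ceil(result_bytes / MAX_FILE_BYTES)


if __name__ == "__main__":
    size = 50 * 1024 ** 2  # a hypothetical 50 MB result
    print(cloud_fetch_expected((2, 6, 18, 1030), (8, 3), size))
    print(min_result_files(size))
```

The file count is only a lower bound, because files may be smaller than the 20 MB maximum.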

Azure Databricks automatically garbage-collects the accumulated files: they are marked for deletion after 24 hours and completely deleted after an additional 24 hours.

To learn more about the Cloud Fetch architecture, see How We Achieved High-bandwidth Connectivity With BI Tools.

Erik
Valued Contributor II

@Arjun Kaimaparambil Rajan​ can you maybe check the query with ID 01ecdb90-5d68-1f39-a597-c1ce377fab5a with

Start time: 2022-05-24 20:36:03.058 (UTC+2)

End time: 2022-05-24 20:37:37.461 (UTC+2)

?

arjun_kr
Contributor III

@Erik Parmann​  Yes. This query result fetch has cloud fetch enabled.

Anonymous
Not applicable

@Erik Parmann​ - Does @Gerhard Brueckl​'s answer help?
