Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

alesventus
by Contributor
  • 5430 Views
  • 1 replies
  • 0 kudos

Effectively refresh Power BI report based on Delta Lake

Hi, I have several Power BI reports based on Delta Lake tables that are refreshed every 4 hours. The ETL process in Databricks is much cheaper than the refresh of these Power BI reports. My questions are: whether the approach described below is correct, and whether there i...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Current Approach Assessment
  • Power BI Import Mode: Importing all table data results in full dataset refreshes, driving up compute and data transfer costs during each refresh.
  • Delta Lake as Source: Databricks clusters are used for both ETL and respon...

turtleXturtle
by New Contributor II
  • 4349 Views
  • 1 replies
  • 2 kudos

Delta sharing speed

Hi - I am comparing the performance of Delta Shared tables, and the speed is 10x slower than when querying locally. Scenario: I am using a 2XS serverless SQL warehouse and have a table with 15M rows and 10 columns, using the below query: select date, co...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

Yes, the speed difference you are seeing when querying Delta Shared tables versus local Delta tables is expected due to the architectural nature of Delta Sharing and network constraints.
Why Delta Sharing Is Slower
When you query a standard Delta tab...
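
When query latency matters more than freshness, one common workaround (not necessarily what the full reply above recommends) is to materialize the shared table into a local Delta table on a schedule and point queries at the local copy. A minimal sketch, with placeholder catalog and table names:

# Sketch: copy a Delta-Shared table into a local table so queries stay on local storage
spark.sql("""
    CREATE OR REPLACE TABLE local_catalog.analytics.sales_local
    AS SELECT * FROM shared_catalog.provider_schema.sales
""")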

mv-rs
by New Contributor
  • 4427 Views
  • 1 replies
  • 0 kudos

Structured streaming not working with Serverless compute

Hi, I have a structured streaming process that works on a normal compute, but when attempting to run it on Serverless the pipeline fails and I'm met with the error seen in the image below. CONTEXT: I have a Git repo with two folders,...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The core answer is: Many users encounter failures in structured streaming pipelines when switching from Databricks normal (classic) compute to Serverless, especially when using read streams on Unity Catalog Delta tables with Change Data Feed (CDF) en...
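
For reference, a minimal sketch of the pattern being described, i.e. a read stream with Change Data Feed on a Unity Catalog Delta table. Table names and the checkpoint path are placeholders, and behavior on Serverless may differ from classic compute:

# Sketch: stream the change feed of a UC Delta table (assumes CDF is enabled on the source)
stream_df = (spark.readStream
    .option("readChangeFeed", "true")
    .table("main.my_schema.my_source_table"))

query = (stream_df.writeStream
    .option("checkpointLocation", "/Volumes/main/my_schema/checkpoints/cdf_demo")
    .toTable("main.my_schema.my_target_table"))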

Maatari
by New Contributor III
  • 3561 Views
  • 1 replies
  • 0 kudos

Chaining stateful Operator

I would like to do a groupBy followed by a join in Structured Streaming. I would read from two Delta tables in snapshot mode, i.e. the latest snapshot. My question is specifically about chaining the stateful operators. groupBy is update mode; chaining grou...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

When chaining stateful operators like groupBy (aggregation) and join in Spark Structured Streaming, there are specific rules about the output mode required for the overall query and the behavior of each operator.
Output Mode Requirements
The groupBy...
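
A minimal sketch of the documented pattern for chaining stateful operators, i.e. time-window aggregations joined on the window so the whole query can run in append mode. Table names, columns, and watermarks are placeholder assumptions, and this requires a runtime with multiple-stateful-operator support (Spark 3.5+ / recent DBR):

from pyspark.sql import functions as F

clicks = (spark.readStream.table("main.demo.clicks")
          .withWatermark("event_ts", "10 minutes"))
imps = (spark.readStream.table("main.demo.impressions")
        .withWatermark("event_ts", "10 minutes"))

# Stateful operator 1: windowed aggregations on each stream
click_agg = (clicks.groupBy(F.window("event_ts", "5 minutes"), "ad_id")
             .agg(F.count("*").alias("clicks")))
imp_agg = (imps.groupBy(F.window("event_ts", "5 minutes"), "ad_id")
           .agg(F.count("*").alias("impressions")))

# Stateful operator 2: join on the time window plus the key, which keeps append mode possible
joined = click_agg.join(imp_agg, ["window", "ad_id"])

query = (joined.writeStream
         .outputMode("append")
         .option("checkpointLocation", "/tmp/chained_stateful_ckpt")
         .toTable("main.demo.ctr_per_window"))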

jmeidam
by New Contributor
  • 4147 Views
  • 2 replies
  • 0 kudos

Displaying job-run progress when submitting jobs via databricks-sdk

When I run notebooks from within a notebook using `dbutils.notebook.run`, I see a nice progress table that updates automatically, showing the execution time, the status, and links to the notebook, and it is seamless. My goal now is to execute many notebook...

Latest Reply
Coffee77
Contributor III
  • 0 kudos

All good in @mark_ott's response. As a potential improvement, instead of using polling, I think it would be better to publish events to a bus (e.g. Azure Event Hub) from notebooks so that consumers could launch queries when receiving, processing and fi...
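
A minimal sketch of that idea, assuming the azure-eventhub Python package and a placeholder connection string: each notebook publishes a small status event that downstream consumers can react to instead of polling.

from azure.eventhub import EventHubProducerClient, EventData
import json

# Placeholder connection details (assumption, not from the thread)
producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUB_CONNECTION_STRING>",
    eventhub_name="job-status")

batch = producer.create_batch()
batch.add(EventData(json.dumps({"notebook": "ingest_customers", "status": "finished"})))
producer.send_batch(batch)
producer.close()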

1 More Replies
Maatari
by New Contributor III
  • 3762 Views
  • 1 replies
  • 0 kudos

Reading a partitioned Table in Spark Structured Streaming

Does the pre-partitioning of a Delta table have an influence on the number of "default" partitions of a DataFrame when reading the data? Put differently, using Spark Structured Streaming, when reading from a Delta table, is the number of DataFrame par...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Pre-partitioning of a Delta Table does not strictly determine the number of "default" DataFrame partitions when reading data with Spark Structured Streaming. Unlike Kafka, where each DataFrame partition maps one-to-one to a Kafka partition, Delta Lak...
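
If you want to check the actual micro-batch partitioning yourself, foreachBatch exposes a normal DataFrame whose partition count you can inspect. A small sketch with a placeholder table name and options:

# Sketch: observe how many partitions each micro-batch actually gets
def report_partitions(batch_df, batch_id):
    print(f"batch {batch_id}: {batch_df.rdd.getNumPartitions()} partitions, {batch_df.count()} rows")

query = (spark.readStream
         .option("maxFilesPerTrigger", 100)  # files per trigger tends to drive partitioning more than table partitioning
         .table("main.demo.partitioned_table")
         .writeStream
         .foreachBatch(report_partitions)
         .option("checkpointLocation", "/tmp/partition_probe_ckpt")
         .start())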

c-thiel
by New Contributor
  • 3708 Views
  • 1 replies
  • 0 kudos

APPLY INTO: high date instead of NULL for __END_AT

I really like the APPLY INTO function to keep track of changes and historize them as SCD2. However, I am a bit confused that current records get an __END_AT of NULL. Typically, __END_AT should be a high date (e.g. 9999-12-31) or similar, so that a poin...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The APPLY INTO function for SCD2 historization typically sets the __END_AT field of current records to NULL rather than a high date like 9999-12-31. This is by design and reflects that the record is still current and has no defined end date yet. Cur...
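
If downstream consumers need a high date anyway, one hedged workaround is a view that coalesces the NULL into 9999-12-31 without changing the underlying SCD2 table. Object and column names below are placeholders:

# Sketch: expose a high-date end timestamp on top of the SCD2 output
spark.sql("""
    CREATE OR REPLACE VIEW main.demo.dim_customer_scd2_v AS
    SELECT *,
           COALESCE(__END_AT, TIMESTAMP '9999-12-31') AS end_at_highdate
    FROM main.demo.dim_customer_scd2
""")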

NiraliGandhi
by New Contributor
  • 3889 Views
  • 1 replies
  • 0 kudos

Pyspark - alias is not applied in pivot if only one aggregation

This is inconsistent with the behavior when we perform aggregation on multiple columns, and it therefore hinders metadata-driven transformation. How can we request Databricks/PySpark to include this, and is there any known work arou...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

When using PySpark or Databricks to perform a pivot operation with only a single aggregation, you may notice that the alias is not applied as expected, leading to inconsistencies, especially when trying to automate or apply metadata-driven frameworks...
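
A minimal sketch of the behavior plus a rename-based workaround; the sample data and the "_total" suffix are made-up assumptions:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "A", 10.0), (1, "B", 5.0), (2, "A", 7.0)],
    ["id", "category", "amount"])

# Single aggregation: the alias "total" is dropped, columns are just the pivot values (A, B)
pivoted = df.groupBy("id").pivot("category").agg(F.sum("amount").alias("total"))

# Workaround: rename the pivoted columns afterwards so naming stays metadata-driven
renamed = pivoted.select(
    [F.col(c).alias(c if c == "id" else f"{c}_total") for c in pivoted.columns])
renamed.show()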

novytskyi
by New Contributor
  • 3738 Views
  • 1 replies
  • 0 kudos

Timeout for dbutils.jobs.taskValues.set(key, value)

I have a job that calls a notebook with the dbutils.jobs.taskValues.set(key, value) method and assigns around 20 parameters. When I run it, it works. But when I try to run 2 or more copies of the job with different parameters, it fails with an error on differen...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error you are encountering when running multiple simultaneous Databricks jobs using dbutils.jobs.taskValues.set(key, value) indicates a connection timeout issue to the Databricks backend API (connect timed out at ...us-central1.gcp.databricks.com...
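
Pending a root-cause fix, a simple retry wrapper around the call can paper over transient backend timeouts. The retry count and sleep below are arbitrary assumptions:

import time

# Sketch: retry taskValues.set on transient connection timeouts
def set_task_value_with_retry(key, value, attempts=3, wait_seconds=10):
    for attempt in range(1, attempts + 1):
        try:
            dbutils.jobs.taskValues.set(key, value)
            return
        except Exception as e:
            if attempt == attempts:
                raise
            print(f"taskValues.set failed (attempt {attempt}): {e}; retrying in {wait_seconds}s")
            time.sleep(wait_seconds)

set_task_value_with_retry("param_01", "some_value")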

SebastianCar28
by New Contributor
  • 3803 Views
  • 1 replies
  • 0 kudos

How to implement a data lifecycle when using ADLS

Hello everyone, nice to greet you. I have a question about the data lifecycle in ADLS. I know ADLS has its own rules, but they aren't working properly because I have two ADLS accounts: one for hot data and another for cool storage where the informati...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Yes, you can move data from your HOT ADLS account to a COOL ADLS account while handling Delta Lake log issues, but this requires special techniques due to the nature of Delta Lake’s transaction log. The problem stems from Delta tables’ dependency on ...
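
One hedged pattern for moving a Delta table between storage accounts while keeping a valid transaction log is a deep clone into an external location on the cool account. Catalog, schema, and path below are placeholders:

# Sketch: deep clone the hot table to an external location on the cool storage account
spark.sql("""
    CREATE OR REPLACE TABLE archive_catalog.history.events_2023
    DEEP CLONE main.prod.events_2023
    LOCATION 'abfss://archive@coolstorageaccount.dfs.core.windows.net/delta/events_2023'
""")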

SrinuM
by New Contributor III
  • 3862 Views
  • 1 replies
  • 0 kudos

Workspace Client dbutils issue

 host = "https://adb-xxxxxx.xx.azuredatabricks.net"token = "dapxxxxxxx"we are using databricksconnect from databricks.sdk import WorkspaceClientdbutil = WorkspaceClient(host=host,token=token).dbutilsfiles = dbutil.fs.ls("abfss://container-name@storag...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error where files and directories can be read at the root ADLS level but not at the blob/subdirectory level, combined with a "No file or directory exists on path" message, is frequently due to permission configuration, incorrect path usage, or ne...
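
For completeness, a sketch of listing a subdirectory with the fully qualified abfss URI via the SDK's dbutils; host, token, storage account, and container below are placeholders:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient(host="https://adb-xxxxxx.xx.azuredatabricks.net", token="<PAT>")

# Pass the full abfss URI down to the exact subdirectory being listed
files = w.dbutils.fs.ls(
    "abfss://container-name@storageaccount.dfs.core.windows.net/landing/2024/")
for f in files:
    print(f.path, f.size)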

Anshul_DBX
by New Contributor
  • 3773 Views
  • 1 replies
  • 0 kudos

Executing Stored Procedures/update in Federated SQL Server

I have federated an Azure SQL DB in my DBX workspace, but I am not able to run update commands or execute a stored procedure. Is this still not supported?

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Federated connections from Azure Databricks to Azure SQL DB via Lakehouse Federation currently only support read-only queries—meaning running update commands or executing stored procedures directly through the federated Unity Catalog interface is not...
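
As a hedged workaround outside Lakehouse Federation, stored procedures can usually be executed over a direct JDBC connection from a classic cluster. Server, credentials, and procedure name below are placeholders, and this will not work on compute modes that block access to the JVM gateway:

# Sketch: call a SQL Server stored procedure over plain JDBC (bypasses the read-only federation path)
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;encrypt=true"
conn = spark.sparkContext._gateway.jvm.java.sql.DriverManager.getConnection(
    jdbc_url, "sql_user", "<password>")
stmt = conn.prepareCall("{call dbo.refresh_daily_aggregates(?)}")
stmt.setString(1, "2024-01-31")
stmt.execute()
conn.close()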

Nidhig
by Contributor
  • 32 Views
  • 1 replies
  • 1 kudos
Latest Reply
Advika
Databricks Employee
  • 1 kudos

Hello @Nidhig! Unfortunately, you won’t be able to extend or reset your current lab time limit. Your Vocareum Lab access includes a total of 720 minutes (12 hours), and the timer continues to run whenever the lab environment is active, even if you’re...

pooja_bhumandla
by New Contributor III
  • 69 Views
  • 2 replies
  • 0 kudos

Best Practice for Updating Data Skipping Statistics for Additional Columns

Hi Community, I have a scenario where I've already calculated Delta statistics for the first 32 columns after enabling the data skipping property. Now, I need to include 10 more frequently used columns that were not part of the original 32. Goal: I want ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @pooja_bhumandla, updating either of the two options below does not automatically recompute statistics for existing data. Rather, it impacts the behavior of future statistics collection when adding or updating data in the table.
- delta.dataSkippingNumInd...
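
A sketch of the two property changes being discussed; the table name is a placeholder, and as noted above, existing files only pick up the new statistics when they are rewritten:

# Widen data-skipping statistics by positional column count
spark.sql("""
    ALTER TABLE main.demo.big_table
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '42')
""")

# Or list the exact columns to collect statistics for, regardless of position
spark.sql("""
    ALTER TABLE main.demo.big_table
    SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'col1,col2,col3')
""")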

1 More Replies
vamsi_simbus
by New Contributor III
  • 23 Views
  • 1 replies
  • 0 kudos

System tables for DLT Expectations Quality Metrics

Hi everyone, I’m working with Delta Live Tables (DLT) and using Expectations to track data quality, but I’m having trouble finding where the expectation quality metrics are stored in the DLT system tables. My questions are: Which specific system table(s...

Latest Reply
ManojkMohan
Honored Contributor II
  • 0 kudos

@vamsi_simbus DLT captures data quality metrics in specialized system tables known as “event” and “metrics” tables. Specifically, look in the following tables: LIVE.DLT_EVENT_LOG or LIVE.DLT_METRICS: These tables contain granular event logs and metric...
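
A hedged alternative for pulling expectation metrics is the event_log table-valued function against one of the pipeline's tables; the object name below is a placeholder:

# Sketch: read DLT expectation results from the pipeline event log
expectations = spark.sql("""
    SELECT timestamp,
           details:flow_progress.data_quality.expectations AS expectations_json
    FROM event_log(TABLE(main.demo.my_dlt_table))
    WHERE event_type = 'flow_progress'
      AND details:flow_progress.data_quality.expectations IS NOT NULL
""")
expectations.show(truncate=False)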

