Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Volker
by Contributor
  • 5109 Views
  • 2 replies
  • 0 kudos

Structured Streaming schemaTrackingLocation does not work with starting_version

Hello Community, I came across a strange behaviour when using structured streaming on top of a delta table. I have a stream that I wanted to start from a specific version of a delta table using option("starting_version", x) because I did no...

Data Engineering
Delta Lake
schemaTrackingLocation
starting_version
structured streaming
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

This issue is related to how Delta Lake’s structured streaming interacts with schema evolution and options like startingVersion and schemaTrackingLocation. The behavior you've observed has been noted by other users, and can be subtle due to how check...
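The two reader options at issue can be sketched as a minimal configuration. This is a hedged sketch: the version number and paths are placeholders, the camelCase `startingVersion` spelling is the one shown in the Delta streaming reader docs, and the actual read is shown only as a comment.

```python
# Hedged sketch of the Delta streaming reader options discussed above.
# The version number and paths are illustrative placeholders.
stream_options = {
    # Start the stream from a specific table version rather than the latest snapshot.
    "startingVersion": "12",
    # Location where the reader tracks schema history across column renames/drops;
    # recent runtimes expect this to sit under the stream's checkpoint directory.
    "schemaTrackingLocation": "/checkpoints/my_stream/_schema_log",
}

# In a real job:
# spark.readStream.format("delta").options(**stream_options).load("/tables/events")
```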

1 More Replies
stevenayers-bge
by Contributor
  • 4187 Views
  • 2 replies
  • 1 kudos

Querying Unity Managed Tables from Redshift

I built a script about 6 months ago to make our Delta Tables accessible in Redshift for another team, but it's a bit nasty... Generate a Delta Lake manifest each time the Databricks Delta table is updated; recreate the Redshift external table (in case th...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

There is indeed a better and more integrated way to make Delta Lake tables accessible in Redshift without manually generating manifests and dynamically creating external tables or partitions. Some important points and options: Databricks Delta Lake ...
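For reference, the manifest-based approach the question describes boils down to one documented Delta SQL command. A minimal sketch (the table name is a placeholder; the command is only built here, not executed):

```python
# Hedged sketch: GENERATE symlink_format_manifest is the documented way to
# produce a manifest for external readers such as Redshift Spectrum.
def manifest_sql(table_name: str) -> str:
    # Delta SQL command that (re)writes the _symlink_format_manifest files for the table.
    return f"GENERATE symlink_format_manifest FOR TABLE {table_name}"

sql = manifest_sql("analytics.delta_events")
# In Databricks: spark.sql(sql)
```

Automating this after every table update is exactly the fragility the reply points at, which is why a more integrated sharing path is preferable.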

1 More Replies
Mangeysh
by New Contributor
  • 3817 Views
  • 2 replies
  • 0 kudos

Azure Databricks API for JSON output, displaying on UI

Hello All, I am new to Azure Databricks and trying to show Azure Databricks table data on a UI using React JS. Let's say there are 2 tables, Employee and Salary; I need to join these two tables on empid, generate JSON output, and call an API (end ...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The most effective way to display joined data from Azure Databricks tables (like Employee and Salary) in a React JS UI involves exposing your Databricks data through an API and then consuming that API in your frontend. Flask can work, but there are b...
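Independent of which API layer is chosen, the payload shape is a join keyed on empid serialized as JSON. A Databricks-free illustration (the empid/name/salary fields are assumed for the example; in practice the join would run as SQL in Databricks and an API layer such as Flask, FastAPI, or the SQL Statement Execution API would return the result to React):

```python
import json

# Hypothetical sample rows standing in for the Employee and Salary tables.
employees = [{"empid": 1, "name": "Ada"}, {"empid": 2, "name": "Lin"}]
salaries = [{"empid": 1, "salary": 90000}, {"empid": 2, "salary": 80000}]

def join_to_json(emps, sals):
    # Index salaries by empid, then attach each salary to its employee row.
    salary_by_id = {s["empid"]: s["salary"] for s in sals}
    joined = [{**e, "salary": salary_by_id.get(e["empid"])} for e in emps]
    return json.dumps(joined)
```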

1 More Replies
rvo19941
by New Contributor II
  • 4442 Views
  • 2 replies
  • 0 kudos

Auto Loader File Notification Mode not working with ADLS Gen2 and files written as a stream

Dear, I am working on a real-time use case and am therefore using Auto Loader with file notification to ingest JSON files from a Gen2 Azure Storage Account in real time. Full refreshes of my table work fine but I noticed Auto Loader was not picking up...

Data Engineering
ADLS
Auto Loader
Event Subscription
File Notification
Queue Storage
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Auto Loader file notification in Databricks relies on Azure Event Grid’s BlobCreated event to trigger notifications for newly created files in Azure Data Lake Gen2. The issue you’re experiencing is a known limitation when files are written via certai...
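A common mitigation is to keep notification mode but add a periodic backfill listing as a safety net for events the queue never receives. A hedged sketch of the relevant Auto Loader options (option names follow the Auto Loader docs; all values are placeholders):

```python
# Hedged sketch: Auto Loader in file-notification mode with a backfill
# interval to catch files whose BlobCreated event was missed (e.g. blobs
# written as streams). All values are illustrative placeholders.
autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",   # Event Grid + queue instead of directory listing
    "cloudFiles.subscriptionId": "<azure-subscription-id>",
    "cloudFiles.resourceGroup": "<resource-group>",
    "cloudFiles.backfillInterval": "1 day",  # periodically list the path to pick up missed files
}

# In a real job:
# spark.readStream.format("cloudFiles").options(**autoloader_options).load("abfss://...")
```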

1 More Replies
achntrl
by New Contributor
  • 5034 Views
  • 1 reply
  • 0 kudos

CI/CD - Databricks Asset Bundles - Deploy/destroy only bundles with changes after Merge Request

Hello everyone, we're in the process of migrating to Databricks and are encountering challenges implementing CI/CD using Databricks Asset Bundles. Our monorepo houses multiple independent bundles within a "dabs" directory, with only one team member wo...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your challenge—reliably determining the subset of changed Databricks Asset Bundles after a Merge Request (MR) is merged into main for focused deploy/destroy CI/CD actions—is common in complex monorepo, multi-environment setups. Let’s break down the p...
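The change-detection step can be reduced to mapping the output of `git diff --name-only <base>...HEAD` onto bundle directories. A hedged sketch (the "dabs" root comes from the question; everything else is assumed):

```python
from pathlib import PurePosixPath

def changed_bundles(changed_files, root="dabs"):
    # Collect the bundle directory (first path component under `root`)
    # for every changed file that lives inside the bundles tree.
    bundles = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if len(parts) >= 2 and parts[0] == root:
            bundles.add(parts[1])
    return sorted(bundles)

# changed_bundles(["dabs/team_a/databricks.yml", "dabs/team_b/src/job.py", "README.md"])
# -> ["team_a", "team_b"]
```

In CI, each returned bundle name would drive a targeted `databricks bundle deploy` (or `destroy` for bundles whose directory was deleted).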

alesventus
by Contributor
  • 5515 Views
  • 1 reply
  • 0 kudos

Effectively refresh Power BI report based on Delta Lake

Hi, I have several Power BI reports based on Delta Lake tables that are refreshed every 4 hours. The ETL process in Databricks is much cheaper than the refresh of these Power BI reports. My questions are: if the approach described below is correct and if there i...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Current Approach Assessment Power BI Import Mode: Importing all table data results in full dataset refreshes, driving up compute and data transfer costs during each refresh. Delta Lake as Source: Databricks clusters are used for both ETL and respon...

turtleXturtle
by New Contributor II
  • 4445 Views
  • 1 reply
  • 2 kudos

Delta sharing speed

Hi - I am comparing the performance of delta shared tables and the speed is 10X slower than when querying locally. Scenario: I am using a 2XS serverless SQL warehouse, and have a table with 15M rows and 10 columns, using the below query: select date, co...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

Yes, the speed difference you are seeing when querying Delta Shared tables versus local Delta tables is expected due to the architectural nature of Delta Sharing and network constraints. Why Delta Sharing Is Slower When you query a standard Delta tab...

mv-rs
by New Contributor
  • 4518 Views
  • 1 reply
  • 0 kudos

Structured streaming not working with Serverless compute

Hi, I have a structured streaming process that works with a normal compute, but when attempting to run using Serverless, the pipeline fails and I'm met with the error seen in the image below. CONTEXT: I have a Git repo with two folders,...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The core answer is: Many users encounter failures in structured streaming pipelines when switching from Databricks normal (classic) compute to Serverless, especially when using read streams on Unity Catalog Delta tables with Change Data Feed (CDF) en...

Maatari
by New Contributor III
  • 3600 Views
  • 1 reply
  • 0 kudos

Chaining stateful Operator

I would like to do a groupBy followed by a join in structured streaming. I would read from two delta tables in snapshot mode, i.e. latest snapshot. My question is specifically about chaining the stateful operators. groupBy is update mode; chaining grou...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

When chaining stateful operators like groupBy (aggregation) and join in Spark Structured Streaming, there are specific rules about the output mode required for the overall query and the behavior of each operator. Output Mode Requirements The groupBy...

jmeidam
by New Contributor
  • 4210 Views
  • 2 replies
  • 0 kudos

Displaying job-run progress when submitting jobs via databricks-sdk

When I run notebooks from within a notebook using `dbutils.notebook.run`, I see a nice progress table that updates automatically, showing the execution time, the status, and links to the notebook, and it is seamless. My goal now is to execute many notebook...

Latest Reply
Coffee77
Contributor III
  • 0 kudos

All good in @mark_ott's response. As a potential improvement, instead of using polling, I think it would be better to publish events to a bus (e.g. Azure Event Hub) from notebooks so that consumers could launch queries when receiving, processing and fi...
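For comparison, the polling baseline this reply argues against can be reduced to a small helper. A hedged sketch: `fetch_state` is any callable returning the run's current life-cycle state; with the Databricks SDK it could wrap `WorkspaceClient().jobs.get_run(run_id)`, but nothing Databricks-specific is assumed here.

```python
import time

def wait_for_run(fetch_state,
                 done_states=("TERMINATED", "SKIPPED", "INTERNAL_ERROR"),
                 poll_seconds=0.01, timeout_seconds=5.0):
    # Poll fetch_state until it reports a terminal life-cycle state or we time out.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = fetch_state()
        if state in done_states:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("run did not reach a terminal state in time")
```

An event-driven design replaces this loop with a consumer reacting to published run events, trading the polling delay and API load for bus infrastructure.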

1 More Replies
Maatari
by New Contributor III
  • 3816 Views
  • 1 reply
  • 0 kudos

Reading a partitioned Table in Spark Structured Streaming

Does the pre-partitioning of a Delta Table have an influence on the number of "default" partitions of a Dataframe when reading the data? Put differently, using Spark Structured Streaming, when reading from a delta table, is the number of Dataframe par...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Pre-partitioning of a Delta Table does not strictly determine the number of "default" DataFrame partitions when reading data with Spark Structured Streaming. Unlike Kafka, where each DataFrame partition maps one-to-one to a Kafka partition, Delta Lak...

c-thiel
by New Contributor
  • 3759 Views
  • 1 reply
  • 0 kudos

APPLY INTO Highdate instead of NULL for __END_AT

I really like the APPLY INTO function to keep track of changes and historize them in SCD2. However, I am a bit confused that current records get an __END_AT of NULL. Typically, __END_AT should be a high date (i.e. 9999-12-31) or similar, so that a poin...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The APPLY INTO function for SCD2 historization typically sets the __END_AT field of current records to NULL rather than a high date like 9999-12-31. This is by design and reflects that the record is still current and has no defined end date yet. Cur...
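Point-in-time lookups still work with NULL end dates by treating NULL as an open-ended interval, i.e. the `COALESCE(__END_AT, '9999-12-31')` pattern. A hedged, Spark-free illustration (the row fields are assumed for the example):

```python
from datetime import date

def rows_as_of(rows, as_of):
    # A row is valid at `as_of` if it started on or before that date and
    # either has no end date yet (None = still current) or ends after it.
    return [
        r for r in rows
        if r["__START_AT"] <= as_of and (r["__END_AT"] is None or as_of < r["__END_AT"])
    ]

# Hypothetical SCD2 history for one entity.
history = [
    {"id": 1, "city": "Oslo",   "__START_AT": date(2023, 1, 1), "__END_AT": date(2024, 1, 1)},
    {"id": 1, "city": "Bergen", "__START_AT": date(2024, 1, 1), "__END_AT": None},
]
```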

NiraliGandhi
by New Contributor
  • 3956 Views
  • 1 reply
  • 0 kudos

Pyspark - alias is not applied in pivot if only one aggregation

This is not making it consistent when we perform aggregation on multiple columns, and thus it is hindering metadata-driven transformation because of the inconsistency. How can we request Databricks/PySpark to include this? And is there any known work arou...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

When using PySpark or Databricks to perform a pivot operation with only a single aggregation, you may notice that the alias is not applied as expected, leading to inconsistencies, especially when trying to automate or apply metadata-driven frameworks...

novytskyi
by New Contributor
  • 3783 Views
  • 1 reply
  • 0 kudos

Timeout for dbutils.jobs.taskValues.set(key, value)

I have a job that calls a notebook with the dbutils.jobs.taskValues.set(key, value) method and assigns around 20 parameters. When I run it, it works. But when I try to call 2 or more copies of the job with different parameters, it fails with an error on differen...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error you are encountering when running multiple simultaneous Databricks jobs using dbutils.jobs.taskValues.set(key, value) indicates a connection timeout issue to the Databricks backend API (connect timed out at ...us-central1.gcp.databricks.com...

SebastianCar28
by New Contributor
  • 3853 Views
  • 1 reply
  • 0 kudos

How to implement Lifecycle of Data When Use ADLS

Hello everyone, nice to greet you. I have a question about the data lifecycle in ADLS. I know ADLS has its own rules, but they aren't working properly because I have two ADLS accounts: one for hot data and another for cool storage where the informati...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Yes, you can move data from your HOT ADLS account to a COOL ADLS account while handling Delta Lake log issues, but this requires special techniques due to the nature of Delta Lake’s transaction log. The problem stems from Delta tables’ dependency on ...
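One documented technique for this is DEEP CLONE, which copies both the data files and the transaction log, sidestepping the stale-log problem of copying files directly between accounts. A hedged sketch that only builds the SQL string (table and storage names are placeholders):

```python
# Hedged sketch: DEEP CLONE copies data files plus the Delta transaction
# log to the target location. Names and the abfss path are placeholders.
def deep_clone_sql(source_table: str, target_table: str, target_location: str) -> str:
    return (
        f"CREATE OR REPLACE TABLE {target_table} "
        f"DEEP CLONE {source_table} "
        f"LOCATION '{target_location}'"
    )

sql = deep_clone_sql("hot.events", "cool.events",
                     "abfss://archive@coolaccount.dfs.core.windows.net/events")
# In Databricks: spark.sql(sql)
```

After the clone succeeds, the storage account's lifecycle rules (or a VACUUM plus drop on the hot copy) can retire the hot data.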

