Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Suppose I have a Delta Live Tables framework with 2 tables: Table 1 ingests from a JSON source, Table 2 reads from Table 1 and runs some transformations. In other words, the data flow is JSON source -> Table 1 -> Table 2. Now if I find some bugs in the...
Answering my own question: nowadays (February 2024) this can all be done via the UI. When viewing your DLT pipeline there is a "Select tables for refresh" button in the header. If you click this, you can select individual tables, and then in the botto...
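For automation, the same kind of selective refresh can also be triggered through the Pipelines REST API. A minimal sketch, assuming the refresh_selection and full_refresh_selection fields of the start-update endpoint and placeholder workspace details:

import requests

# Placeholders: substitute your own workspace URL, token, and pipeline ID.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"
PIPELINE_ID = "<pipeline-id>"

# Start an update that refreshes only Table 2 incrementally
# and fully refreshes Table 1 (table names are assumptions).
resp = requests.post(
    f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}/updates",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "refresh_selection": ["table_2"],
        "full_refresh_selection": ["table_1"],
    },
)
resp.raise_for_status()
print(resp.json())  # contains the update_id of the triggered run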
Hi Team, can we pass a Delta Live Table name dynamically (from a configuration file, instead of hardcoding the table name)? We would like to build a metadata-driven pipeline.
Is this post referring to Direct Publishing Mode? As we are multi-tenanted, we have to have a separate schema per client, which currently means a single pipeline per client. This is not cost-effective at all, so we are very much reliant on DPM. I believ...
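On the dynamic table name question above: a minimal sketch of the metadata-driven pattern, assuming a hypothetical pipeline configuration key (pipeline.table_list) set in the DLT pipeline settings and read with spark.conf.get; paths and names are placeholders:

import dlt

# Hypothetical configuration key; set e.g. pipeline.table_list = "orders,customers"
# in the DLT pipeline settings or via the Pipelines API.
table_names = spark.conf.get("pipeline.table_list", "orders").split(",")

def make_table(table_name):
    @dlt.table(name=f"bronze_{table_name}", comment=f"Bronze ingest of {table_name}")
    def _bronze():
        # Assumed landing-zone layout; adjust to your storage paths.
        return spark.read.json(f"/mnt/landing/{table_name}/")
    return _bronze

# Generate one DLT table per configured name (metadata-driven pattern).
for t in table_names:
    make_table(t)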
I'm facing an error in Delta Live Tables when I want to pivot a table. The code to replicate the error is the following:
import pandas as pd
import pyspark.sql.functions as F
pdf = pd.DataFrame({"A": ["foo", "foo", "f...
Hi, was this a specific design choice to not allow pivots in DLT? I'm under the impression they expect fixed table structures in DLT design for a reason, but I don't understand the reason. Conceptually, I understand the fixed structures make lineage...
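One workaround sketch (not an official recommendation): avoid pivot and build the fixed set of output columns with explicit conditional aggregation, so the schema is static and DLT can resolve it when analyzing the graph. The column names, pivot values, and upstream table below are assumptions:

import dlt
from pyspark.sql import functions as F

# Hypothetical source with columns A, B, C; the pivot values of B are known up front.
PIVOT_VALUES = ["one", "two"]

@dlt.table
def pivoted():
    df = dlt.read("source_table")  # assumed upstream DLT table
    aggs = [
        F.sum(F.when(F.col("B") == v, F.col("C"))).alias(f"B_{v}")
        for v in PIVOT_VALUES
    ]
    return df.groupBy("A").agg(*aggs)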
Hi Community, I have successfully run a job through the API but would need to be able to pass parameters (configuration) to the DLT workflow via the API. I have tried passing JSON in this format:{
"full_refresh": "true",
"configuration": [
...
You cannot pass parameters from a Databricks job to a DLT pipeline, at least not yet. You can see from the DLT REST API that there is no option for it to accept any parameters. But there is a workaround. With the assumption tha...
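The workaround above is truncated, but one common variant (a sketch, assuming the standard GET/PUT pipeline endpoints and a hypothetical run_date parameter) is to patch the pipeline's configuration map via the Pipelines API before starting an update, then read it in the notebook with spark.conf.get:

import requests

HOST = "https://<workspace-url>"       # placeholder
TOKEN = "<personal-access-token>"      # placeholder
PIPELINE_ID = "<pipeline-id>"          # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1) Fetch the current pipeline spec and merge in the new parameter.
spec = requests.get(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS).json()["spec"]
spec.setdefault("configuration", {})["run_date"] = "2024-02-01"   # hypothetical parameter

# 2) Write the spec back, then trigger an update.
requests.put(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS, json=spec).raise_for_status()
requests.post(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}/updates", headers=HEADERS, json={}).raise_for_status()

# Inside the DLT notebook the value would then be read as:
#   run_date = spark.conf.get("run_date", "1900-01-01")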
You’ve gotten familiar with Delta Live Tables (DLT) via the quickstart and getting started guide. Now it’s time to tackle creating a DLT data pipeline for your cloud storage, with one line of code. Here’s how it’ll look when you're starting:
CREATE OR ...
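The SQL one-liner above is cut off; a Python sketch of the same idea, assuming a JSON landing path and Auto Loader as the source (the path and table name are placeholders):

import dlt

@dlt.table(comment="Raw files ingested incrementally from cloud storage")
def raw_events():
    return (
        spark.readStream
        .format("cloudFiles")                       # Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/mnt/landing/events/")               # placeholder path
    )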
Hi MadelynM, how should we handle source file archival and data retention with DLT? Source file archival: once the data from a source file is loaded with the DLT Auto Loader, we want to move the source file from the source folder to an archival folder. How can we ...
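One hedged pattern for archival (an assumption, not a built-in DLT feature): run a separate scheduled job that moves files older than a retention window out of the landing folder; Auto Loader's checkpoint tracks what it has already ingested, so files removed from the source are simply no longer listed. Paths and the retention window below are placeholders, and FileInfo.modificationTime assumes a reasonably recent DBR:

import time

SOURCE_DIR = "dbfs:/mnt/landing/events/"    # placeholder
ARCHIVE_DIR = "dbfs:/mnt/archive/events/"   # placeholder
RETENTION_DAYS = 7

cutoff_ms = (time.time() - RETENTION_DAYS * 86400) * 1000

for f in dbutils.fs.ls(SOURCE_DIR):
    # FileInfo.modificationTime is milliseconds since epoch.
    if not f.isDir() and f.modificationTime < cutoff_ms:
        dbutils.fs.mv(f.path, ARCHIVE_DIR + f.name)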
Hello community :). I am currently implementing some pipelines using DLT. They are working great for my medallion architecture: landed JSON in bronze -> silver (using apply_changes), then materialized gold views on top. However, I am attempting to crea...
Is it possible to have custom upserts for streaming tables in Delta Live Tables? I'm getting the error:
pyspark.errors.exceptions.captured.AnalysisException: `blusmart_poc.information_schema.sessions` is not a Delta table.
Hello! I'm very new to working with Delta Live Tables and I'm having some issues. I'm trying to import a large amount of historical data into DLT. However, letting the DLT pipeline run forever doesn't work with the database we're trying to import from...
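Without the full thread it is hard to be specific, but if the source is a JDBC database, one way to keep the initial backfill bounded (a sketch with placeholder connection details and an assumed numeric key column) is a partitioned batch read rather than an open-ended stream:

import dlt

@dlt.table(comment="Historical backfill read in parallel JDBC partitions")
def historical_orders():
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://<host>:1433;database=<db>")  # placeholder
        .option("dbtable", "dbo.orders")                              # placeholder
        .option("user", "<user>")
        .option("password", "<password>")
        .option("partitionColumn", "order_id")   # assumed numeric column
        .option("lowerBound", "1")
        .option("upperBound", "100000000")
        .option("numPartitions", "32")
        .load()
    )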
I'm a little confused about how streaming works with DLT. My first question is: what is the difference in behavior if you set the pipeline mode to "Continuous" but in your notebook you don't use the "streaming" prefix on table statements, and simila...
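To make the distinction concrete: whether a DLT dataset is streaming is determined by how it is defined in code (a streaming read vs. a batch read), while the pipeline mode (Triggered vs. Continuous) only controls how often updates run. A minimal sketch with a placeholder source table:

import dlt

@dlt.table
def batch_style():
    # Batch read: recomputed on each pipeline update, regardless of pipeline mode.
    return spark.read.table("my_catalog.bronze.events")       # placeholder table

@dlt.table
def streaming_style():
    # Streaming read: DLT treats this as a streaming table and processes only new
    # data on each update; Continuous mode simply keeps updates running constantly.
    return spark.readStream.table("my_catalog.bronze.events")  # placeholder table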
Is it possible to have custom upserts in streaming tables in a Delta Live Tables pipeline? Use case: I am trying to maintain a valid session based on a timestamp column and want to upsert to the target table. Tried going through the documentation but dl...
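DLT does not let you run arbitrary MERGE statements against a streaming table it manages; the built-in upsert mechanism is APPLY CHANGES. A sketch of the Python form, assuming a source view named sessions_updates, a session_id key, and an event_timestamp column to sequence by:

import dlt

# Target streaming table that APPLY CHANGES will maintain.
dlt.create_streaming_table("sessions_current")

dlt.apply_changes(
    target="sessions_current",
    source="sessions_updates",        # assumed upstream streaming view/table
    keys=["session_id"],              # assumed business key
    sequence_by="event_timestamp",    # latest record per key wins
    stored_as_scd_type=1,             # keep only the current row per key
)

If genuinely custom MERGE logic is needed, the usual alternative is to run that step outside DLT, for example with foreachBatch writing into a regular Delta table.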
How to leverage Change Data Capture (CDC) from your databases to Databricks. Change Data Capture allows you to ingest and process only changed records from database systems to dramatically reduce data processing costs and enable real-time use cases suc...
What is the difference between Databricks Auto Loader and Delta Live Tables? Both seem to manage ETL for you, but I'm confused about where to use one vs. the other.
Hi, we are in the process of moving our data warehouse from SQL Server to Databricks. We are testing our Dimension Product table, which has an identity column referenced in the fact table as a surrogate key. In Databricks Apply Changes SCD Type 2 ...
Hey. Yep, xxhash64 (or even just hash) generates numerical values for you. Combine it with the abs function to ensure the value is positive. In our team we used abs(hash()) ourselves... for maybe a day. Very quickly I observed a collision, and the data s...
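To make the collision point concrete: hash returns a 32-bit value, so collisions show up quickly at warehouse scale, while xxhash64 returns a 64-bit value. A small sketch with assumed business-key columns:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("P-1001", "EU"), ("P-2002", "US")],
    ["product_code", "region"],          # assumed business key columns
)

keyed = df.withColumn(
    "product_sk",
    # 64-bit hash over the concatenated business key; far fewer collisions than 32-bit hash().
    F.xxhash64(F.concat_ws("||", "product_code", "region")),
)
keyed.show()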
Hello everyone! I want to ingest tables with schemas from an on-premises SQL Server to the Databricks Bronze layer with Delta Live Tables, and I want to do it using Azure Data Factory. I want the load to be a snapshot batch load, not an incremental lo...
Hi All, I recently published a streaming data comparison between Snowflake and Databricks. Hope you enjoy! Please let me know what you think! https://medium.com/@24chynoweth/data-streaming-at-scale-databricks-and-snowflake-ca65a2401649
I have a workspace in GCP that's reading from a delta-shared dataset hosted in S3. When trying to run a very basic DLT pipeline, I'm getting the below error. Any help would be awesome! Code:
import dlt

@dlt.table
def fn():
    return (spark.readStr...
@Charlie You: The error message you're encountering suggests a timeout issue when reading from the Delta-shared dataset hosted in S3. There are a few potential reasons and solutions you can explore: Network connectivity: verify that the network conne...
I have the following code:
%pip install dbl-tempo
from pyspark.sql.functions import *
from tempo import TSDF

# interpolate target_cols columns linearly for a tempo TSDF
def interpolate_tsdf(tsdf_data, target_c...
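For context, a sketch of how dbl-tempo interpolation is typically wired up, assuming tempo's TSDF constructor and an interpolate method that accepts freq, func, target_cols, and method (treat the exact signature as an assumption and check the dbl-tempo docs); the data and column names are made up:

from pyspark.sql import functions as F
from tempo import TSDF

# Hypothetical input: one reading per sensor with gaps in the timeline.
df = spark.createDataFrame(
    [("s1", "2024-01-01 00:00:00", 1.0),
     ("s1", "2024-01-01 00:02:00", 3.0)],
    ["sensor_id", "event_ts", "value"],
).withColumn("event_ts", F.to_timestamp("event_ts"))

tsdf = TSDF(df, ts_col="event_ts", partition_cols=["sensor_id"])

# Assumed interpolate parameters: resample to 1-minute buckets and fill gaps
# in `value` linearly.
interpolated = tsdf.interpolate(
    freq="1 minute",
    func="mean",
    target_cols=["value"],
    method="linear",
)
interpolated.df.show()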