Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

HoussemBL
by New Contributor II
  • 21 Views
  • 1 reply
  • 0 kudos

Impact of deleting workspace on associated catalogs

Hello Community, I have a specific scenario regarding Unity Catalog and workspace deletion that I'd like to clarify. Current setup: two Databricks workspaces (W1 and W2); a single Unity Catalog instance; Catalog1 created in W1, shared and accessible in W2; Cata...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @HoussemBL, when you delete a Databricks workspace, it does not directly impact Unity Catalog or the data within it. Unity Catalog is a separate entity that manages data access and governance across multiple workspaces. Here's what happens in ...

minhhung0507
by New Contributor II
  • 66 Views
  • 2 replies
  • 0 kudos

Handling Dropped Records in Delta Live Tables with Watermark - Need Optimization Strategy

Hi Databricks Community, I'm encountering an issue with watermarks in Delta Live Tables that's causing data loss in my streaming pipeline. Let me explain my specific problem. Current situation: I've implemented watermarks for stateful processing in my De...

Latest Reply
minhhung0507
New Contributor II
  • 0 kudos

Dear @Walter_C, thank you for your detailed response regarding watermark handling in Delta Live Tables (DLT). I appreciate the guidance provided, but I would like further clarification on a couple of points related to our use case. 1. Auto-Saving Dro...
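For readers following along, a minimal sketch of the pattern under discussion, assuming a hypothetical upstream table, event-time column, and a 10-minute watermark (none of these names come from the thread). Rows arriving later than the watermark are dropped by the stateful operator and are not persisted anywhere by default:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="orders_by_window")
def orders_by_window():
    return (
        dlt.read_stream("orders_raw")                 # hypothetical upstream streaming table
        .withWatermark("event_time", "10 minutes")    # events later than 10 minutes are dropped, not saved
        .groupBy(F.window("event_time", "5 minutes"), "customer_id")
        .agg(F.count("*").alias("order_count"))
    )
```

Capturing the dropped rows is not something the watermark does on its own; a separate flow that filters on the same timestamp bound is one workaround people use.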

1 More Reply
rt-slowth
by Contributor
  • 1092 Views
  • 1 reply
  • 0 kudos

how to use dlt module in streaming pipeline

If anyone has example code for building a live streaming pipeline over CDC data generated by AWS DMS using import dlt, I'd love to see it. I'm currently able to see the parquet file starting with Load on the first full load to S3 and the CDC parquet file after ...

Latest Reply
cgrant
Databricks Employee
  • 0 kudos

There is a blog post for this that includes example code, which you can find here.
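Since only the link survives in this listing, here is a hedged sketch of the general shape such a pipeline usually takes (bucket path, key, and ordering column are placeholders, not taken from the blog post):

```python
import dlt
from pyspark.sql import functions as F

# Bronze: ingest the full-load and CDC parquet files that DMS writes to S3.
@dlt.table(name="customers_cdc_raw")
def customers_cdc_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("s3://my-bucket/dms/public/customers/")   # placeholder DMS output path
    )

# Silver: apply the change events to a streaming target table.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_raw",
    keys=["customer_id"],                    # placeholder primary key
    sequence_by=F.col("dms_timestamp"),      # placeholder ordering column from the DMS output
    apply_as_deletes=F.expr("Op = 'D'"),     # DMS marks deletes with Op = 'D' when the Op column is enabled
    except_column_list=["Op", "dms_timestamp"],
)
```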

GS_S
by New Contributor
  • 159 Views
  • 7 replies
  • 0 kudos

Resolved! Error during merge operation: 'NoneType' object has no attribute 'collect'

Why does merge.collect() not return results in access mode: SINGLE_USER, but it does in USER_ISOLATION? I need to log the affected rows (inserted and updated) and can’t find a simple way to get this data in SINGLE_USER mode. Is there a solution or an...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

15.4 does not directly require serverless, but for fine-grained access control it does require it when running on Single User compute, as mentioned: "This data filtering is performed behind the scenes using serverless compute." In terms of costs: customers are charged for ...
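As an aside on the row-count logging part of the question, one commonly used alternative (a sketch with a placeholder table name, not something stated in this thread) is to read the operation metrics that Delta records for the MERGE commit from the table history:

```python
from delta.tables import DeltaTable

# After the MERGE completes, the latest history entry carries its operation metrics.
# `spark` is the ambient session in a Databricks notebook or job.
metrics = (
    DeltaTable.forName(spark, "catalog.schema.target_table")   # placeholder table name
    .history(1)
    .collect()[0]["operationMetrics"]
)

print("rows inserted:", metrics.get("numTargetRowsInserted"))
print("rows updated:", metrics.get("numTargetRowsUpdated"))
```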

6 More Replies
htu
by New Contributor III
  • 5606 Views
  • 13 replies
  • 20 kudos

Installing Databricks Connect breaks pyspark local cluster mode

Hi, it seems that when databricks-connect is installed, pyspark is modified at the same time so that it no longer works with a local master node. Local mode has been especially useful in testing, when running unit tests for spark-related code without any remot...

Latest Reply
mslow
New Contributor
  • 20 kudos

I think if you're deliberately installing databricks-connect, then you need to handle the local Spark session creation yourself. My issue is that I'm using the databricks-dlt package, which installs databricks-connect as a dependency. In the latest package ver...
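One hedged way to "handle the local spark session creation" mentioned above (a sketch; whether the local fallback works still depends on having a compatible pyspark available alongside databricks-connect):

```python
from pyspark.sql import SparkSession

def get_spark():
    """Prefer a Databricks Connect session when configured; fall back to local mode for unit tests."""
    try:
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except Exception:
        # Plain local session for unit tests with no remote cluster.
        return SparkSession.builder.master("local[*]").appName("unit-tests").getOrCreate()
```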

12 More Replies
stadelmannkevin
by New Contributor
  • 168 Views
  • 4 replies
  • 2 kudos

init_script breaks Notebooks

Hi everyone, we would like to use our private company Python repository for installing Python libraries with pip install. To achieve this, I created a simple script which sets the index-url configuration of pip to our private repo. I set this script as a...

Latest Reply
Walter_C
Databricks Employee
  • 2 kudos

Did you also try cloning the cluster or using another cluster for the testing? The metastore-down error is normally a Hive Metastore issue and should not have an impact here, but you could check the log4j output under Driver logs for more details on the error.

3 More Replies
shusharin_anton
by New Contributor
  • 87 Views
  • 1 reply
  • 1 kudos

Resolved! Sort after update on DWH

Running this query on a serverless DWH: UPDATE catalog.schema.table SET col_tmp = CAST(col AS DECIMAL(30, 15)). In query profiling, it has some sort and shuffle stages in the graph. The table is partitioned by the partition_date column. Some details in the sort node mention that so...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @shusharin_anton, The sort and shuffle stages in your query profile are likely triggered by the need to redistribute and order the data based on the partition_date column. This behavior can be attributed to the way Spark handles data partitioning ...

JothyGanesan
by New Contributor II
  • 66 Views
  • 2 replies
  • 0 kudos

DLT Merge tables into Delta

We are trying to load a Delta table from streaming tables using DLT. This target table needs a MERGE of 3 source tables, but when we use the DLT command with merge, it says merge is not supported. Is this related to the DLT version? Please help u...

Latest Reply
JothyGanesan
New Contributor II
  • 0 kudos

@Alberto_Umana Thank you for the quick reply. But how are we to use the above? This looks like Structured Streaming with CDF mode. Currently, with our tables in Unity Catalog, finding the start version and end version is taking a huge amount of time, as the ta...
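For what it's worth, a hedged sketch of the apply_changes route that usually replaces MERGE inside DLT (table names, key, and ordering column are placeholders; the three sources are unioned into one change feed first):

```python
import dlt
from pyspark.sql import functions as F

# Combine the three source streaming tables into a single change feed.
@dlt.view(name="combined_changes")
def combined_changes():
    return (
        dlt.read_stream("source_a")
        .unionByName(dlt.read_stream("source_b"), allowMissingColumns=True)
        .unionByName(dlt.read_stream("source_c"), allowMissingColumns=True)
    )

# MERGE-like upsert into the target, expressed as apply_changes.
dlt.create_streaming_table("merged_target")

dlt.apply_changes(
    target="merged_target",
    source="combined_changes",
    keys=["id"],                       # placeholder business key
    sequence_by=F.col("updated_at"),   # placeholder ordering column
)
```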

1 More Reply
Mcnamara
by New Contributor
  • 107 Views
  • 1 reply
  • 0 kudos

Pyspark and SQL Warehouse

If I write PySpark code and I need to get the data into Power BI, will it be possible to merge the data into one semantic model? For instance, the pipeline was developed using SQL, so it's directly compatible with SQL Warehouse.

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Yes, it is possible to merge data into one semantic model in Power BI when using PySpark code to get data. Databricks supports integration with Power BI, allowing you to create a unified semantic model. You can develop your data pipeline using PySpar...

infinitylearnin
by New Contributor II
  • 72 Views
  • 0 replies
  • 0 kudos

Learn Data Engineering on Databricks Step By Step.

They say we should build bridges along the paths we've already traveled, making it easier for others to follow. Learning Data Engineering has often been a confusing journey for many, especially when it comes to figuring out where to start. I faced thi...

soumiknow
by New Contributor III
  • 266 Views
  • 8 replies
  • 0 kudos

BQ partition data deleted fully even though 'spark.sql.sources.partitionOverwriteMode' is DYNAMIC

We have a date-partitioned (DD/MM/YYYY) BQ table. We want to update a specific partition's data in 'overwrite' mode using PySpark. To do this, I set 'spark.sql.sources.partitionOverwriteMode' to 'DYNAMIC' as per the Spark BQ connector documentat...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

I reviewed this with a Spark resource; it seems that for this the indirect write method will be required. You can follow the information in https://github.com/GoogleCloudDataproc/spark-bigquery-connector?tab=readme-ov-file#indirect-write
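For reference, the indirect method in the linked README is selected through connector options; a hedged sketch (bucket name, partition column, and table id are placeholders, and `df` is assumed to hold the new data for the partition):

```python
# Dynamic partition overwrite is a Spark-level setting; per the reply above,
# it is the indirect (GCS-staged) write path that is expected to honor it.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

(
    df.write.format("bigquery")
    .mode("overwrite")
    .option("writeMethod", "indirect")                  # stage via GCS instead of the direct write API
    .option("temporaryGcsBucket", "my-staging-bucket")  # placeholder staging bucket
    .option("partitionField", "partition_date")         # placeholder partition column
    .save("my-project.my_dataset.my_table")             # placeholder BigQuery table
)
```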

7 More Replies
ashraf1395
by Valued Contributor
  • 162 Views
  • 3 replies
  • 0 kudos

Getting error while using Live.target_table in dlt pipeline

I have created a target table in the same DLT pipeline, but when I read that table in a different block of the notebook with Live.table_path, it is not able to read it. Here is my code. Block 1, creating a streaming table: # Define metadata tables catalog = sp...

Latest Reply
ashraf1395
Valued Contributor
  • 0 kudos

Can't we use Live.table_name on a target DLT table with the @dlt.append_flow decorator? If yes, can you share the code, because when I tried it I got an error.
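A minimal sketch of the append_flow pattern in question (all table names are placeholders): the target is declared once with create_streaming_table, each flow appends to it by name, and other pipeline tables are read with dlt.read_stream rather than through the target itself:

```python
import dlt

# Declare the target once; flows refer to it by plain name (no "live." prefix).
dlt.create_streaming_table("events_all")

@dlt.append_flow(target="events_all", name="events_from_source_a")
def events_from_source_a():
    return dlt.read_stream("source_a")   # placeholder upstream streaming table

@dlt.append_flow(target="events_all", name="events_from_source_b")
def events_from_source_b():
    return dlt.read_stream("source_b")   # placeholder upstream streaming table
```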

2 More Replies
ashraf1395
by Valued Contributor
  • 182 Views
  • 2 replies
  • 2 kudos

Resolved! Old files also getting added in dlt autoloader

So, I am using Auto Loader in a DLT pipeline for my data ingestion. I am using @dlt.append_flow because I have data to load from multiple sources. When I load a new file, say x, with 3 rows, my target gets 3 rows. But next, even if I don't load any file...

Latest Reply
Alberto_Umana
Databricks Employee
  • 2 kudos

Hi @ashraf1395, Just a few comments about your question: The cloudFiles source in Databricks is designed for incremental file processing. However, it depends on the checkpoint directory to track which files have been processed. The cloudFiles.include...
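To make the option being referenced concrete, a hedged sketch of an Auto Loader source inside an append_flow (format, path, and target name are placeholders, and the target is assumed to be declared elsewhere with dlt.create_streaming_table). In DLT the checkpoint is managed per flow, and a full refresh of the pipeline resets it, which makes previously seen files eligible again:

```python
import dlt

@dlt.append_flow(target="ingest_target", name="flow_source_x")
def flow_source_x():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")                  # placeholder format
        .option("cloudFiles.includeExistingFiles", "false")  # only affects the stream's very first run
        .load("/Volumes/catalog/schema/landing/source_x/")   # placeholder path
    )
```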

1 More Reply
AlexeyEgorov
by New Contributor II
  • 219 Views
  • 1 reply
  • 0 kudos

foreach execution faulty with number of partitions >= worker cores

In order to download multiple Wikipedia dumps, I collected the links in a list and wanted to use the foreach method to iterate over those links and apply a UDF that downloads the data into the previously created volume structure. However, I ran into an i...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

It seems like the issue you're encountering with incomplete file downloads when using the foreach method and a UDF in Spark might be related to the number of partitions and how tasks are distributed across them. Here are a few points to consider: Ta...
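One hedged way to line up the per-link work with tasks (placeholder URLs and volume path, not from the thread): build a small DataFrame of links, repartition so each task gets at most one link, and pass a plain function to foreach instead of a UDF:

```python
import urllib.request

links = [
    "https://example.com/dump-1.xml.bz2",   # placeholder URLs
    "https://example.com/dump-2.xml.bz2",
]

def download(row):
    # Runs on an executor; writes into an assumed Unity Catalog volume path.
    target = "/Volumes/catalog/schema/dumps/" + row.url.split("/")[-1]
    urllib.request.urlretrieve(row.url, target)

(
    spark.createDataFrame([(u,) for u in links], ["url"])
    .repartition(len(links))   # roughly one link per task, so downloads are spread evenly
    .foreach(download)
)
```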

aupres
by New Contributor III
  • 126 Views
  • 1 reply
  • 0 kudos

how to generate log files on specific folders

Hello! My environment is as follows: OS: Windows 11; Spark: spark-4.0.0-preview2-bin-hadoop3. And the configuration of the Spark files 'spark-defaults.conf' and 'log4j2.properties': spark-defaults.conf: spark.eventLog.enabled true spark.event...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @aupres, do you see any failures in the Spark logs? A few things to validate: it appears that the log files are not being generated in the specified directory due to a misconfiguration in your log4j2.properties file. Check the Appender Configuration: En...
