Data Engineering

Forum Posts

by chitrar • New Contributor III

03-06-2025 8:02:41 AM

779 Views
9 replies
4 kudos

workflow/lakeflow - why does it not capture all the metadata of the jobs/tasks?

Hi, I see that with Unity Catalog we have the workflow and now the lakeflow schema. I guess the intention is to capture audit logs of changes and monitor runs, but I wonder why we don't have all the metadata info on the jobs/tasks too for a given job =...

Latest Reply

chitrar
New Contributor III

a month ago

4 kudos

@Sujitha So we can expect these enhancements in the "near" future?

8 More Replies

by antoniomf • New Contributor

a month ago

332 Views
0 replies
0 kudos

Bug Delta Live Tables - Checkpoint

Hello, I've encountered an issue with Delta Live Tables in both my Development and Production workspaces. The data is arriving correctly in my Azure Storage Account; however, the checkpoint is being stored in the path dbfs:/. I haven't modified the St...

by jeremy98 • Contributor III

a month ago

291 Views
1 replies
0 kudos

if else condition task doubt

Hi community, can the if/else condition task not be used as a real if condition? It seems that if the condition evaluates to False, the entire job stops. Is this the expected behaviour?

Latest Reply

jeremy98
Contributor III

a month ago

0 kudos

Hi, I found that the problem is here:
- task_key: get_email_infos
  max_retries: 3
  min_retry_interval_millis: 150000
  depends_on:
    - task_key: check_type_of_trigger
      outcome: "true"
...
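For readers following along, here is a minimal sketch of the dependency shape quoted in this reply, written as a Python dict. It assumes each `depends_on` entry carries an `outcome` of "true" or "false", as in the snippet. The `skip_email_infos` task, the condition operands, and the helper `tasks_for_outcome` are hypothetical; they only illustrate attaching one task to each branch. This is a toy model of the configuration, not how Databricks schedules tasks.

```python
# Hypothetical job fragment mirroring the keys quoted in the reply above.
# Only the two task keys from the reply are real; everything else is a placeholder.
TASKS = [
    {
        "task_key": "check_type_of_trigger",
        # Placeholder operands: the original condition is not shown in the thread.
        "condition_task": {"op": "EQUAL_TO", "left": "left-value", "right": "right-value"},
    },
    {
        "task_key": "get_email_infos",
        "depends_on": [{"task_key": "check_type_of_trigger", "outcome": "true"}],
    },
    {
        # Hypothetical task attached to the "false" branch.
        "task_key": "skip_email_infos",
        "depends_on": [{"task_key": "check_type_of_trigger", "outcome": "false"}],
    },
]


def tasks_for_outcome(tasks, condition_key, outcome):
    """List the task keys that depend on `condition_key` with the given outcome."""
    selected = []
    for task in tasks:
        for dep in task.get("depends_on", []):
            if dep["task_key"] == condition_key and dep.get("outcome") == outcome:
                selected.append(task["task_key"])
    return selected


true_branch = tasks_for_outcome(TASKS, "check_type_of_trigger", "true")
false_branch = tasks_for_outcome(TASKS, "check_type_of_trigger", "false")
print("true branch:", true_branch)
print("false branch:", false_branch)
```

If every downstream task depends only on the "true" outcome, the helper returns an empty list for "false", which is one way to see why nothing further would be selected on that branch.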

by htu • New Contributor III

05-02-2024 11:14:32 PM

10517 Views
19 replies
23 kudos

Installing Databricks Connect breaks pyspark local cluster mode

Hi, it seems that when databricks-connect is installed, pyspark is modified at the same time so that it no longer works with a local master node. Local mode has been especially useful in testing, when unit tests for Spark-related code without any remot...

Latest Reply

Martinitus
New Contributor III

a month ago

23 kudos

I agree with most of the comments above that the current approach of databricks-connect is not great (it sucks, to be frank). It's an issue that has been bugging me for more than 2 years now. By the way, I checked how this could be done with poetry and ...

18 More Replies

by turagittech • New Contributor III

03-11-2025 11:17:59 PM

314 Views
3 replies
1 kudos

External Table refresh

Hi, I have a blob storage area in Azure where JSON files are being created. I can create an external table on the storage blob container, but when new files are added I don't see the extra rows when I query the data. Is there a better approach to accessing th...

Latest Reply

Nivethan_Venkat
Contributor

a month ago

1 kudos

Hi @turagittech, the above error indicates that your table seems to be in DELTA format. Please check the table creation statement to see whether the table format is JSON or DELTA. PS: By default, if you are not specifying any format while creating the table on to...

2 More Replies

by Walter_N • New Contributor II

a month ago

284 Views
2 replies
0 kudos

Resolved! DLT pipeline task with full refresh once in a while

Hi all, I'm using Databricks Workflows with some DLT pipeline tasks. These tasks require a full refresh at times due to schema changes in the source. I've been doing the full refresh manually or setting the full refresh option in the job settings, t...

Latest Reply

MariuszK
Contributor III

a month ago

0 kudos

Hi, did you consider using an if/else task? You could define some criteria and pass the result from a notebook that checks whether it's time for a full refresh or just a regular refresh.
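A minimal sketch of what such a notebook check could look like, under stated assumptions: the rule used here (full refresh on the first Sunday of each month) and the names `refresh_mode` and `full_refresh_weekday` are hypothetical, since the thread does not say what the criteria should be. In a Databricks notebook the result could then be published as a task value for the if/else task to compare against; that call is shown only as a comment because `dbutils` is not available outside Databricks.

```python
from datetime import date


def refresh_mode(run_date: date, full_refresh_weekday: int = 6) -> str:
    """Return "full" on the first occurrence of the chosen weekday in a month.

    Weekdays follow date.weekday(): Monday is 0, Sunday is 6.
    Any other day returns "incremental".
    """
    in_first_week = run_date.day <= 7
    if in_first_week and run_date.weekday() == full_refresh_weekday:
        return "full"
    return "incremental"


mode = refresh_mode(date.today())
print(mode)

# Inside a Databricks notebook task (assumption: key name is arbitrary):
# dbutils.jobs.taskValues.set(key="refresh_mode", value=mode)
```

The if/else task would then branch on the published value, with one pipeline task configured for a full refresh on one branch and a regular update on the other.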

1 More Replies

by scorpusfx1 • New Contributor II

03-13-2025 2:01:45 AM

327 Views
4 replies
0 kudos

Delta Live Table SCD2 performance issue

Hi Community, I am working on ingestion pipelines that take data from Parquet files (200 MB per day) and integrate them into my Lakehouse. This data is used to create an SCD Type 2 using apply_changes, with the row ID as the key and the file date as t...

apply_change

dlt

SCD2

Latest Reply

Stefan-Koch
Valued Contributor II

a month ago

0 kudos

Hi @scorpusfx1, what kind of source data do you have? Are these Parquet files daily full snapshots of source tables? If so, you should use apply_changes_from_snapshot, which is built exactly for this use case. https://docs.databricks.com/aws/en/dlt/py...

3 More Replies

by soumiknow • Contributor

12-10-2024 11:00:20 PM

2103 Views
16 replies
1 kudos

Resolved! BQ partition data deleted fully even though 'spark.sql.sources.partitionOverwriteMode' is DYNAMIC

We have a date (DD/MM/YYYY) partitioned BQ table. We want to update the data in a specific partition in 'overwrite' mode using PySpark. To do this, I set 'spark.sql.sources.partitionOverwriteMode' to 'DYNAMIC' as per the Spark BQ connector documentat...

Latest Reply

VZLA
Databricks Employee

01-08-2025 4:51:00 AM

1 kudos

@soumiknow , Just checking if there are any further questions, and did my last comment help?

15 More Replies

by Livingstone • New Contributor II

08-19-2024 7:54:14 AM

1352 Views
3 replies
3 kudos

Install maven package to serverless cluster

My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster. Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, ...

Latest Reply

GalenSwint
New Contributor II

03-13-2025 7:47:32 PM

3 kudos

I also have this question and wondered what the options were / are

2 More Replies

by analytics_eng • New Contributor II

01-09-2025 12:24:05 AM

1718 Views
4 replies
1 kudos

Connection reset by peer logging when importing custom package

Hi! I'm trying to import a custom package I published to Azure Artifacts, but I keep seeing the INFO logging below, which I don't want to display. The package was installed correctly on the cluster, and it imports successfully, but the log still appe...

Latest Reply

siklosib
New Contributor II

03-13-2025 4:10:47 PM

1 kudos

What solved this problem for me was removing the root logger configuration from the logging config and creating another one within the loggers section. See below. { 'version': 1, 'disable_existing_loggers': False, 'formatters': { 'simple...
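Since the config in this reply is cut off, here is a separate, minimal sketch of the idea using Python's standard `logging.config.dictConfig`. It is not the poster's actual config: the logger name `my_package`, the handler name `console`, and the format string are assumptions for illustration. The point is only that the dict has no `root` entry, and the handler is attached to a named logger under `loggers` instead.

```python
import logging
import logging.config

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # Format string is an assumption; the original one is truncated.
        "simple": {"format": "%(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "level": "INFO",
        },
    },
    # No "root" key here: this config leaves the root logger untouched,
    # so loggers from other libraries do not pick up the "console" handler.
    "loggers": {
        # "my_package" is a hypothetical name for the custom package's logger.
        "my_package": {
            "handlers": ["console"],
            "level": "INFO",
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

logger = logging.getLogger("my_package")
logger.info("custom package imported")
```

With `propagate` set to False, records from `my_package` stop at its own handler rather than also travelling up to whatever handlers the root logger already has.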

3 More Replies

by fscaravelli • New Contributor

03-13-2025 11:18:57 AM

308 Views
0 replies
0 kudos

Ingest files from GCS with Auto Loader in DLT pipeline running on AWS

I have some DLT pipelines working fine ingesting files from S3. Now I'm trying to build a pipeline to ingest files from GCS using Auto Loader. I'm running Databricks on AWS. The code I have:
import dlt
import json
from pyspark.sql.functions import col ...

by nhuthao • New Contributor II

03-05-2025 10:50:23 PM

526 Views
5 replies
1 kudos

SQL is not enabled

Hi All, I have registered on Databricks successfully. However, SQL is not enabled. Please help me activate SQL. Thank you very much.

Latest Reply

Stefan-Koch
Valued Contributor II

03-13-2025 10:14:17 AM

1 kudos

@nhuthao How did you solve it? What was the problem?

4 More Replies

by cgrant • Databricks Employee

06-23-2021 2:28:51 PM

14200 Views
3 replies
4 kudos

What is the difference between OPTIMIZE and Auto Optimize?

I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?

Latest Reply

basit
New Contributor II

03-13-2025 10:08:06 AM

4 kudos

Is this still a valid answer in 2025? https://docs.databricks.com/aws/en/delta/tune-file-size#auto-compaction-for-delta-lake-on-databricks

2 More Replies

by Amit_Dass_Chmp • New Contributor III

03-13-2025 9:14:55 AM

253 Views
0 replies
0 kudos

Query on Databricks ARC: ARC will not work on 13.x or greater runtime

I have a query on Databricks ARC. Is this statement true? "Databricks Runtime requirements for implementing ARC: ARC requires Databricks ML Runtime 12.2 LTS. ARC will not work on 13.x or greater runtime."

by ShivangiB • New Contributor III

03-05-2025 10:12:51 PM

413 Views
2 replies
0 kudos

Resolved! Factors deciding to choose between zorder, partitioning and liquid clustering

What are the factors on which we should choose the optimization approach?

Latest Reply

canadiandataguy
New Contributor III

03-13-2025 8:36:04 AM

0 kudos

I have built a decision tree on how to think about it: https://www.canadiandataguy.com/p/optimizing-delta-lake-tables-liquid?triedRedirect=true

1 More Replies
