Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

kweks970
by New Contributor
  • 2838 Views
  • 1 reply
  • 0 kudos

DEV and PROD

A "SELECT * FROM" call on my table in PROD returns all the rows of data (the full history), but the same call on my table in DEV returns just one row (only the current row of the history). What could be the problem?

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Please don't cross post.  Thanks, Louis.

AlexMc
by New Contributor III
  • 1829 Views
  • 6 replies
  • 1 kudos

Resolved! GET /api/2.2/jobs/list Ordering

Hi there! I am calling the job list API (via the Python SDK): GET /api/2.2/jobs/list (docs.databricks.com/api/workspace/jobs/list). Does anyone know what ordering is applied to the list of jobs? Is it consistent or random? Is it by creation tim...

Latest Reply
AlexMc
New Contributor III
  • 1 kudos

Thanks both - this was very helpful!

5 More Replies
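Since the documentation does not guarantee a stable ordering, a defensive approach is to sort the results client-side. A minimal sketch in pure Python, assuming each job record carries the `created_time` field (epoch milliseconds) that the Jobs API returns:

```python
# Sketch: sort job records client-side rather than relying on API ordering.
# Assumes each record has the created_time field (epoch millis) present in
# the Jobs 2.2 API response.
def sort_jobs_by_creation(jobs):
    """Return jobs ordered oldest-first by creation timestamp."""
    return sorted(jobs, key=lambda j: j["created_time"])

jobs = [
    {"job_id": 42, "created_time": 1700000300000},
    {"job_id": 7,  "created_time": 1700000100000},
]
print([j["job_id"] for j in sort_jobs_by_creation(jobs)])  # [7, 42]
```

The same pattern works on objects returned by the Python SDK by sorting on the corresponding attribute instead of a dict key.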
Christian_C
by New Contributor II
  • 2239 Views
  • 7 replies
  • 0 kudos

Google Pub Sub and Delta live table

I am using Delta Live Tables and Pub/Sub to ingest messages from 30 different topics in parallel. I noticed that initialization time can be very long, around 15 minutes. Does anyone know how to reduce initialization time in DLT? Thank you.

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Classic clusters can take up to seven minutes to be acquired, configured, and deployed, with most of this time spent waiting for the cloud service to allocate virtual machines. In contrast, serverless clusters typically start in under eight seconds. ...

6 More Replies
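Following the reply above, moving the pipeline to serverless compute avoids most of the VM-acquisition wait. A hypothetical pipeline-settings fragment (field names as commonly seen in DLT pipeline JSON; verify against the Pipelines API for your workspace):

```json
{
  "name": "pubsub_ingest",
  "serverless": true,
  "continuous": true
}
```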
BF7
by Contributor
  • 1463 Views
  • 3 replies
  • 2 kudos

Resolved! How can we get AutoLoader to detect a file footer?

We are dealing with CSVs that have footers in them. When we have an otherwise empty file, the presence of this footer seems to impair Auto Loader's schema inference. I know there is a header = true parameter, but I don't see a foote...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

To be clear, when you say footer, are you referring to the last row of the file? e.g. Header = row 1, Footer = row_last.

2 More Replies
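Auto Loader exposes no footer option, so one workaround is to pre-process the files (or post-filter the parsed rows) before schema inference sees the trailer. A pure-Python sketch of the idea, assuming a single trailer line with a hypothetical `TRAILER` marker:

```python
import csv
import io

def rows_without_footer(text, footer_prefix="TRAILER"):
    """Parse CSV text and drop a trailing footer row whose first field
    starts with footer_prefix (a hypothetical trailer marker)."""
    rows = list(csv.reader(io.StringIO(text)))
    if rows and rows[-1] and rows[-1][0].startswith(footer_prefix):
        rows = rows[:-1]  # strip the footer so it cannot pollute inference
    return rows

sample = "id,name\n1,alice\nTRAILER,2 rows\n"
print(rows_without_footer(sample))  # [['id', 'name'], ['1', 'alice']]
```

In a Spark pipeline the same filter would typically run in a cleanup step that rewrites the files before Auto Loader picks them up.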
Yuki
by Contributor
  • 1115 Views
  • 2 replies
  • 1 kudos

Resolved! Can we implement Unity Catalog table lifecycle?

I want to delete tables that haven't been selected from or otherwise accessed for several months. I can see the Delta table history, but it only captures DDL and update/insert/delete operations, not "select". I realized that the Unity Catalog insight, ht...

Latest Reply
Yuki
Contributor
  • 1 kudos

Hi @Renu_, I appreciate your clear response. I now have a better understanding and will work with our admin team to develop a strategy. Thank you.

1 More Replies
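One way to approximate a lifecycle policy is to pull each table's last-access timestamp (e.g. from audit/system tables, which is an assumption to confirm with your admin team) and flag anything older than a cutoff. A sketch of the client-side filtering step:

```python
from datetime import datetime, timedelta

def stale_tables(last_access, cutoff_days=90, now=None):
    """Return table names whose most recent access is older than cutoff_days.
    last_access maps table name -> datetime of the last recorded access."""
    now = now or datetime.utcnow()
    limit = now - timedelta(days=cutoff_days)
    return sorted(t for t, ts in last_access.items() if ts < limit)

access = {
    "cat.schema.hot":  datetime(2025, 5, 20),
    "cat.schema.cold": datetime(2024, 11, 1),
}
print(stale_tables(access, cutoff_days=90, now=datetime(2025, 6, 1)))
# ['cat.schema.cold']
```

The flagged list would then feed a review step before any DROP TABLE is issued.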
Bart_DE
by New Contributor II
  • 1887 Views
  • 2 replies
  • 0 kudos

Resolved! Concurrency behavior with merge operations

Hi community, I have a case right now in a project where I have to develop a solution that will prevent duplicate data from being ingested into Delta Lake. Some of our data suppliers, on rare occasions, send us the same dataset in two diff...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Your idea of using a log table to track processed ingestions and leveraging a MERGE operation in your pipeline is a sound approach for preventing duplicate data ingestion into Delta Lake. Delta Lake's ACID transactions and its support for concurrency...

1 More Replies
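The log-table idea amounts to keeping a record of already-processed batch identifiers and skipping any delivery whose identifier is already present, which mirrors what a MERGE on a key achieves. A minimal sketch of that idempotency check (pure Python; `batch_id` is a hypothetical dedup key derived from the supplier delivery):

```python
def ingest(batch_id, rows, processed_log, table):
    """Append rows only if batch_id has not been seen before,
    mimicking the processed-log / MERGE-on-key pattern."""
    if batch_id in processed_log:
        return False  # duplicate delivery: skip entirely
    table.extend(rows)
    processed_log.add(batch_id)
    return True

log, table = set(), []
ingest("2024-01-01-supplierA", [1, 2], log, table)
ingest("2024-01-01-supplierA", [1, 2], log, table)  # duplicate, skipped
print(table)  # [1, 2]
```

In Delta Lake the `processed_log` would itself be a Delta table, with the check-and-insert wrapped in a single transaction so concurrent writers cannot both pass the check.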
Anonymous
by Not applicable
  • 3185 Views
  • 2 replies
  • 0 kudos

DBFS Permissions

Is there permission control at the folder/file level in DBFS? E.g. if a team member uploads a file to /Filestore/Tables/TestData/testfile, could we set permissions on TestData and/or testfile?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

DBFS does not have ACLs at this point.

1 More Replies
sahil3
by New Contributor
  • 650 Views
  • 1 reply
  • 0 kudos

NOT ABLE TO ATTACH CLUSTER

Notebook detached: exception when creating execution context: java.util.concurrent.TimeoutException: timed out after 15 seconds.

Latest Reply
RiyazAliM
Honored Contributor
  • 0 kudos

Hey @sahil3, try detaching and re-attaching the notebook to the cluster. Please note that this will clear the state of the notebook. If the issue persists, try restarting the cluster. Best,

rak_haq
by New Contributor III
  • 1992 Views
  • 3 replies
  • 1 kudos

Resolved! How to use read_kafka() SQL with secret()?

Hi, I want to read data from Azure Event Hubs using SQL. Can someone please give me an executable example that also uses the Event Hub connection string via the SQL function secret()? This is what I tried, but Databr...

Data Engineering
azure
event_hub
kafka
sql
streaming
Latest Reply
rak_haq
New Contributor III
  • 1 kudos

I found the solution and could successfully establish a connection to Event Hubs: SELECT cast(value as STRING) as raw_json, current_timestamp() as processing_time FROM read_kafka( bootstrapServers => '<YOUR EVENT-HUB NAMESPACE>.servicebus.windows.n...

2 More Replies
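For reference, a fuller sketch of the pattern the reply describes, based on the common Event Hubs Kafka-surface settings. The namespace, topic, and secret scope/key are placeholders, and the option names should be verified against the `read_kafka` documentation:

```sql
SELECT
  cast(value AS STRING) AS raw_json,
  current_timestamp()   AS processing_time
FROM read_kafka(
  bootstrapServers => '<YOUR-NAMESPACE>.servicebus.windows.net:9093',
  subscribe        => '<YOUR-EVENT-HUB-NAME>',
  startingOffsets  => 'earliest',
  `kafka.security.protocol` => 'SASL_SSL',
  `kafka.sasl.mechanism`    => 'PLAIN',
  `kafka.sasl.jaas.config`  => concat(
    'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required ',
    'username="$ConnectionString" password="',
    secret('<scope>', '<key>'),
    '";'
  )
);
```

The key point is that secret() is evaluated server-side, so the connection string never appears in the query text.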
Ajay-Pandey
by Databricks MVP
  • 5200 Views
  • 5 replies
  • 0 kudos

On-behalf-of token creation for service principals is not enabled for this workspace

Hi all, I just wanted to create a PAT for a Databricks service principal but am getting the below error while hitting the API or using the CLI. Please help me create a PAT for the same. #dataengineering #databricks

Data Engineering
community
Databricks
Latest Reply
JackB
New Contributor II
  • 0 kudos

You can generate the token while logged in as the service principal via the Azure CLI in a Command Prompt window. To do so, make sure to install the Azure CLI and the Databricks CLI with it. Install the Azure CLI for Windows | Microsoft Learn. Install ...

4 More Replies
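A hedged sketch of that flow as setup commands (the IDs are placeholders, and the flags should be checked against your installed Azure CLI and Databricks CLI versions):

```shell
# Log in as the service principal (placeholders for app id, secret, tenant)
az login --service-principal -u <APP_ID> -p <CLIENT_SECRET> --tenant <TENANT_ID>

# With the Databricks CLI configured to use that Azure AD identity,
# create a PAT owned by the service principal
databricks tokens create --lifetime-seconds 86400 --comment "sp-token"
```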
Harrison
by New Contributor II
  • 2419 Views
  • 1 reply
  • 0 kudos

Reading CloudWatch Logs from AWS Kinesis

If you have AWS CloudWatch subscribed to write out logs to AWS Kinesis, the Kinesis stream is base64 encoded and the CloudWatch logs are GZIP compressed. The challenge we faced was how to address that in PySpark to be able to read the data. We were ...

Latest Reply
oblikas
New Contributor II
  • 0 kudos

Thank you so much, this is very helpful

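The core of the decode step is independent of Spark: base64-decode, then gunzip, then parse JSON. In PySpark the same logic is typically wrapped in a UDF or expressed with the built-in unbase64/decompress functions. A self-contained Python sketch with a synthetic payload:

```python
import base64
import gzip
import json

def decode_cloudwatch_record(b64_payload):
    """Base64-decode then gunzip a CloudWatch-Logs-via-Kinesis payload and
    parse the JSON inside (per the encoding described in the post)."""
    return json.loads(gzip.decompress(base64.b64decode(b64_payload)))

# Round-trip demo with a synthetic payload standing in for a Kinesis record
event = {"logGroup": "/aws/lambda/demo", "logEvents": [{"message": "hello"}]}
payload = base64.b64encode(gzip.compress(json.dumps(event).encode()))
print(decode_cloudwatch_record(payload)["logGroup"])  # /aws/lambda/demo
```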
BF7
by Contributor
  • 2127 Views
  • 3 replies
  • 3 kudos

Resolved! What is the difference between spark inferschema and cloudFiles.inferColumnTypes?

We have been using spark.read with inferSchema = True to validate Auto Loader schema inference. But I suspect the two infer schemas differently and may not always yield identical results. Has anyone ever answered this questi...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 3 kudos

Hi @BF7, yes, there is a difference between how spark.read(...).option("inferSchema", "true") and Auto Loader's schema inference (cloudFiles.schemaHints, cloudFiles.inferColumnTypes, etc.) work. They are not guaranteed to produce identical results. Key ...

2 More Replies
Unimog
by New Contributor III
  • 1538 Views
  • 3 replies
  • 1 kudos

Resolved! springml sftp with spark 3.x

Is there a version of the springml spark-sftp library that works with Spark 3.x and Scala 2.12? If so, can you point me to it, or explain how to load it on my compute?

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

For Python you might want to look at Paramiko; it seems it could be an option. You could also look at ETL tools like Airbyte, Rivery, CData, etc.

2 More Replies
ÓscarHernández
by New Contributor II
  • 3384 Views
  • 3 replies
  • 0 kudos

SQLSTATE: XX000 The Spark SQL phase planning failed with an internal error.

Hello everyone, I am currently working with a SQL warehouse and have been getting the following error message: [INTERNAL_ERROR] The Spark SQL phase planning failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, re...

Latest Reply
ÓscarHernández
New Contributor II
  • 0 kudos

I have tried to simplify the query as much as possible to see if that helps, but the bug still persists. The problem seems to be something with the way Databricks treats columns passed as arguments to a function. I tried these queries: select * FROM VALU...

2 More Replies
minhhung0507
by Valued Contributor
  • 9014 Views
  • 15 replies
  • 3 kudos

API for Restarting Individual Failed Tasks within a Job?

Hi everyone, I'm exploring ways to streamline my workflow in Databricks and could really use some expert advice. In my current setup, I have a job (named job_silver) with multiple tasks (e.g., task 1, task 2, task 3). When one of these tasks fails, say...

Latest Reply
RiyazAliM
Honored Contributor
  • 3 kudos

Hey @minhhung0507, quick question: what is the cluster type you're using to run your workflow? I'm using a shared, interactive cluster, so I'm passing the parameter {'existing_cluster_id': task['existing_cluster_id']} in the payload. This parameter ...

14 More Replies
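For the thread's core question: the Jobs API exposes a repair-run endpoint that takes a `rerun_tasks` list, so only the failed tasks are re-executed. A sketch of building that request body in pure Python (the HTTP call itself is omitted; the `existing_cluster_id` override mirrors what the reply above passes and should be verified against the API reference):

```python
def build_repair_payload(run_id, failed_task_keys, cluster_id=None):
    """Request body for the Jobs runs/repair endpoint: rerun only the
    tasks that failed. cluster_id is the optional override used when
    repairing onto an existing interactive cluster."""
    payload = {"run_id": run_id, "rerun_tasks": list(failed_task_keys)}
    if cluster_id:
        payload["existing_cluster_id"] = cluster_id
    return payload

print(build_repair_payload(12345, ["task_2"]))
# {'run_id': 12345, 'rerun_tasks': ['task_2']}
```

The `run_id` here is the id of the failed job run, and each entry in `rerun_tasks` is a task key from the job definition.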