Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

yinan
by New Contributor
  • 139 Views
  • 5 replies
  • 2 kudos
Latest Reply
Khaja_Zaffer
Contributor
  • 2 kudos

Hello @yinan, good day! Databricks, being a cloud-based platform, does not have direct built-in support for reading data from a truly air-gapped (completely offline, no network connectivity) Cloudera Distribution for Hadoop (CDH) environment. In such...

4 More Replies
Kurgod
by New Contributor II
  • 159 Views
  • 2 replies
  • 0 kudos

Using Databricks to transform a Cloudera lakehouse on-prem without bringing the data to the cloud

I am looking for a solution to connect Databricks to a Cloudera lakehouse hosted on-prem and transform the data using Databricks without bringing the data into Databricks Delta tables or cloud storage. Once the transformation is done, the data needs to be ...

Latest Reply
BR_DatabricksAI
Contributor III
  • 0 kudos

Hello, what is your data volume? You can connect using JDBC/ODBC, but this process will be slower if the data volume is too high. Alternatively, if your Cloudera storage is in HDFS, you can also connect through the HDFS API.

1 More Replies
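
The JDBC route suggested in the reply can be sketched as follows; the host, database, and table names are hypothetical placeholders, and the commented `spark.read` call assumes a Databricks cluster with the Hive JDBC driver installed:

```python
# Sketch: reading an on-prem Cloudera table over JDBC from Databricks.
# All hostnames, databases, and tables below are invented for illustration.

def build_hive_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/db."""
    return f"jdbc:hive2://{host}:{port}/{database}"

url = build_hive_jdbc_url("cloudera-edge.example.com", 10000, "sales")

# On a Databricks cluster the read would look roughly like this -- as noted
# in the reply, this path is slow when the data volume is high:
#
# df = (spark.read.format("jdbc")
#       .option("url", url)
#       .option("dbtable", "orders")
#       .option("user", "<user>")
#       .option("password", "<password>")
#       .load())

print(url)  # jdbc:hive2://cloudera-edge.example.com:10000/sales
```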
azam-io
by New Contributor II
  • 618 Views
  • 4 replies
  • 2 kudos

How can I structure pipeline-specific job params separately in a Databricks Asset Bundle?

Hi all, I am working with Databricks Asset Bundles and want to separate environment-specific job params (for example, for "env" and "dev") for each pipeline within my bundle. I need each pipeline to have its own job param values for different environ...

Latest Reply
Michał
New Contributor
  • 2 kudos

Hi azam-io, were you able to solve your problem? Are you trying to have different parameters depending on the environment, or a different parameter value? I think targets would allow you to specify different parameters per environment/target. As fo...

3 More Replies
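
Along the lines the reply suggests, per-environment job parameters can be driven from bundle targets. A minimal `databricks.yml` sketch; the bundle, job, variable, and catalog names are all hypothetical, and the exact schema should be checked against the current Asset Bundles documentation:

```yaml
# databricks.yml -- hypothetical sketch of per-target job parameters.
bundle:
  name: my_bundle

variables:
  catalog:
    description: Target catalog for the pipeline
    default: dev_catalog

targets:
  dev:
    mode: development
    variables:
      catalog: dev_catalog
  prod:
    mode: production
    variables:
      catalog: prod_catalog

resources:
  jobs:
    my_job:
      name: my_job
      parameters:
        - name: env_catalog
          default: ${var.catalog}
```

Deploying with `databricks bundle deploy -t prod` would then resolve `env_catalog` to `prod_catalog` for that target.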
seefoods
by Contributor II
  • 2403 Views
  • 2 replies
  • 1 kudos

Resolved! Asset bundles

Hello guys, I am working on asset bundles, and I want to make them generic for all teams (analytics, data engineering). Could someone share a best practice for this purpose? Cordially,

Latest Reply
Michał
New Contributor
  • 1 kudos

Hi seefoods, were you able to achieve that generic asset bundle setup? I've been working on something potentially similar, and I'd be happy to discuss it, hoping to share experiences. While what I have works for a few teams, it is focused on declar...

1 More Replies
SharathE
by New Contributor III
  • 1908 Views
  • 3 replies
  • 1 kudos

Incremental refresh of materialized view in serverless DLT

Hello, every time that I run a Delta Live Tables materialized view in serverless, I get a log of "COMPLETE RECOMPUTE". How can I achieve incremental refresh in serverless DLT pipelines?

Latest Reply
drewipson
New Contributor III
  • 1 kudos

Make sure you are using the aggregates and SQL restrictions outlined in this article: https://docs.databricks.com/en/optimizations/incremental-refresh.html
If a SQL function is non-deterministic (current_timestamp() is a common one) you will have a CO...

2 More Replies
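
As the reply notes, incremental refresh depends on the view definition staying within the documented restrictions. A hypothetical sketch, with table and column names invented for illustration:

```sql
-- Sketch: a materialized view that stays incrementally refreshable because
-- its definition is deterministic (table/column names are illustrative).
CREATE MATERIALIZED VIEW daily_orders AS
SELECT
  order_date,               -- deterministic grouping column
  COUNT(*)    AS order_count,
  SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;

-- By contrast, referencing a non-deterministic function such as
-- current_timestamp() in the definition forces a complete recompute
-- on every refresh.
```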
korijn
by New Contributor II
  • 664 Views
  • 4 replies
  • 0 kudos

Git integration inconsistencies between git folders and job git

It's a little confusing and limiting that the git integration support is inconsistent between the two options available. Sparse checkout is only supported when using a workspace Git folder, and checking out by commit hash is only supported when using ...

Latest Reply
_J
New Contributor II
  • 0 kudos

Same here, could be a good improvement for the jobs layer guys!

3 More Replies
IONA
by New Contributor III
  • 385 Views
  • 6 replies
  • 7 kudos

Resolved! Getting data from the Spark query profiler

When you navigate to Compute > Select Cluster > Spark UI > JDBC/ODBC, you can see grids of Session stats and SQL stats. Is there any way to get this data in a query so that I can do some analysis? Thanks

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 7 kudos

Hi @IONA, as @BigRoux correctly suggested, there is no native way to get stats from the JDBC/ODBC Spark UI. 1. You can try to use the query history system table, but it has a limited number of metrics: %sql SELECT * FROM system.query.history 2. You can use /api/2....

5 More Replies
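
A minimal sketch of option 1 from the reply, querying the query history system table. The exact column set varies by workspace and release, so checking the schema first is advisable:

```sql
-- Sketch: inspecting recent query metrics from the query history system
-- table (requires Unity Catalog; run DESCRIBE system.query.history first
-- to confirm which columns your workspace exposes).
SELECT *
FROM system.query.history
ORDER BY start_time DESC
LIMIT 100;
```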
LeoGaller
by New Contributor II
  • 7539 Views
  • 4 replies
  • 4 kudos

What are the options for "spark_conf.spark.databricks.cluster.profile"?

Hey guys, I'm trying to find out what options we can pass to spark_conf.spark.databricks.cluster.profile. I know from looking around that some of the available configs are singleNode and serverless, but are there others? Where is the documentation for it?...

Latest Reply
s3
New Contributor II
  • 4 kudos

Recently I got stuck with the same issue. However, in the new view of the form/template to create a policy, you have an option to delete the setting "spark_conf.spark.databricks.cluster.profile" by clicking on the "bin" icon. Once you did that, you ...

3 More Replies
Yulei
by New Contributor III
  • 29298 Views
  • 7 replies
  • 1 kudos

Resolved! Could not reach driver of cluster

Hi, recently I have been seeing the issue "Could not reach driver of cluster <some_id>" with my structured streaming job when migrating to Unity Catalog, and found this when checking the traceback: Traceback (most recent call last): File "/databricks/python_shell/...

Latest Reply
omsingh
New Contributor II
  • 1 kudos

It seems like a temporary connectivity or cluster initialization glitch. So if anyone else runs into this, try re-running the job before diving into deeper troubleshooting - it might just work!Hope this helps someone save time.

6 More Replies
ChristianRRL
by Valued Contributor III
  • 101 Views
  • 1 reply
  • 0 kudos

Can schemaHints dynamically handle nested json structures? (Part 2)

Hi there, I'd like to follow up on a prior post: https://community.databricks.com/t5/data-engineering/can-schemahints-dynamically-handle-nested-json-structures/m-p/130209/highlight/true#M48731 Basically I'm wondering what's the best way to set *both* d...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

I am not aware of schemaHints supporting wildcards for now. It would be awesome to have, though, I agree. So I think you are stuck with what was already proposed in your previous post, or with exploding the JSON or other transformations.

minhhung0507
by Valued Contributor
  • 80 Views
  • 1 reply
  • 1 kudos

Could not reach driver of cluster

I am running a pipeline job in Databricks and it failed with the following message: "Run failed with error message Could not reach driver of cluster 5824-145411-p65jt7uo". This message is not very descriptive, and I am not able to identify the root ca...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @minhhung0507, typically this error can appear when there's a high load on the driver node. Another reason could be high garbage collection on the driver node, as well as high memory and CPU usage, which leads to throttling and prevents the driv...

elgeo
by Valued Contributor II
  • 6012 Views
  • 7 replies
  • 8 kudos

Clean up _delta_log files

Hello experts. We are trying to clarify how to clean up the large number of files accumulating in the _delta_log folder (JSON, CRC, and checkpoint files). We went through the related posts in the forum and followed the below: SET spark.da...

Latest Reply
michaeljac1986
New Contributor
  • 8 kudos

What you’re seeing is expected behavior: the _delta_log folder always keeps a history of JSON commit files, checkpoint files, and CRCs. Even if you lower delta.logRetentionDuration and run VACUUM, cleanup won’t happen immediately. A couple of points...

6 More Replies
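
A sketch of the retention settings discussed in this thread; the table name and interval values are illustrative only:

```sql
-- Sketch: shortening Delta log retention on one table. Cleanup happens
-- lazily, when a new checkpoint is written -- not at the moment these
-- properties are set, and not directly via VACUUM (which targets data files).
ALTER TABLE my_catalog.my_schema.my_table SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 7 days'
);
```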
erigaud
by Honored Contributor
  • 9609 Views
  • 7 replies
  • 6 kudos

Resolved! SFTP Autoloader

Hello, I don't know if it is possible, but I am wondering whether I can ingest files from an SFTP server using Autoloader, or do I have to first copy the files to DBFS and then use Autoloader on that location? Thank you!

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hi @erigaud, we haven't heard from you since the last response from @BriceBuso, and I was checking back to see if her suggestions helped you. If you have a solution, please share it with the community, as it can be helpful to others. Al...

6 More Replies
chiruinfo5262
by New Contributor II
  • 519 Views
  • 4 replies
  • 0 kudos

Trying to convert Oracle SQL to Databricks SQL but not getting the desired output

ORACLE SQL:
COUNT(CASE WHEN TRUNC(WORKORDER.REPORTDATE) BETWEEN SELECTED_PERIOD_START_DATE AND SELECTED_PERIOD_END_DATE THEN 1 END) SELECTED_PERIOD_BM,
COUNT(CASE WHEN TRUNC(WORKORDER.REPORTDATE) BETWEEN COMPARISON_PERIOD_START_DATE AND COMPARISON_...

Latest Reply
Granty
New Contributor
  • 0 kudos

This is a helpful comparison! I've definitely run into similar date formatting issues when migrating queries. The Oracle TRUNC function and Databricks' DATE_FORMAT/CAST combo can be tricky to reconcile. Speaking of needing a break after debugging SQL...

3 More Replies
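
For readers reconciling the two dialects: Oracle's TRUNC(date) drops the time-of-day portion, and in Databricks SQL the usual equivalents are CAST(col AS DATE) or date_trunc('DAY', col). The semantics can be sketched in plain Python (all names are illustrative):

```python
from datetime import datetime, date

def trunc_day(ts: datetime) -> date:
    """Mimic Oracle TRUNC(ts): drop the time-of-day portion, keep the date."""
    return ts.date()

# Hypothetical values standing in for WORKORDER.REPORTDATE and the period bounds.
report_ts = datetime(2024, 3, 15, 13, 45, 9)
period_start = date(2024, 3, 1)
period_end = date(2024, 3, 31)

# Equivalent of: CASE WHEN TRUNC(REPORTDATE) BETWEEN start AND end THEN 1 END
in_period = period_start <= trunc_day(report_ts) <= period_end
print(in_period)  # True
```

In Databricks SQL the counting expression would then read roughly COUNT(CASE WHEN CAST(WORKORDER.REPORTDATE AS DATE) BETWEEN SELECTED_PERIOD_START_DATE AND SELECTED_PERIOD_END_DATE THEN 1 END).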
james_
by New Contributor II
  • 224 Views
  • 5 replies
  • 0 kudos

Low worker utilisation in Spatial SQL

I am finding low worker node utilization when using Spatial SQL features. My cluster is DBR 17.1 with 2x workers and Photon enabled. When I view the cluster metrics, they consistently show one worker around 30-50% utilized, the driver around 15-20%, a...

Latest Reply
james_
New Contributor II
  • 0 kudos

Thank you again, @-werners- . I have a lot still to learn about partitioning and managing spatial data. Perhaps I mainly need more patience!

4 More Replies