Data Engineering

Forum Posts

User16869510359
by Esteemed Contributor
  • 675 Views
  • 1 replies
  • 0 kudos

Resolved! Delta Streaming and Optimize

I have a master Delta table that is continuously being written to by a streaming job. I have optimized writes enabled and, in addition, I run the OPTIMIZE command every 3 hours. However, I think the downstream streaming jobs which are streaming the data...

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

This is working as expected. For Delta streaming, the data files created in the first place will be used for streaming. The optimized files are not considered by the downstream streaming job. This is the reason it's not recommended to run VACUUM with f...
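A minimal sketch of the setup being discussed, assuming a notebook where spark is defined; the table path is illustrative, not from the thread:

// Periodic compaction rewrites small files into larger ones, but the
// originally committed files stay on storage until VACUUM removes them.
spark.sql("OPTIMIZE delta.`/mnt/delta/master_table`")

// A downstream job keeps streaming from the originally committed files,
// which is why running VACUUM with a very short retention can break it.
val downstream = spark.readStream
  .format("delta")
  .load("/mnt/delta/master_table")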

User16826990884
by New Contributor III
  • 1450 Views
  • 1 replies
  • 1 kudos

Impact on Databricks objects after a user is deleted

What happens to resources (notebooks, jobs, clusters, etc.) owned by a user when that user is deleted? The underlying problem we are trying to solve is that we want to automatically delete users through SCIM when the user leaves the company so that the u...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

When you remove a user from Databricks, a special backup folder is created in the workspace. This backup folder contains all of the deleted user’s content. With respect to clusters and jobs, an admin can grant permissions to other users.
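As a hedged illustration of the SCIM-based cleanup mentioned in the question (host, token, and user id below are placeholders, not from the thread):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Delete a user via the Databricks SCIM API; their content then moves to
// the backup folder described above. Host and user id are illustrative.
val host   = "https://example.cloud.databricks.com"
val userId = "1234567890"
val token  = sys.env("DATABRICKS_TOKEN")  // assumes a PAT in the environment

val request = HttpRequest.newBuilder()
  .uri(URI.create(s"$host/api/2.0/preview/scim/v2/Users/$userId"))
  .header("Authorization", s"Bearer $token")
  .DELETE()
  .build()

val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
println(response.statusCode())  // 204 indicates the user was deleted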

User16869510359
by Esteemed Contributor
  • 1364 Views
  • 1 replies
  • 1 kudos

Resolved! How to run commands on the executor

Using %sh, I am able to run commands on the notebook and get output. How can I run a command on the executor and get the output? I want to avoid using the Spark APIs.

Latest Reply
User16869510359
Esteemed Contributor
  • 1 kudos

It's not possible to use %sh to run commands on the executor. The below code can be used to run commands on the executor and get the output:

var res = sc.runOnEachExecutor[String]({ () =>
  import sys.process._
  var cmd_Result = Seq("bash", "-c", "h...
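Since runOnEachExecutor is Databricks-specific and the snippet above is cut off, here is a hedged, portable sketch using standard Spark APIs instead; hostname is just an example command:

import sys.process._

// Launch one task per available slot; each task shells out on whichever
// executor it lands on. With enough tasks, every executor is covered.
val numTasks = sc.defaultParallelism
val results = sc.parallelize(0 until numTasks, numTasks)
  .mapPartitions { _ =>
    val output = Seq("bash", "-c", "hostname").!!
    Iterator(output.trim)
  }
  .collect()
  .distinct

results.foreach(println)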

User16826990884
by New Contributor III
  • 10497 Views
  • 1 replies
  • 1 kudos

Resolved! Views vs Materialized Delta Tables

Is there general guidance around using views vs creating Delta tables? For example, I need to do some filtering and make small tweaks to a few columns for use in another application. Is there a downside of using a view here?

Latest Reply
User16826990884
New Contributor III
  • 1 kudos

Views won't duplicate the data, so if you are just filtering columns or rows or making small tweaks, then views might be a good option. Unless, of course, the filtering is really expensive or you are doing a lot of calculations, then materialize the vi...
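A minimal sketch of the two choices on Databricks; table and column names are illustrative:

// A view stores only the query; data is read from the base table each time.
spark.sql("""
  CREATE OR REPLACE VIEW sales_emea AS
  SELECT order_id, amount * 1.1 AS adjusted_amount
  FROM sales
  WHERE region = 'EMEA'
""")

// Materializing as a Delta table copies the result, which pays off when
// the transformation is expensive to recompute.
spark.sql("""
  CREATE OR REPLACE TABLE sales_emea_tbl USING DELTA AS
  SELECT order_id, amount * 1.1 AS adjusted_amount
  FROM sales
  WHERE region = 'EMEA'
""")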

Srikanth_Gupta_
by Valued Contributor
  • 693 Views
  • 1 replies
  • 1 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

All three options are secure ways to store secrets. Databricks secrets have the additional functionality of redaction, so they are convenient sometimes. Also, in Azure, you have the ability to use Azure Key Vault as the backend for secrets.
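A minimal sketch of the redaction behaviour, assuming a notebook with a secret scope already configured; scope and key names are illustrative:

// Fetch a secret at runtime; notebook output containing the value is
// shown as [REDACTED] rather than the plaintext.
val apiKey = dbutils.secrets.get(scope = "my-scope", key = "api-key")
println(apiKey)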

User16869510359
by Esteemed Contributor
  • 2180 Views
  • 1 replies
  • 1 kudos

Resolved! Classpath issues when running spark-submit

How can I identify the jar used to load a particular class? I am sure I packed the classes correctly in my application jar. However, it looks like the class is loaded from a different jar. I want to understand the details so that I can ensure to use the r...

Latest Reply
User16869510359
Esteemed Contributor
  • 1 kudos

Adding the below configurations at the cluster level can help print more logs to identify the jar from which the class is loaded:

spark.executor.extraJavaOptions=-verbose:class
spark.driver.extraJavaOptions=-verbose:class
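Alternatively, a hedged way to ask the JVM directly which jar a class came from (replace SparkContext with the class you are debugging):

// getCodeSource returns null for classes loaded by the bootstrap classloader.
val source = classOf[org.apache.spark.SparkContext]
  .getProtectionDomain
  .getCodeSource
println(Option(source).map(_.getLocation).getOrElse("bootstrap classloader"))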

User16869510359
by Esteemed Contributor
  • 700 Views
  • 1 replies
  • 0 kudos

Resolved! Cannot upload libraries on UI

When trying to upload libraries via the UI, the upload fails.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

One corner-case scenario where we can hit this issue is if there is a /<shard name>/0/FileStore/jars file in the root bucket of the workspace. Once you remove the file, the upload should work fine.
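A quick hedged check from a notebook; the DBFS path below is the usual view of that location, so adjust it if your workspace differs:

// If a plain file rather than a directory sits at this path,
// library uploads from the UI can fail.
dbutils.fs.ls("dbfs:/FileStore/jars").foreach(println)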

User16783853906
by Contributor III
  • 1426 Views
  • 3 replies
  • 0 kudos

Resolved! Frequent spot loss of driver nodes resulting in failed jobs when using spot fleet pools

When using spot fleet pools to schedule jobs, driver and worker nodes are provisioned from the spot pools, and we are noticing jobs failing with the below exception when there is a driver spot loss. Share best practices around using fleet pools with 1...

Latest Reply
User16783853906
Contributor III
  • 0 kudos

In this scenario, the driver node is reclaimed by AWS. Databricks has started a preview of the hybrid pools feature, which allows you to provision the driver node from a different pool. We recommend using an on-demand pool for the driver node to improve reliability i...

User16869510359
by Esteemed Contributor
  • 811 Views
  • 1 replies
  • 1 kudos

Resolved! Databricks Vs Yarn - Resource Utilization

I have a spark-submit application that worked fine with 8GB executor memory in yarn. I am testing the same job against the Databricks cluster with the same executor memory. However, the jobs are running slower in Databricks. 

Latest Reply
User16869510359
Esteemed Contributor
  • 1 kudos

This is not an apples-to-apples comparison. When you set 8GB as the executor memory in YARN, the container that is launched to run the executor JVM gets 8GB of memory. Accordingly, the Xmx value of the heap is calculated. In Databricks, when...
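A small sketch for comparing the configured value with the heap the JVM actually received; run it on the cluster in question, where spark is assumed to be defined:

// What was requested via configuration (throws if not explicitly set).
println(spark.conf.get("spark.executor.memory"))

// What this JVM actually has as max heap (here, the driver). On Databricks,
// heap sizes are derived from the instance type rather than set directly.
println(s"${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")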

Anonymous
by Not applicable
  • 1571 Views
  • 2 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You can disable the download button for notebook results (which exports results as CSV) from the Admin Console -> Workspace Settings -> Advanced section.

Srikanth_Gupta_
by Valued Contributor
  • 964 Views
  • 2 replies
  • 0 kudos
Latest Reply
User16783853906
Contributor III
  • 0 kudos

When a cluster is attached to a pool, cluster nodes are created using the pool’s idle instances, which helps to reduce cluster start and autoscaling times. If you are using pools and looking to reduce start time for all scenarios, then you should ...

Anonymous
by Not applicable
  • 701 Views
  • 2 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

If you are looking to incrementally load data from Azure SQL, check out one of our technology partners that support change data capture, or set up Debezium for SQL Server. These solutions can land data in a streaming fashion to Kafka/Kinesis/even...
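A minimal sketch of the consuming side, assuming the CDC events have been landed in Kafka (broker and topic names are illustrative):

// Read the change feed with Structured Streaming; each record's value
// holds the CDC payload (e.g., a Debezium JSON envelope).
val cdc = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")
  .option("subscribe", "sqlserver.dbo.orders")
  .load()
  .selectExpr("CAST(value AS STRING) AS payload")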

User16869510359
by Esteemed Contributor
  • 1031 Views
  • 1 replies
  • 0 kudos

Resolved! Can I use the OSS Spark History Server to view the event logs

Is it possible to run the OSS Spark history server and view the Spark event logs?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Yes, it's possible. The OSS Spark history server can read the Spark event logs generated on a Databricks cluster. Using cluster log delivery, the Spark logs can be written to any arbitrary location. Event logs can be copied from there to the storage ...
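A minimal sketch of pointing an OSS history server at the copied logs; the path is illustrative:

# conf/spark-defaults.conf in the OSS Spark distribution
spark.history.fs.logDirectory=file:/path/to/copied-event-logs

# then start the server and browse to port 18080 (the default)
./sbin/start-history-server.sh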

User16826990884
by New Contributor III
  • 429 Views
  • 0 replies
  • 0 kudos

Encrypt root S3 bucket

This is a 2-part question:
How do I go about encrypting an existing root S3 bucket?
Will this impact my Databricks environment? (Resources not being accessible, performance issues, etc.)

User16869510359
by Esteemed Contributor
  • 1407 Views
  • 1 replies
  • 0 kudos

Resolved! Jobs running forever in Spark UI

On the Spark UI, jobs appear to be running forever, but my notebook has already completed its operations. Why are the resources being wasted?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

This happens if the Spark driver is missing events. The jobs/tasks are not actually running; the Spark UI is reporting incorrect stats. This can be treated as a harmless UI issue. If you continue to see the issue consistently, then it might be good to review w...
