Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Esteemed Contributor
  • 1870 Views
  • 1 replies
  • 0 kudos

Resolved! Why is my job marked as failed in the Databricks Jobs UI even though it completed all the operations in the application?

I have a JAR job that was migrated from EMR to Databricks. The job runs as expected and completes all the operations in the application. However, the job run is marked as failed in the Databricks Jobs UI.

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

Calling spark.stop(), sc.stop(), or System.exit() in your application can cause this behavior. Databricks manages the context shutdown on its own, so forcibly closing the context can end the run abruptly and cause it to be reported as failed.
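
As a rough illustration of how you might catch these calls before deploying a migrated job, here is a minimal sketch that scans source text for the three calls named above. The function name `find_shutdown_calls` and the sample Scala source are hypothetical, and the check is a plain text match (it does not skip comments or string literals).

```python
import re

# Calls named in the reply above; Databricks handles context shutdown itself.
SHUTDOWN_PATTERNS = [
    r"\bspark\.stop\(\)",
    r"\bsc\.stop\(\)",
    r"\bSystem\.exit\(",
]


def find_shutdown_calls(source: str) -> list:
    """Return (line_number, stripped_line) for each line containing a flagged call."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(re.search(pattern, line) for pattern in SHUTDOWN_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# Hypothetical Scala job source, for illustration only.
sample = """object MyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    spark.range(10).count()
    spark.stop()
    System.exit(0)
  }
}"""

for line_number, text in find_shutdown_calls(sample):
    print(f"line {line_number}: {text}")
```

Removing the flagged lines (rather than wrapping them in conditions) is usually the simplest fix, since the platform closes the context when the job finishes.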

brickster_2018
by Esteemed Contributor
  • 940 Views
  • 1 replies
  • 2 kudos

Few things you should not do in Databricks!

Latest Reply
brickster_2018
Esteemed Contributor
  • 2 kudos

Compared to OSS Spark, these are a few things users don't have to worry about when running the same job on Databricks. Memory management: Databricks uses an internal formula to allocate the driver and executor heap based on the size of the instance....

brickster_2018
by Esteemed Contributor
  • 2206 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

Although it is not a hard limit, it's recommended to keep the number of cells in a notebook below 100 for a better UI experience and code readability. Having a really large block of code in a cell defeats the purpose of notebook execution and al...

brickster_2018
by Esteemed Contributor
  • 17664 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

Yes, it's possible to download files from DBFS. To download the files: files stored in /FileStore are accessible in your web browser at https://<databricks-instance-name>.cloud.databricks.com/files/. For example, the file you stored in /FileStore/my-da...
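
To make the path-to-URL mapping concrete, here is a small sketch. The helper name `filestore_url`, the instance name `my-workspace`, and the file path are all made up for illustration; it assumes the URL pattern quoted above and that you are already signed in to the workspace in your browser.

```python
def filestore_url(instance_name: str, dbfs_path: str) -> str:
    """Map a DBFS path under /FileStore to the browser URL pattern quoted above."""
    prefix = "/FileStore/"
    if not dbfs_path.startswith(prefix):
        # Only /FileStore is described as browser-accessible in the reply.
        raise ValueError("path must start with /FileStore/")
    relative = dbfs_path[len(prefix):]
    return f"https://{instance_name}.cloud.databricks.com/files/{relative}"


# Hypothetical instance name and file path.
print(filestore_url("my-workspace", "/FileStore/reports/summary.csv"))
```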

User16783853501
by New Contributor II
  • 1061 Views
  • 2 replies
  • 0 kudos

What is the best way to convert a very large Parquet table to Delta, possibly without downtime?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

I vouch for Sajith's answer. The main advantage of "CONVERT TO DELTA" is that the operation is metadata-centric, which means we are not reading the full data for the conversion. For any other file format conversion, it's necessary to read the data com...
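
For reference, a minimal sketch of what the statement could look like when built from Python. The helper `convert_to_delta_sql`, the path `/mnt/events`, and the partition column are hypothetical; on a cluster you would pass the resulting string to `spark.sql(...)`, which is not done here so the snippet runs anywhere.

```python
from typing import Optional


def convert_to_delta_sql(parquet_path: str, partition_schema: Optional[str] = None) -> str:
    """Build a CONVERT TO DELTA statement for a Parquet directory."""
    statement = f"CONVERT TO DELTA parquet.`{parquet_path}`"
    if partition_schema:
        # Partitioned Parquet tables need their partition columns declared.
        statement += f" PARTITIONED BY ({partition_schema})"
    return statement


# Hypothetical table location and partition column.
print(convert_to_delta_sql("/mnt/events", "event_date DATE"))
```

Because the conversion works on metadata, the statement itself is the same regardless of table size; what changes is how long the file listing takes.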

1 More Replies
brickster_2018
by Esteemed Contributor
  • 1240 Views
  • 2 replies
  • 0 kudos

Why should I move to Auto Loader?

I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate and who should not?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

That makes sense, @Anand Ladda! One major improvement that will have a direct impact on performance is the architectural difference. S3-SQS uses an internal implementation of the Delta table to store the checkpoint details about the source files...

1 More Replies
User16783853501
by New Contributor II
  • 1912 Views
  • 3 replies
  • 0 kudos

Best practice for optimized writes and OPTIMIZE

What is the best practice for a very high-throughput Delta pipeline to avoid the small-files problem and reduce the need to run an external OPTIMIZE frequently?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

The general practice is to enable only optimized writes and disable auto-compaction. This is because optimized writes introduce an extra shuffle step, which increases the latency of the write operation. In addition to that, the auto-...
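
As a sketch of how that combination might be set on an existing table, the snippet below builds an ALTER TABLE statement using the Delta table properties `delta.autoOptimize.optimizeWrite` and `delta.autoOptimize.autoCompact`. The helper name `auto_optimize_sql` and the table name `events` are hypothetical; the string would be run with `spark.sql(...)` on a cluster.

```python
def auto_optimize_sql(table: str, optimize_write: bool = True, auto_compact: bool = False) -> str:
    """Build an ALTER TABLE statement toggling optimized writes and auto-compaction."""
    properties = {
        "delta.autoOptimize.optimizeWrite": optimize_write,
        "delta.autoOptimize.autoCompact": auto_compact,
    }
    # SQL boolean literals are lowercase, so render Python bools accordingly.
    rendered = ", ".join(f"{key} = {str(value).lower()}" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({rendered})"


# Hypothetical table name; defaults mirror the practice described above.
print(auto_optimize_sql("events"))
```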

2 More Replies
aladda
by Honored Contributor II
  • 1901 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Stats collected on a Delta column are used for partition pruning and data skipping. See https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-data-skipping for details. In addition, stats are also used for metadata-only q...

aladda
by Honored Contributor II
  • 1135 Views
  • 0 replies
  • 0 kudos

What are the recommendations around collecting stats on long strings in a Delta table?

It is best to avoid collecting stats on long strings. You typically want to collect stats on columns that are used in filters, WHERE clauses, and joins, and on which you tend to perform aggregations, typically numerical values. You can avoid collecting s...

User16783853501
by New Contributor II
  • 1085 Views
  • 2 replies
  • 0 kudos

Delta Optimistic Transactions Resolution and Exceptions

What is the best way to deal with concurrency exceptions in Delta when you have multiple writers on the same Delta table?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

While you can try-catch-retry, retrying is expensive because the underlying table snapshot will have changed. So the best approach is to avoid conflicts as much as possible by using partitioning and disjoint command conditions.
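
If you do fall back to try-catch-retry, a bounded retry with backoff keeps the cost contained. The sketch below is generic Python, not a Delta API: `ConcurrentWriteError` is a hypothetical stand-in for whichever concurrency exception your writer raises, and `flaky_write` simulates a write that conflicts twice before committing.

```python
import random
import time


class ConcurrentWriteError(Exception):
    """Hypothetical stand-in for a Delta concurrency exception."""


def retry_on_conflict(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on conflict with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConcurrentWriteError:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            sleep(delay)


attempts = {"count": 0}


def flaky_write():
    """Simulated write: conflicts on the first two attempts, then commits."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConcurrentWriteError("simulated conflict")
    return "committed"


# Pass a no-op sleep so the demonstration does not actually wait.
result = retry_on_conflict(flaky_write, sleep=lambda _delay: None)
print(result, "after", attempts["count"], "attempts")
```

Each retry has to re-read the latest snapshot, which is why partitioning writers onto disjoint data remains the cheaper option.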

1 More Replies
aladda
by Honored Contributor II
  • 4341 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

By default, a Delta table has stats collected on the first 32 columns. This setting can be configured using the following:

set spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3

However, there's a time trade-off to having a large n...
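
To see what the "first N columns" rule means in practice, here is a small sketch that reports which filter columns would fall outside the indexed range. It assumes a flat schema (nested fields are not modelled), and the helper name `columns_without_stats` and the 40-column schema are invented for illustration.

```python
def columns_without_stats(columns, filter_columns, num_indexed_cols=32):
    """Return the filter columns that are not among the first num_indexed_cols columns."""
    indexed = set(columns[:num_indexed_cols])
    return [name for name in filter_columns if name not in indexed]


# Hypothetical flat schema with 40 columns named c0 .. c39.
schema = [f"c{i}" for i in range(40)]

# With the default of 32, a filter on c35 would not benefit from stats.
print(columns_without_stats(schema, ["c3", "c35"]))

# Raising the setting to 40 covers it, at the cost of collecting more stats per write.
print(columns_without_stats(schema, ["c3", "c35"], num_indexed_cols=40))
```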

aladda
by Honored Contributor II
  • 840 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

It's typically a good idea to run OPTIMIZE at a cadence aligned with the frequency of updates to the Delta table. However, you also don't want to overdo it, as there's a cost/performance trade-off. Unless there are very frequent updates to the table that can cause sma...

aladda
by Honored Contributor II
  • 1046 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

OPTIMIZE merges small files into larger ones and can involve shuffling and the creation of large in-memory partitions. Thus, it's recommended to use a memory-optimized executor configuration to prevent spilling to disk. In addition, use of autoscaling wil...

aladda
by Honored Contributor II
  • 1050 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Z-ordering is generally effective on up to 3-4 columns, and the new clustering algorithm in DBR 7.6 can even go up to 5 columns. However, the key is to Z-order on columns that are typically used in filter/WHERE predicates and joins.
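
A minimal sketch of building the corresponding statement, with a guard that reflects the 3-4 column guideline above. The helper `zorder_sql`, the `max_columns` default, and the table and column names are assumptions for illustration; the limit is a rule of thumb, not something the engine enforces.

```python
def zorder_sql(table, columns, max_columns=4):
    """Build an OPTIMIZE ... ZORDER BY statement, rejecting overly long column lists."""
    if not columns:
        raise ValueError("at least one Z-order column is required")
    if len(columns) > max_columns:
        raise ValueError(f"Z-ordering on more than {max_columns} columns is unlikely to help")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Hypothetical table and columns that commonly appear in filters and joins.
print(zorder_sql("events", ["event_date", "customer_id"]))
```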
