Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16783853501
by Databricks Employee
  • 2002 Views
  • 2 replies
  • 0 kudos

What is the best way to convert a very large Parquet table to Delta, possibly without downtime?


Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

I vouch for Sajith's answer. The main advantage of "CONVERT TO DELTA" is that its operations are metadata-centric, which means we are not reading the full data for the conversion. For any other file format conversion, it's necessary to read the data com...
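A minimal sketch of the metadata-centric path, assuming a Databricks notebook where spark is available; the path and partition column are placeholders:

# The Parquet files stay in place; CONVERT TO DELTA only writes a transaction log
# over them, so existing readers are not blocked during the conversion.
spark.sql("""
    CONVERT TO DELTA parquet.`/mnt/datalake/events`
    PARTITIONED BY (event_date DATE)
""")

# Afterwards the same location can be read as a Delta table.
df = spark.read.format("delta").load("/mnt/datalake/events")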

1 More Replies
brickster_2018
by Databricks Employee
  • 2151 Views
  • 2 replies
  • 0 kudos

Why should I move to Auto Loader?

I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate and who should not?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

That makes sense @Anand Ladda! One major improvement that will have a direct impact on performance is the architectural difference. S3-SQS uses an internal implementation of the Delta table to store the checkpoint details about the source files...
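For comparison, a minimal Auto Loader sketch, assuming a Databricks notebook with spark available; the paths, file format, and schema are placeholders:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("id", StringType()),
    StructField("event_ts", TimestampType()),
])

# Auto Loader ("cloudFiles") discovers new files incrementally, with no SQS queue to manage.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .schema(schema)
       .load("/mnt/raw/events"))

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start("/mnt/delta/events"))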

1 More Replies
aladda
by Databricks Employee
  • 3580 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Stats collected on a Delta column are used for partition pruning and data skipping. See https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-data-skipping for details. In addition, stats are also used for metadata-only q...
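As an illustration (the table path and column are hypothetical), a selective filter on a stats-collected column lets Delta skip files whose min/max ranges cannot match the predicate:

# Only files whose event_date min/max range covers the literal are scanned.
spark.sql("""
    SELECT count(*)
    FROM delta.`/mnt/delta/events`
    WHERE event_date = '2021-06-01'
""").show()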

User16783853501
by Databricks Employee
  • 1987 Views
  • 2 replies
  • 0 kudos

Delta Optimistic Transactions Resolution and Exceptions

What is the best way to deal with concurrency exceptions in Delta when you have multiple writers on the same Delta table?

Latest Reply
sajith_appukutt
Databricks Employee
  • 0 kudos

While you can try-catch-retry, it would be expensive to retry, as the underlying table snapshot would have changed. So the best approach is to avoid conflicts by using partitioning and disjoint command conditions as much as possible.
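A sketch of the "disjoint conditions" idea, assuming a Delta table partitioned by an event_date column and an updates DataFrame that is already scoped to one partition (all names are placeholders):

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/delta/events")

# Pinning the partition in the merge condition keeps this writer's files disjoint
# from writers working on other dates, so their commits do not conflict.
(target.alias("t")
    .merge(updates.alias("s"),
           "t.event_date = '2021-06-01' AND t.event_date = s.event_date AND t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())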

1 More Replies
aladda
by Databricks Employee
  • 6772 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

By default, a Delta table has stats collected on the first 32 columns. This setting can be configured using the following: SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3. However, there's a time trade-off to having a large n...
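Both forms of the setting, with an illustrative value and a hypothetical table name:

# Session default applied to newly created Delta tables:
spark.sql("SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3")

# Or per table, on an existing table:
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '3')")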

aladda
by Databricks Employee
  • 1379 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

It's typically a good idea to run OPTIMIZE aligned with the frequency of updates to the Delta table. However, you also don't want to overdo it, as there's a cost/performance trade-off. Unless there are very frequent updates to the table that can cause sma...
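For example, a scheduled job might compact only the most recently written partition; the table name, partition column, and date literal below are placeholders:

# Scoping OPTIMIZE to one partition keeps the compaction cost proportional to the
# data that actually changed since the last run.
spark.sql("OPTIMIZE events WHERE event_date = '2021-06-01'")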

aladda
by Databricks Employee
  • 1845 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

OPTIMIZE merges small files into larger ones and can involve shuffling and the creation of large in-memory partitions. Thus it's recommended to use a memory-optimized executor configuration to prevent spilling to disk. In addition, use of autoscaling wil...
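An illustrative cluster spec in the shape of the Clusters API 2.0, with placeholder values; the point is the memory-optimized node type plus autoscaling, not these exact numbers:

cluster_spec = {
    "cluster_name": "delta-optimize-job",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "r5.2xlarge",                       # memory-optimized instance (AWS example)
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with the amount of small-file debt
}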

aladda
by Databricks Employee
  • 2125 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Z-ordering is generally effective on up to 3-4 columns, and the new clustering algorithm in DBR 7.6 can go up to 5 columns. However, the key is to Z-order on columns that are typically used in filter/WHERE predicates and joins.
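For example (hypothetical table and columns), Z-ordering on the two columns most often used in predicates and joins:

spark.sql("OPTIMIZE events ZORDER BY (event_date, user_id)")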

aladda
by Databricks Employee
  • 2268 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

This is typically caused by not having SSO enabled on the token with your Git provider. If you use SSO, you need to authorize your token for it.

aladda
by Databricks Employee
  • 4381 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

The gzip format is not splittable, so the load process is sequential and thus slower. You can try to split the CSV into parts, gzip those separately, and load them. Alternatively, bzip2 is a splittable compression format that is better to work with. Or you c...
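A sketch of the "split and compress separately" approach on the read side, assuming the part files land under a single (hypothetical) directory:

# One gzip file is still read by a single task, but with many part files Spark can
# assign one task per file and load the directory in parallel.
df = (spark.read
      .option("header", "true")
      .csv("/mnt/raw/large_csv_parts/*.csv.gz"))

df.write.format("delta").save("/mnt/delta/large_table")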

aladda
by Databricks Employee
  • 1840 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Courtesy of my colleague Sri, here's some sample library code to execute on a Databricks cluster with a short SLA:

import logging
import textwrap
import time
from typing import Text

from databricks_cli.sdk import ApiClient, ClusterService

# Create a cu...
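A minimal sketch of how such a helper might continue, using only the ApiClient and ClusterService calls from the imports above; the host, token, and cluster ID are placeholders:

client = ApiClient(host="https://<workspace-url>", token="<personal-access-token>")
clusters = ClusterService(client)

cluster_id = "<cluster-id>"

# Start the cluster if it is not already up, then poll until it is RUNNING so a
# short-SLA workload is not submitted to a cold cluster.
if clusters.get_cluster(cluster_id)["state"] == "TERMINATED":
    clusters.start_cluster(cluster_id)

while clusters.get_cluster(cluster_id)["state"] != "RUNNING":
    time.sleep(15)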

User16783853501
by Databricks Employee
  • 2502 Views
  • 2 replies
  • 1 kudos

Using Spark SQL, or particularly %sql in a Databricks notebook, is there a way to use pagination, offset, or skip?


Latest Reply
sajith_appukutt
Databricks Employee
  • 1 kudos

There is no OFFSET support yet. Here are a few possible workarounds. If your data is all in one partition (rarely the case), you could create a column with monotonically_increasing_id and apply filter conditions. If there are multiple partitions...
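A sketch of the multi-partition case using a window-based row number (table and ordering column are placeholders); monotonically_increasing_id avoids the global sort but its values are not contiguous across partitions:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df = spark.table("events")

# A global ordering gives stable, gap-free row numbers at the cost of a full sort.
w = Window.orderBy("event_ts")
indexed = df.withColumn("rn", row_number().over(w))

# "Page" 3 with a page size of 100 rows:
page = indexed.where("rn > 200 AND rn <= 300").drop("rn")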

1 More Replies
aladda
by Databricks Employee
  • 1867 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Delta Live Tables supports data quality checks via expectations. On encountering invalid records, you can choose to either retain them, drop them, or fail/stop the pipeline. See the link below for additional details: https://docs.databricks.com/data-e...
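An illustrative expectation in a Delta Live Tables pipeline (the table name, path, and rule are placeholders); expect only records violations, expect_or_drop drops the offending rows, and expect_or_fail stops the update:

import dlt

@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
def clean_events():
    # Rows with a NULL id are dropped and counted in the pipeline's quality metrics.
    return spark.read.format("delta").load("/mnt/raw/events")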

