Data Engineering

Forum Posts

aladda
by Honored Contributor II
  • 659 Views
  • 0 replies
  • 0 kudos

What are the recommendations around collecting stats on long strings in a Delta Table

It is best to avoid collecting stats on long strings. You typically want to collect stats on columns that are used in filters, WHERE clauses, and joins, and on which you tend to perform aggregations, typically numeric values. You can avoid collecting s...
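As a concrete illustration, one way to keep long strings out of stats collection is to order them after the indexed column prefix and shrink that prefix via a table property. A minimal sketch, with a hypothetical table and column names:

spark.sql("""
    CREATE TABLE events (
        event_id BIGINT,      -- indexed: useful for filters and joins
        event_ts TIMESTAMP,   -- indexed: useful for range predicates
        payload  STRING       -- long string, deliberately left unindexed
    )
    USING DELTA
    TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '2')
""")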

User16783853501
by New Contributor II
  • 741 Views
  • 2 replies
  • 0 kudos

Delta Optimistic Transactions Resolution and Exceptions

What is the best way to deal with concurrent write exceptions in Delta when you have multiple writers on the same Delta table?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

While you can try-catch-retry, retrying is expensive because the underlying table snapshot will have changed. So the best approach is to avoid conflicts by using partitioning and disjoint command conditions as much as possible.
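As a rough sketch of the try-catch-retry pattern mentioned above, assuming the open-source delta-spark Python package and a hypothetical table name (prefer disjoint partitions over retries where possible):

from delta.exceptions import ConcurrentAppendException

# A toy DataFrame standing in for the real payload (assumed).
df = spark.createDataFrame([(1, "a")], ["id", "val"])

# Retry an append that may conflict with concurrent writers; each attempt
# runs against a fresh snapshot, which is what makes retries expensive.
for attempt in range(3):
    try:
        df.write.format("delta").mode("append").saveAsTable("events")
        break
    except ConcurrentAppendException:
        if attempt == 2:
            raise  # give up after three attempts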

1 More Replies
aladda
by Honored Contributor II
  • 3034 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

By default, a Delta table has stats collected on the first 32 columns. This setting can be configured using the following: set spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3. However, there's a time trade-off to having a large n...
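For example, a sketch with a made-up table name, showing both the session-level default and a per-table override:

# Default for Delta tables created in this session:
spark.conf.set(
    "spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols", "3")

# Or pin the setting on an existing table:
spark.sql(
    "ALTER TABLE sales SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '3')")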

aladda
by Honored Contributor II
  • 636 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

It's typically a good idea to run OPTIMIZE aligned with the frequency of updates to the Delta table. However, you also don't want to overdo it, as there's a cost/performance trade-off. Unless there are very frequent updates to the table that can cause sma...
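For instance, a scheduled job aligned with a daily load might run something like this sketch (the table and partition column are hypothetical):

# Compact the whole table...
spark.sql("OPTIMIZE sales")

# ...or bound the cost to recently written partitions
# (assumes `date` is a partition column):
spark.sql("OPTIMIZE sales WHERE date >= current_timestamp() - INTERVAL 1 day")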

aladda
by Honored Contributor II
  • 668 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

OPTIMIZE merges small files into larger ones and can involve shuffling and the creation of large in-memory partitions. Thus it's recommended to use a memory-optimized executor configuration to prevent spilling to disk. In addition, use of autoscaling wil...
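As an illustration of what that might look like when running OPTIMIZE on a dedicated job cluster submitted via the Clusters API, with node type and worker counts as purely illustrative assumptions, not recommendations:

# Sketch of a job-cluster spec: memory-optimized workers plus autoscaling.
cluster_spec = {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "r5.2xlarge",  # memory-optimized instance family (AWS example)
    "autoscale": {"min_workers": 2, "max_workers": 8},
}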

aladda
by Honored Contributor II
  • 687 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Z-ordering is generally effective on up to 3-4 columns, and the new clustering algorithm in DBR 7.6 can go up to 5 columns. However, the key is to Z-order on columns that are typically used in filter/WHERE predicates and joins.
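For example (table and column names hypothetical):

# Co-locate data on the columns most used in predicates and joins.
spark.sql("OPTIMIZE events ZORDER BY (customer_id, event_date)")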

aladda
by Honored Contributor II
  • 524 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

This is typically caused by not having SSO enabled on the token with your Git provider. If you have SSO enabled, you need to authorize your token for it.

aladda
by Honored Contributor II
  • 1659 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

The gzip format is not splittable, so the load process is sequential and thus slower. You can try to split the CSV up into parts, gzip those separately, and load them. Alternatively, bzip2 is a splittable compression format that is better to work with. Or you c...
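A minimal sketch of the gzip case, with hypothetical paths: the read itself is single-task, but repartitioning immediately afterwards lets downstream work parallelize:

# gzip is not splittable, so this read runs as a single task.
df = spark.read.option("header", "true").csv("/mnt/raw/big_file.csv.gz")

# Redistribute so subsequent transformations and the write can parallelize.
df = df.repartition(64)
df.write.format("delta").save("/mnt/delta/big_table")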

aladda
by Honored Contributor II
  • 708 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Courtesy of my colleague Sri, here's some sample library code to execute on a Databricks cluster with a short SLA:

import logging
import textwrap
import time
from typing import Text

from databricks_cli.sdk import ApiClient, ClusterService

# Create a cu...
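The snippet is cut off above; as a hedged sketch of where such a helper might go, here is a polling loop built on the same ApiClient/ClusterService imports (the host, token, timeout, and function name are all illustrative assumptions, not part of the original code):

import logging
import time

from databricks_cli.sdk import ApiClient, ClusterService

logging.basicConfig(level=logging.INFO)

# Hypothetical workspace credentials; substitute your own.
client = ApiClient(host="https://<workspace>.cloud.databricks.com", token="<token>")
clusters = ClusterService(client)

def wait_for_cluster(cluster_id: str, timeout_s: int = 300) -> None:
    """Poll a cluster until it reaches RUNNING or the SLA window expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = clusters.get_cluster(cluster_id)["state"]
        logging.info("cluster %s is %s", cluster_id, state)
        if state == "RUNNING":
            return
        time.sleep(10)
    raise TimeoutError(f"cluster {cluster_id} not RUNNING within {timeout_s}s")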

User16783853501
by New Contributor II
  • 1222 Views
  • 2 replies
  • 1 kudos

Using Spark SQL, or particularly %sql in a Databricks notebook, is there a way to use pagination or offset/skip?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

There is no OFFSET support yet. Here are a few possible workarounds. If your data is all in one partition (rarely the case), you could create a column with monotonically_increasing_id and apply filter conditions. If there are multiple partitions...
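A rough sketch of the single-partition workaround (the DataFrame, page size, and page number are made up):

from pyspark.sql.functions import col, monotonically_increasing_id

df = spark.range(1000).toDF("value")  # stand-in for the real data (assumed)
page_size, page = 100, 3              # e.g. the fourth page of 100 rows

# On a single partition the generated ids are consecutive from 0,
# so they can act as an offset.
paged = (df.coalesce(1)
           .withColumn("row_id", monotonically_increasing_id())
           .where((col("row_id") >= page * page_size) &
                  (col("row_id") < (page + 1) * page_size)))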

1 More Replies
aladda
by Honored Contributor II
  • 683 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Delta Live Tables supports data quality checks via expectations. On encountering invalid records, you can choose to either retain them, drop them, or fail/stop the pipeline. See the link below for additional details: https://docs.databricks.com/data-e...
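A minimal sketch of an expectation in a Python DLT notebook (the dataset, rule, and source names are made up); expect retains violations, expect_or_drop drops them, and expect_or_fail stops the pipeline:

import dlt

@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # drop records violating the rule
def clean_events():
    return spark.table("raw_events")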

aladda
by Honored Contributor II
  • 1421 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Here's the difference between a view and a table in the context of a Delta Live Tables pipeline. Views are similar to a temporary view in SQL and are an alias for some computation. A view allows you to break a complicated query into smaller or easier-to-understan...
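For instance (names hypothetical), a view can serve as an unpersisted intermediate step feeding a materialized table:

import dlt

@dlt.view  # an alias for a computation; not persisted
def open_orders():
    return spark.table("raw_orders").where("status = 'OPEN'")

@dlt.table  # materialized to the pipeline's storage
def open_order_counts():
    return dlt.read("open_orders").groupBy("region").count()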

aladda
by Honored Contributor II
  • 754 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Yes. You can specify a "target" database as part of your DLT pipeline configuration to publish results to a target database in the metastore. See https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-quickstart.html#publi...
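Sketched here as the relevant fragment of the pipeline settings (the pipeline name and database are placeholders):

pipeline_settings = {
    "name": "my_pipeline",
    "target": "analytics_db",  # tables are published to this metastore database
}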

aladda
by Honored Contributor II
  • 615 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

DLT pipeline results are published to the "Storage Location" defined as part of configuring the pipeline, e.g. https://docs.databricks.com/_images/dlt-create-notebook-pipeline.png. If an explicit Storage Location is not specified, the pipeline results ...
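The corresponding settings fragment might look like this sketch (the path and name are placeholders):

pipeline_settings = {
    "name": "my_pipeline",
    "storage": "/mnt/dlt/my_pipeline",  # tables and metadata are written here
}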
