Data Engineering

Forum Posts

User16869510359
by Esteemed Contributor
  • 12850 Views
  • 1 replies
  • 0 kudos

Resolved! How do I change the log level in Databricks?

How can I change the log level of the Spark driver and executor processes?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Change the log level of Driver:

%scala
spark.sparkContext.setLogLevel("DEBUG")
spark.sparkContext.setLogLevel("INFO")

Change the log level of a particular package in Driver logs:

%scala
org.apache.log4j.Logger.getLogger("shaded.databricks.v201809...
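
A minimal PySpark sketch of the same idea, assuming it runs in a Databricks notebook where spark is predefined; the package name com.example.myapp is a placeholder:

%python
# Set the overall log level for the driver.
spark.sparkContext.setLogLevel("DEBUG")

# Quiet one noisy package only (log4j 1.x API reached through the JVM gateway).
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("com.example.myapp").setLevel(log4j.Level.WARN)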

User16869510359
by Esteemed Contributor
  • 1013 Views
  • 1 replies
  • 0 kudos

Resolved! I do not have any Spark jobs running, but my cluster is not getting auto-terminated.

The cluster is idle and there are no Spark jobs running in the Spark UI, yet the cluster stays active and is not auto-terminated.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

A Databricks cluster is treated as active if there are any Spark or non-Spark operations running on it. Even though there are no Spark jobs running on the cluster, it's possible to have some driver-specific application code running, marking th...
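
A hedged illustration of the kind of driver-only code the reply is referring to: it launches no Spark jobs, so the Spark UI looks idle, yet it still counts as cluster activity. The polling loop and its timings are hypothetical.

%python
import time

# Driver-only work: no Spark jobs appear in the UI, but the cluster is "busy".
for attempt in range(5):
    time.sleep(60)  # e.g. waiting on an external service before the next step
    # process_external_results(...)  # hypothetical non-Spark work on the driver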

User16869510359
by Esteemed Contributor
  • 2015 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Disclaimer: This code snippet uses an internal API. It's not recommended to use internal APIs in your application, as they are subject to change or discontinuation.

%python
import requests
API_URL = dbutils.notebook.entry_point.getDbutils().notebook(...
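
A hedged sketch of how that truncated snippet commonly continues. apiUrl()/apiToken() are internal notebook-context calls and may change without notice, and the clusters/list endpoint is only an example target; this assumes a Databricks notebook where dbutils is available.

%python
import requests

# Internal API: pull the workspace URL and a short-lived token from the notebook context.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
API_URL = ctx.apiUrl().getOrElse(None)
TOKEN = ctx.apiToken().getOrElse(None)

# Example REST call (clusters/list is just an illustration).
resp = requests.get(
    f"{API_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(resp.json())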

User16869510359
by Esteemed Contributor
  • 1369 Views
  • 1 replies
  • 0 kudos

Resolved! Why do I see my job marked as failed on the Databricks Jobs UI, even though it completed the operations in the application?

I have a JAR job that was migrated from EMR to Databricks. The job runs as expected and completes all the operations in the application. However, the job run is marked as failed in the Databricks Jobs UI.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Using spark.stop(), sc.stop(), or System.exit() in your application can cause this behavior. Databricks manages the context shutdown on its own; forcefully closing it can cause this abrupt behavior.
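
A hedged sketch of the anti-pattern, shown in Python for brevity even though the original is a JAR job; the ETL body and output path are placeholders. On Databricks the final stop/exit call should simply be removed.

%python
def main():
    # ... application logic: read, transform, write ...
    df = spark.range(10)
    df.write.mode("overwrite").format("delta").save("dbfs:/tmp/example_output")

main()
# Anti-pattern on Databricks: the platform shuts the context down itself.
# spark.stop()   # <- remove this (likewise sc.stop() / System.exit() in JVM code)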

User16869510359
by Esteemed Contributor
  • 553 Views
  • 1 replies
  • 2 kudos

A few things you should not do in Databricks!

A few things you should not do in Databricks!

Latest Reply
User16869510359
Esteemed Contributor
  • 2 kudos

Compared to OSS Spark, these are a few things users don't have to worry about when running the same job on Databricks. Memory management: Databricks uses an internal formula to allocate the driver and executor heap based on the size of the instance...

User16869510359
by Esteemed Contributor
  • 1568 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Although not a hard limit, it's recommended to keep the number of cells in a notebook under 100 for a better UI experience as well as code readability. Having a really large block of code in a single cell defeats the purpose of notebook execution and al...

User16869510359
by Esteemed Contributor
  • 9511 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Yes, it's possible to download files from DBFS. Files stored in /FileStore are accessible in your web browser at https://<databricks-instance-name>.cloud.databricks.com/files/. For example, the file you stored in /FileStore/my-da...
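
A hedged sketch of the usual workflow, assuming a Databricks notebook where dbutils is available; the file names are placeholders.

%python
# Copy a result file into /FileStore so it can be downloaded from the browser.
dbutils.fs.cp("dbfs:/tmp/my-report.csv", "dbfs:/FileStore/my-report.csv")

# The file should then be reachable at:
#   https://<databricks-instance-name>.cloud.databricks.com/files/my-report.csv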

User16783853501
by New Contributor II
  • 688 Views
  • 2 replies
  • 0 kudos

What is the best way to convert a very large Parquet table to Delta, possibly without downtime?

What is the best way to convert a very large Parquet table to Delta, possibly without downtime?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

I vouch for Sajith's answer. The main advantage of "CONVERT TO DELTA" is that the operation is metadata-centric, which means we are not reading the full data for the conversion. For any other file format conversion, it's necessary to read the data com...
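
A hedged sketch of the command, run here through spark.sql from a notebook; the path and partition column are placeholders.

%python
# Convert an existing Parquet table in place to Delta (metadata-centric: data files are not rewritten).
spark.sql("""
  CONVERT TO DELTA parquet.`/mnt/datalake/events`
  PARTITIONED BY (event_date DATE)
""")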

1 More Replies
User16869510359
by Esteemed Contributor
  • 675 Views
  • 2 replies
  • 0 kudos

Why should I move to Auto-loader?

I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate, and who should not?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

That makes sense @Anand Ladda! One major improvement that has a direct impact on performance is the architectural difference. S3-SQS uses an internal implementation of the Delta table to store the checkpoint details about the source files...
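
For comparison, a hedged Auto Loader sketch replacing an S3-SQS source; the bucket, schema, and checkpoint paths are placeholders.

%python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Explicit schema for the incoming JSON files (placeholder columns).
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Incrementally pick up new files with Auto Loader instead of S3-SQS.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema(schema)
      .load("s3://my-bucket/raw/events/"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/events")
   .start("dbfs:/mnt/delta/events"))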

1 More Replies
User16783853501
by New Contributor II
  • 1126 Views
  • 3 replies
  • 0 kudos

Best practice for optimizedWrites and Optimize

What is the best practice for a Delta pipeline with very high throughput to avoid the small-files problem and also reduce the need for frequent external OPTIMIZE runs?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The general practice in use is to enable only optimized writes and disable auto-compaction. This is because optimized writes introduce an extra shuffle step, which will increase the latency of the write operation. In addition to that, the auto-...
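
A hedged sketch of those two knobs, shown per table and per session; the table name is a placeholder.

%python
# Per-table: turn on optimized writes and leave auto-compaction off.
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'false'
  )
""")

# Per-session equivalents (apply to all Delta writes in this Spark session).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "false")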

2 More Replies
aladda
by Honored Contributor II
  • 1085 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Stats collected on a Delta column are used for either partition pruning or data skipping. See https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-data-skipping for details. In addition, stats are also used for metadata-only q...

aladda
by Honored Contributor II
  • 647 Views
  • 0 replies
  • 0 kudos

What are the recommendations around collecting stats on long strings in a Delta table?

It is best to avoid collecting stats on long strings. You typically want to collect stats on columns that are used in filters, WHERE clauses, and joins, and on which you tend to perform aggregations - typically numerical values. You can avoid collecting s...
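
A hedged sketch of one way to keep a long string column out of stats collection; the table name and the value 5 are placeholders. The idea is to lower the number of leading columns that get stats and keep long free-text columns positioned after that cut-off in the table schema.

%python
# Only the first 5 columns of this table will have stats collected.
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.dataSkippingNumIndexedCols' = '5'
  )
""")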

User16783853501
by New Contributor II
  • 730 Views
  • 2 replies
  • 0 kudos

Delta Optimistic Transactions Resolution and Exceptions

What is the best way to deal with concurrent exceptions in Delta when you have multiple writers on the same Delta table?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

While you can try-catch-retry, it would be expensive to retry as the underlying table snapshot would have changed. So the best approach is to avoid conflicts by using partitioning and disjoint command conditions as much as possible.
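
A hedged sketch of the disjoint-conditions idea: a MERGE scoped to a single partition so concurrent writers don't touch the same files. The table, column names, and partition value are placeholders.

%python
from delta.tables import DeltaTable

def upsert_one_partition(updates_df, event_date):
    """Merge updates for a single event_date partition only."""
    target = DeltaTable.forName(spark, "events")
    (target.alias("t")
        .merge(
            updates_df.alias("s"),
            # Restricting the condition to one partition keeps concurrent writers disjoint.
            f"t.event_date = '{event_date}' AND t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())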

1 More Replies
aladda
by Honored Contributor II
  • 2992 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

By default, a Delta table has stats collected on the first 32 columns. This setting can be configured using the following:

set spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3

However, there's a time trade-off to having a large n...
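
A hedged Python equivalent of that setting, plus the per-table property form; the value 3 and the table name are placeholders.

%python
# Session-level default applied to newly created Delta tables.
spark.conf.set("spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols", "3")

# Per-table override on an existing table.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '3')")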
