Data Engineering

Forum Posts

User16783853906
by Contributor III
  • 3144 Views
  • 1 reply
  • 0 kudos

Metaexception [Version information not found in metastore] during cluster [re]start

Trying to configure a new external metastore and running into the following exception during cluster initialization:

Caused by: MetaException(message:Version information not found in metastore.)
  at org.apache.hadoop.hive.metastore.RetryingHMSHandl...

Latest Reply
User16783853906
Contributor III
  • 0 kudos

The above exception happens when the Hive schema is not available in the metastore instance. Please check your init scripts to make sure the following flag is enabled to create the Hive schema and tables if they are not already present: datanucleus.autoCreateA...
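
A minimal sketch of the relevant cluster Spark config (hedged: these property names follow the standard DataNucleus settings for a Hive 1.x-style external metastore; adjust for your Hive version):

spark.hadoop.datanucleus.autoCreateSchema true
spark.hadoop.datanucleus.fixedDatastore false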

User16869510359
by Esteemed Contributor
  • 752 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The below code snippet can be used to get the DBR details on a high-concurrency (HC) cluster:

print("hadoopVersion:" + sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
print("baseVersion:" + sc._gateway.jvm.org.apache.spark.BuildInfo.sparkBranch())
print(...
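
If only the runtime version is needed, a hedged alternative is to read it from the cluster's Spark conf; the key below is an assumption based on standard Databricks cluster usage tags:

# Assumed conf key; verify it exists on your cluster before relying on it.
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))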

aladda
by Honored Contributor II
  • 596 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Databricks notebooks can be exported and stored in S3 or any other object storage. The internal storage of a Databricks notebook cannot be changed or configured; the implementation is internal to the Databricks control plane and is not user configurable.
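
As a hedged sketch of the export route mentioned above, the Workspace API's export endpoint returns a notebook's source as base64, which you can then write to S3 yourself (host, token, and paths below are placeholders):

import base64
import requests

host = "https://<databricks-instance>"   # placeholder: your workspace URL
token = "<personal-access-token>"        # placeholder

resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users/me@example.com/my-notebook", "format": "SOURCE"},
)
resp.raise_for_status()
source = base64.b64decode(resp.json()["content"])  # notebook source, ready to upload to S3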

User16869510359
by Esteemed Contributor
  • 1239 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The below code snippet is useful to get the modification time of files:

%scala
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.io.IOExcep...
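
A hedged Python equivalent of the truncated Scala snippet above, using the Hadoop FileSystem API through py4j (the sample path is hypothetical):

# Read a file's modification time via the JVM's Hadoop FileSystem API.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
path = Path("dbfs:/databricks-datasets/README.md")   # hypothetical path
fs = path.getFileSystem(hadoop_conf)
print(fs.getFileStatus(path).getModificationTime())  # epoch milliseconds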

User16869510359
by Esteemed Contributor
  • 3477 Views
  • 2 replies
  • 0 kudos

Resolved! How and when to capture the thread dump of the Spark driver?

What is the best way to capture a thread dump of the Spark driver process? Also, when should I capture the thread dump?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

For the Spark driver the process is the same: choose the driver from the Executors page and view the thread dump. A thread dump is a snapshot of the JVM's thread stacks; thread dumps are very useful in debugging issues where the JVM process is stuck or making extremely slow progress...
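
Besides the Spark UI, a hedged driver-side alternative is the standard JDK management API, reachable from a notebook via py4j:

# Print a stack snapshot for every thread in the driver JVM.
mx = sc._gateway.jvm.java.lang.management.ManagementFactory.getThreadMXBean()
for info in mx.dumpAllThreads(False, False):   # (lockedMonitors, lockedSynchronizers)
    print(info.toString())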

1 More Replies
User16869510359
by Esteemed Contributor
  • 990 Views
  • 2 replies
  • 0 kudos

Resolved! Autoloader: How to identify the backlog in RocksDB

With S3-SQS it was easier to identify the backlog (the messages that have been fetched from SQS but not yet consumed by the streaming job). How do I find the same with Auto Loader?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

For DBR 8.2 and later, the backlog details are captured in the streaming metrics.
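
A minimal sketch of reading those metrics from a running query (the query handle is an assumption; the metric names follow Databricks' documented Auto Loader metrics such as numFilesOutstanding and numBytesOutstanding):

q = spark.streams.active[0]        # assumption: your Auto Loader query handle
progress = q.lastProgress
if progress and progress.get("sources"):
    print(progress["sources"][0].get("metrics"))   # e.g. numFilesOutstanding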

1 More Replies
User16783854657
by New Contributor III
  • 1021 Views
  • 1 reply
  • 0 kudos

Resolved! How to ensure that a Databricks Run Submit run invoked from Airflow only runs one time?

I am running jobs on Databricks using the Run Submit API with Airflow. I have noticed that, rarely, a particular run is triggered more than once at the same time. Why?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Idempotency can be ensured by providing an idempotency token. It's easy to pass the token through the REST API, as described in the doc below: https://kb.databricks.com/jobs/jobs-idempotency.html The primary reason for multiple runs is the client submits t...
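
A hedged sketch of passing the token on the Run Submit API (host, token, notebook path, and cluster spec are placeholders; reusing the same idempotency_token on retries is what prevents duplicate runs):

import uuid

import requests

host = "https://<databricks-instance>"   # placeholder
token = "<personal-access-token>"        # placeholder

payload = {
    "run_name": "airflow-run",
    "idempotency_token": str(uuid.uuid4()),  # persist and reuse this on every retry
    "tasks": [{
        "task_key": "main",
        "notebook_task": {"notebook_path": "/Users/me@example.com/my-notebook"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    }],
}
resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(resp.json())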

User16869510359
by Esteemed Contributor
  • 635 Views
  • 1 reply
  • 0 kudos

Resolved! Performance improvement after running VACUUM commands

How often should I run VACUUM commands? Will running the VACUUM command on a Delta table improve my read/write performance, or is it just a storage benefit?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

VACUUM removes uncommitted/stale files from storage. The primary benefit is saving storage cost. Ideally, running VACUUM should not show any performance improvement, as Delta does not list the storage directories but rather accesses the files directly...
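
A hedged sketch (the table name is hypothetical); DRY RUN previews which files would be deleted before committing to the cleanup:

spark.sql("VACUUM my_table RETAIN 168 HOURS DRY RUN").show(truncate=False)
spark.sql("VACUUM my_table RETAIN 168 HOURS")   # 168 hours = the default 7-day retention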

User16869510359
by Esteemed Contributor
  • 3373 Views
  • 1 reply
  • 2 kudos

Resolved! Databricks Spark Vs Spark on Yarn

I am moving my Spark workloads from EMR/an on-premise Spark cluster to Databricks. I understand Databricks Spark is different from Yarn. How is the Databricks architecture different from Yarn's?

Latest Reply
User16869510359
Esteemed Contributor
  • 2 kudos

Users often compare a Databricks cluster with a Yarn cluster. It's not an apples-to-apples comparison. A Databricks cluster should be compared to a Spark application that is submitted on Yarn. A Spark application on Yarn will have a driver container and executor containers...

User16783854657
by New Contributor III
  • 697 Views
  • 1 reply
  • 1 kudos

Does running OPTIMIZE on a Delta table destroy the transaction history of the table?

If I run OPTIMIZE on a Delta Lake table, will it prevent me from time travelling to a version before OPTIMIZE was run?

Latest Reply
User16783854657
New Contributor III
  • 1 kudos

No, you will still be able to time travel to versions previous to the OPTIMIZE command. OPTIMIZE is just another transaction like MERGE, UPDATE, etc. Check out these docs to learn more about retention periods and the VACUUM command.
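
A quick sketch illustrating the point (the table name is hypothetical): OPTIMIZE simply adds a new commit, and earlier versions stay reachable via time travel:

spark.sql("OPTIMIZE my_table")
spark.sql("DESCRIBE HISTORY my_table").show(truncate=False)   # OPTIMIZE appears as one more version
spark.sql("SELECT * FROM my_table VERSION AS OF 0").show()    # pre-OPTIMIZE version still readable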

User16826987838
by Contributor
  • 1125 Views
  • 1 reply
  • 0 kudos

How do I find the users in a workspace?

Looking to pull a list of all the users in their workspaces (including the ones who have never done anything); is there a way to do that? This is for AWS.

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

You can use the SCIM API. Endpoint: https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#get-users Or you can use the Workspace API. The Workspace API does not have a direct list-users command, but you can use the workspace API to l...
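
A hedged sketch of listing users through the SCIM Get Users endpoint (host and token are placeholders):

import requests

host = "https://<databricks-instance>"   # placeholder
token = "<personal-access-token>"        # placeholder

resp = requests.get(
    f"{host}/api/2.0/preview/scim/v2/Users",
    headers={"Authorization": f"Bearer {token}"},
)
for user in resp.json().get("Resources", []):
    print(user.get("userName"))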

User16783853906
by Contributor III
  • 5835 Views
  • 2 replies
  • 1 kudos

Resolved! Max Columns for Delta table

Is there an upper limit/recommended max value for the number of columns in a Delta table?

Latest Reply
User16783853906
Contributor III
  • 1 kudos

Original answer posted by @Gray Gwizdz: This was a fun question to try and find the answer to, thank you for that! I reviewed some of the most recent issues/bugs reported with Delta Lake and was able to find a similar issue where a user was running i...

1 More Replies
User16783853501
by New Contributor II
  • 1099 Views
  • 0 replies
  • 1 kudos

Databricks Autoloader Best practice

Databricks Auto Loader is a popular mechanism for ingesting data/files from cloud storage into Delta; for a very high-throughput source, what are the best practices to follow while scaling up an Auto Loader based pipeline to the tune of millions ...

User16783853906
by Contributor III
  • 960 Views
  • 3 replies
  • 0 kudos

Resolved! How to reuse Pandas code in PySpark?

I have single-threaded Pandas code that is not yet supported by Koalas and not easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code; is this possible?

Latest Reply
User16783853906
Contributor III
  • 0 kudos

This is for a specific scenario where the code is not yet supported by Koalas. One approach to consider is using a Pandas UDF, and splitting up the work in a way that allows your processing to move forward. This notebook is a great example of taking ...
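
A minimal Pandas UDF sketch (the function and column names are illustrative): Spark ships column batches to the UDF as pandas Series, so existing single-threaded Pandas logic can run on each batch in parallel:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1          # your existing Pandas logic goes here

df = spark.range(10)
df.select(plus_one(col("id").cast("double"))).show()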

2 More Replies
User16783853906
by Contributor III
  • 3693 Views
  • 2 replies
  • 0 kudos

Trigger.once mode recommendation

When is it recommended to use Trigger.once mode compared to fixed processing intervals with micro batches?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Also note that configurations like maxFilesPerTrigger and maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly lower throughput can switch to Trigger.Once to avoid the continuous execution of the job checking the availab...
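
A hedged sketch of a Trigger.Once Auto Loader stream (paths and table name are hypothetical; newer releases also offer trigger(availableNow=True) for rate-limited catch-up):

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://my-bucket/landing/")                             # hypothetical source
    .writeStream
    .trigger(once=True)                                          # process everything once, then stop
    .option("checkpointLocation", "s3://my-bucket/checkpoints/") # hypothetical checkpoint
    .toTable("target_table"))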

1 More Replies