Data Engineering

Forum Posts

by aladda • Databricks Employee

06-23-2021 8:40:03 PM

2181 Views
1 replies
0 kudos

Resolved! What are the different options for dealing with invalid records in a Delta Live Table

Latest Reply

aladda
Databricks Employee

06-23-2021 8:41:27 PM

0 kudos

Delta Live Tables supports data quality checks via expectations. When invalid records are encountered, you can choose to retain them, drop them, or fail (stop) the pipeline. See the link below for additional details: https://docs.databricks.com/data-e...
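In the DLT Python API, the three behaviors correspond to the `@dlt.expect` (retain), `@dlt.expect_or_drop` (drop) and `@dlt.expect_or_fail` (fail) decorators. Because the `dlt` module can only be imported inside a running pipeline, here is a plain-Python sketch of the same three policies. The function `apply_expectation`, the policy names and the sample rows are illustrative assumptions, not part of the DLT API.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


def apply_expectation(
    rows: List[Row],
    predicate: Callable[[Row], bool],
    on_violation: str = "retain",
) -> Tuple[List[Row], int]:
    """Apply one data quality rule; return (rows kept, number of violations)."""
    if on_violation not in ("retain", "drop", "fail"):
        raise ValueError(f"unknown policy: {on_violation}")
    invalid = [row for row in rows if not predicate(row)]
    if invalid and on_violation == "fail":
        # Stop processing when at least one record violates the rule.
        raise ValueError(f"{len(invalid)} record(s) violated the expectation")
    if on_violation == "drop":
        # Keep only the rows that satisfy the rule.
        return [row for row in rows if predicate(row)], len(invalid)
    # "retain": keep every row, but still report the violation count.
    return list(rows), len(invalid)


# Hypothetical sample data: the second row breaks the rule below.
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]


def has_positive_amount(row: Row) -> bool:
    return row["amount"] > 0


kept, violations = apply_expectation(rows, has_positive_amount, "drop")
print(kept, violations)
```

With "retain" the invalid row stays in the output and only the count records the problem, which is the trade-off to weigh against "drop" and "fail".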

by aladda • Databricks Employee

06-23-2021 8:28:12 PM

1947 Views
1 replies
0 kudos

Resolved! Can you publish results of a Delta Live Table pipeline to a database in the metastore?

Latest Reply

aladda
Databricks Employee

06-23-2021 8:29:41 PM

0 kudos

Yes. You can specify a "target" database in your DLT pipeline configuration to publish the results to that database in the metastore. See: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-quickstart.html#publi...
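As a rough illustration, the pipeline settings might carry the target like this. This is a sketch only: the pipeline name, notebook path, database name and the helper `published_table_name` are hypothetical placeholders.

```python
import json

# Hypothetical pipeline settings; every value here is a placeholder.
pipeline_settings = {
    "name": "sales_pipeline",
    "libraries": [{"notebook": {"path": "/Users/someone@example.com/dlt_sales"}}],
    # Database in the metastore that receives the published tables.
    "target": "sales_db",
}


def published_table_name(settings: dict, table: str) -> str:
    """Return the qualified name a pipeline table would have once published."""
    return f"{settings['target']}.{table}"


print(json.dumps(pipeline_settings, indent=2))
print(published_table_name(pipeline_settings, "daily_totals"))
```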

by aladda • Databricks Employee

06-23-2021 8:22:31 PM

1586 Views
1 replies
0 kudos

Where are the results of a Delta Live Table Pipeline published to?

Latest Reply

aladda
Databricks Employee

06-23-2021 8:26:09 PM

0 kudos

DLT pipeline results are published to the "Storage Location" defined when configuring the pipeline. Example: https://docs.databricks.com/_images/dlt-create-notebook-pipeline.png If an explicit Storage Location is not specified, the pipeline results ...

by aladda • Databricks Employee

06-23-2021 8:04:53 PM

2218 Views
1 replies
1 kudos

Resolved! Why can't I run a notebook that has Delta Live Tables/Views defined against a cluster like other Notebooks

Latest Reply

aladda
Databricks Employee

06-23-2021 8:13:06 PM

1 kudos

Notebooks with Delta Live Tables table or view definitions contain only the pipeline definition. To execute a Delta Live Tables notebook, you need to define a pipeline via the Jobs UI. The pipeline carries with it the logic to build the dependency graph betw...

by User16137833804 • Databricks Employee

06-23-2021 1:14:27 PM

2357 Views
1 replies
1 kudos

Once I set up the Git Server Proxy, what would be the best way to set alerts in case the Cluster Proxy goes down?

Latest Reply

sajith_appukutt
Databricks Employee

06-23-2021 7:51:55 PM

1 kudos

You could have the single-node cluster where the proxy is installed monitored by a tool such as CloudWatch, Azure Monitor, or Datadog, configured to send alerts on node failure.

by User16826994223 • Databricks Employee

06-23-2021 12:07:44 AM

1701 Views
1 replies
0 kudos

DBFS root resides in the Customer account or Databricks Account

When I set up the root bucket, I see that a root bucket is created with the workspace. Does this bucket reside in the customer account or the Databricks account? How can I access the bucket, and can I see it directly in S3 or ADLS?

Latest Reply

sajith_appukutt
Databricks Employee

06-23-2021 7:28:11 PM

0 kudos

I didn't get the reference to installing a bucket. Did you mean configuring a workspace with a root bucket? If so, you have probably gathered that the root storage for a workspace resides in the customer's account.

by Ryan_Chynoweth • Databricks Employee

06-04-2021 11:42:45 AM

5886 Views
2 replies
1 kudos

What advantage is there to Databricks caching and Spark caching?

Latest Reply

User16783853906
Databricks Employee

06-23-2021 6:37:33 PM

1 kudos

Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the sa...

by User16783853501 • Databricks Employee

06-23-2021 2:36:44 PM

4744 Views
1 replies
1 kudos

Converting data that is in Delta format to plain parquet format

There is often a need to convert tables from Delta format to plain Parquet format, for a number of reasons. What is the best way to do that?

Latest Reply

User16826994223
Databricks Employee

06-23-2021 6:11:30 PM

1 kudos

You can easily convert a Delta table back to a Parquet table using the following steps: If you have performed Delta Lake operations that can change the data files (for example, delete or merge), run vacuum with retention of 0 hours to delete all data f...

by User16783853906 • Databricks Employee

06-23-2021 6:02:29 PM

6260 Views
1 replies
0 kudos

Metaexception [Version information not found in metastore] during cluster [re]start

Trying to configure new external metastore and running into the following exception during cluster initialization - Caused by: MetaException(message:Version information not found in metastore. ) at org.apache.hadoop.hive.metastore.RetryingHMSHandl...

Latest Reply

User16783853906
Databricks Employee

06-23-2021 6:08:12 PM

0 kudos

The above exception happens when the Hive schema is not available in the metastore instance. Please check your init scripts to make sure the following flag is enabled to create the Hive schema and tables if they are not already present: datanucleus.autoCreateA...

by brickster_2018 • Databricks Employee

06-23-2021 5:17:17 PM

2127 Views
1 replies
0 kudos

Resolved! How to get the DBR version details on a HC Cluster

Latest Reply

brickster_2018
Databricks Employee

06-23-2021 5:18:16 PM

0 kudos

The code snippet below can be used to get the DBR details on an HC cluster:

print("hadoopVersion:" + sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
print("baseVersion:" + sc._gateway.jvm.org.apache.spark.BuildInfo.sparkBranch())
print(...

by aladda • Databricks Employee

06-18-2021 11:48:38 AM

1508 Views
1 replies
0 kudos

Can Databricks notebooks be hosted in S3 or object storage?

Latest Reply

brickster_2018
Databricks Employee

06-23-2021 5:09:28 PM

0 kudos

Databricks notebooks can be exported and stored in S3 or any other object storage. The internal storage of Databricks notebooks cannot be changed or configured. The implementation is internal to the Databricks control plane and is not user configurable.

by brickster_2018 • Databricks Employee

06-23-2021 4:51:48 PM

4108 Views
1 replies
0 kudos

Resolved! How to get the modification time of files from a notebook command?

Latest Reply

brickster_2018
Databricks Employee

06-23-2021 4:53:07 PM

0 kudos

The code snippet below is useful for getting the modification time of files.

%scala
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.io.IOExcep...
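For comparison, here is a minimal Python sketch of the same idea. It is not a port of the truncated Scala snippet: it uses the standard library instead of the Hadoop FileSystem API, and it assumes the files are reachable through a local filesystem path (on Databricks that could be a path under the `/dbfs` mount, which is an assumption about your setup). The function name `modification_times` is hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone
from typing import Dict


def modification_times(directory: str) -> Dict[str, datetime]:
    """Return {file path: last modification time in UTC} for a directory tree."""
    result = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            mtime = os.stat(path).st_mtime  # seconds since the epoch
            result[path] = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return result


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "example.txt"), "w") as handle:
        handle.write("hello")
    for path, modified in modification_times(demo_dir).items():
        print(path, modified.isoformat())
```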

by brickster_2018 • Databricks Employee

06-23-2021 4:42:59 PM

7730 Views
2 replies
0 kudos

Resolved! How and when to capture the thread dump of the Spark driver?

What is the best way to capture a thread dump of the Spark driver process? Also, when should I capture the thread dump?

Latest Reply

brickster_2018
Databricks Employee

06-23-2021 4:48:15 PM

0 kudos

For the Spark driver the process is the same. Choose the driver from the Executors page and view the thread dump. A thread dump is a footprint of the JVM; it is very useful for debugging issues where the JVM process is stuck or making extremely slow p...

by brickster_2018 • Databricks Employee

06-23-2021 4:25:32 PM

3306 Views
2 replies
0 kudos

Resolved! Autoloader: How to identify the backlog in RocksDB

With S3-SQS it was easier to identify the backlog (the messages that are fetched from SQS but not yet consumed by the streaming job). How can I find the same with Auto Loader?

Latest Reply

brickster_2018
Databricks Employee

06-23-2021 4:29:42 PM

0 kudos

For DBR 8.2 or later, the backlog details are captured in the streaming metrics. E.g.:
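A minimal sketch of reading those metrics from a streaming progress record, assuming the Auto Loader source reports `numFilesOutstanding` and `numBytesOutstanding` under `sources[].metrics` (my understanding of the Auto Loader documentation, not something stated in this thread). The helper `autoloader_backlog` and the sample values are made up for illustration.

```python
from typing import Dict, List


def autoloader_backlog(progress: dict) -> List[Dict[str, object]]:
    """Extract the outstanding file and byte counts for each source in a progress record."""
    backlog = []
    for source in progress.get("sources", []):
        metrics = source.get("metrics") or {}
        if "numFilesOutstanding" in metrics:
            backlog.append({
                "source": source.get("description", ""),
                # The metric values arrive as strings, so convert them.
                "files": int(metrics["numFilesOutstanding"]),
                "bytes": int(metrics.get("numBytesOutstanding", 0)),
            })
    return backlog


# Illustrative progress record; on a live stream this would come from
# `query.lastProgress` on a PySpark StreamingQuery.
sample_progress = {
    "sources": [
        {
            "description": "CloudFilesSource[/path/to/source]",
            "metrics": {
                "numFilesOutstanding": "238",
                "numBytesOutstanding": "163939124006",
            },
        }
    ]
}

for entry in autoloader_backlog(sample_progress):
    print(entry["source"], entry["files"], entry["bytes"])
```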

by cgrant • Databricks Employee

06-08-2021 3:35:20 PM

3780 Views
1 replies
0 kudos

Resolved! How to ensure that a Databricks Run Submit run invoked from Airflow only runs one time?

I am running jobs on Databricks using the Run Submit API with Airflow. I have noticed that, on rare occasions, a particular run executes more than once at the same time. Why?

Latest Reply

brickster_2018
Databricks Employee

06-23-2021 4:17:27 PM

0 kudos

Idempotency can be ensured by providing an idempotency token. It is easy to pass one through the REST API, as described in this doc: https://kb.databricks.com/jobs/jobs-idempotency.html The primary reason for multiple runs is the client submits t...
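One way to sketch this: derive the token deterministically from identifiers that stay the same if the submission is repeated, and include it as `idempotency_token` in the Run Submit request body. The helper names, the choice of Airflow identifiers, and the cluster ID and notebook path below are assumptions for illustration; the request itself is not sent here.

```python
import uuid


def idempotency_token(dag_id: str, task_id: str, logical_date: str) -> str:
    """Derive a stable token so a repeated submission reuses the same value."""
    key = f"{dag_id}/{task_id}/{logical_date}"
    # uuid5 is deterministic: the same key always yields the same 36-character token.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def build_run_submit_payload(run_name: str, notebook_path: str,
                             cluster_id: str, token: str) -> dict:
    """Build a Run Submit request body that carries the idempotency token."""
    return {
        "run_name": run_name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
        "idempotency_token": token,
    }


# Placeholder values throughout.
token = idempotency_token("daily_etl", "load_sales", "2021-06-23T00:00:00")
payload = build_run_submit_payload(
    "daily_etl.load_sales", "/Shared/load_sales", "0000-000000-example", token
)
print(payload)
```

A random token generated per attempt would defeat the purpose, which is why the sketch hashes stable identifiers instead.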

Databricks Community
