Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

User16826994223
by Honored Contributor III
  • 830 Views
  • 1 replies
  • 0 kudos

Does the DBFS root reside in the customer account or the Databricks account?

When I set up a workspace, I see that a root bucket is created with it. Does this bucket reside in the customer account or the Databricks account? How can I access the bucket, and can I see it directly in S3 or ADLS?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

I didn't get the reference to installing a bucket. Did you mean you configured a workspace with a root bucket? If so, you have probably gathered that the root storage for a workspace resides in the customer's account.

Ryan_Chynoweth
by Honored Contributor III
  • 2042 Views
  • 2 replies
  • 1 kudos
Latest Reply
User16783853906
Contributor III
  • 1 kudos

Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the sa...

1 More Replies
User16783853501
by New Contributor II
  • 1066 Views
  • 1 replies
  • 1 kudos

Converting data that is in Delta format to plain Parquet format

There is often a need to convert Delta tables from Delta format to plain Parquet format, for a number of reasons. What is the best way to do that?

Latest Reply
User16826994223
Honored Contributor III
  • 1 kudos

You can easily convert a Delta table back to a Parquet table using the following steps: If you have performed Delta Lake operations that can change the data files (for example, delete or merge), run vacuum with retention of 0 hours to delete all data f...
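
The vacuum step mentioned in this reply could be sketched roughly as below. This is only an illustration: the helper name `vacuum_statements` and the table name are made up, and it assumes the retention safety check (`spark.databricks.delta.retentionDurationCheck.enabled`) has to be switched off before a 0-hour retention is accepted. The reply is cut off after this step, so the remaining steps are not covered here.

```python
from typing import List


def vacuum_statements(table_name: str) -> List[str]:
    """Build the SQL for a zero-retention VACUUM (hypothetical helper)."""
    return [
        # Assumption: the retention safety check must be disabled first.
        "SET spark.databricks.delta.retentionDurationCheck.enabled = false",
        # Remove data files that do not belong to the latest table version.
        f"VACUUM {table_name} RETAIN 0 HOURS",
    ]


# "my_db.my_table" is a placeholder table name.
statements = vacuum_statements("my_db.my_table")
for statement in statements:
    # In a notebook, each statement would be passed to spark.sql(statement).
    print(statement)
```

A zero-hour retention removes the files that older versions depend on, so time travel to those versions should be expected to stop working afterwards.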

User16783853906
by Contributor III
  • 3557 Views
  • 1 replies
  • 0 kudos

MetaException [Version information not found in metastore] during cluster [re]start

Trying to configure a new external metastore and running into the following exception during cluster initialization: Caused by: MetaException(message:Version information not found in metastore. )   at org.apache.hadoop.hive.metastore.RetryingHMSHandl...

Latest Reply
User16783853906
Contributor III
  • 0 kudos

The above exception happens when the Hive schema is not available in the metastore instance. Check your init scripts to make sure the following flag is enabled, so that the Hive schema and tables are created if they are not already present: datanucleus.autoCreateA...

brickster_2018
by Esteemed Contributor
  • 983 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

The code snippet below can be used to get the DBR details on an HC cluster:
print("hadoopVersion:" + sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
print("baseVersion:" + sc._gateway.jvm.org.apache.spark.BuildInfo.sparkBranch())
print(...

aladda
by Honored Contributor II
  • 777 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

Databricks notebooks can be exported and stored in S3 or any other object storage. The internal storage of Databricks notebooks cannot be changed or configured. The implementation is internal to the Databricks control plane and is not user configurable.

brickster_2018
by Esteemed Contributor
  • 1628 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

The below code snippet is useful to get the modification time of files.
%scala
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.io.IOExcep...

brickster_2018
by Esteemed Contributor
  • 4145 Views
  • 2 replies
  • 0 kudos

Resolved! How and when to capture the thread dump of the Spark driver?

What is the best way to capture a thread dump of the Spark driver process? Also, when should I capture the thread dump?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

For the Spark driver the process is the same: choose the driver from the Executors page and view the thread dump. A thread dump is a footprint of the JVM; thread dumps are very useful in debugging issues where the JVM process is stuck or making extremely slow p...

1 More Replies
brickster_2018
by Esteemed Contributor
  • 1326 Views
  • 2 replies
  • 0 kudos

Resolved! Autoloader: How to identify the backlog in RocksDB

With S3-SQS it was easier to identify the backlog (the messages that are fetched from SQS but not yet consumed by the streaming job). How can I find the same with Auto Loader?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

For DBR 8.2 or later, the backlog details are captured in the streaming metrics. E.g.:
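
The original example did not survive in this listing, so here is a rough sketch of reading the backlog from a streaming progress event instead. It assumes the Auto Loader source reports `numFilesOutstanding` and `numBytesOutstanding` under each source's `metrics`, and that the values arrive as strings. The sample event, its path, and the helper name `autoloader_backlog` are invented for illustration.

```python
import json

# Hypothetical, trimmed progress event; the values are made up.
SAMPLE_PROGRESS = """
{
  "batchId": 12,
  "sources": [
    {
      "description": "CloudFilesSource[/mnt/landing/events]",
      "metrics": {"numFilesOutstanding": "238", "numBytesOutstanding": "163939124"}
    }
  ]
}
"""


def autoloader_backlog(progress_json: str) -> dict:
    """Sum outstanding files and bytes across all sources of one progress event."""
    progress = json.loads(progress_json)
    files = 0
    size = 0
    for source in progress.get("sources", []):
        metrics = source.get("metrics") or {}
        # Assumption: metric values are strings, so convert before summing.
        files += int(metrics.get("numFilesOutstanding", 0))
        size += int(metrics.get("numBytesOutstanding", 0))
    return {"files": files, "bytes": size}


backlog = autoloader_backlog(SAMPLE_PROGRESS)
print(backlog)
```

In a notebook, the same helper could probably be fed `json.dumps(query.lastProgress)` for a running streaming query named `query`.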

1 More Replies
User16783854657
by New Contributor III
  • 1297 Views
  • 1 replies
  • 0 kudos

Resolved! How to ensure that a Databricks Run Submit run invoked from Airflow only runs one time?

I am running jobs on Databricks using the Run Submit API from Airflow. I have noticed that, on rare occasions, a particular run is executed more than once at the same time. Why?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

Idempotency can be ensured by providing an idempotency token. It's easy to pass the token through the REST API, as described in this doc: https://kb.databricks.com/jobs/jobs-idempotency.html The primary reason for multiple runs is the client submits t...
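
One way to get a stable token from the Airflow side is to derive it from the task instance, so a retried submit carries the same value. The sketch below assumes the Run Submit request body accepts an `idempotency_token` field of at most 64 characters. The function names, the cluster ID placeholder, and the notebook path are hypothetical.

```python
import hashlib
import json


def idempotency_token(dag_id: str, task_id: str, logical_date: str) -> str:
    """Derive a stable token so retries of the same task instance reuse it."""
    key = f"{dag_id}:{task_id}:{logical_date}"
    # A SHA-256 hex digest is exactly 64 characters long.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def run_submit_payload(token: str) -> dict:
    """Build a Run Submit request body (cluster and notebook are placeholders)."""
    return {
        "run_name": "airflow-triggered-run",
        "idempotency_token": token,
        "existing_cluster_id": "<cluster-id>",
        "notebook_task": {"notebook_path": "/Shared/example"},
    }


token = idempotency_token("daily_etl", "submit_run", "2021-06-01T00:00:00+00:00")
payload = run_submit_payload(token)
print(json.dumps(payload, indent=2))
```

Because the token depends only on the DAG, task, and logical date, a client-side retry after a timeout sends the same token, which is what lets the service recognise the duplicate request.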

brickster_2018
by Esteemed Contributor
  • 885 Views
  • 1 replies
  • 0 kudos

Resolved! Performance improvement after running VACUUM commands

How often should I run VACUUM commands? Will running the VACUUM command on a Delta table improve my read/write performance, or is the benefit limited to storage?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

VACUUM removes uncommitted/stale files from storage. The primary benefit is saving storage cost. Ideally, running VACUUM should not show any performance improvement, as Delta does not list the storage directories but rather accesses the files di...

brickster_2018
by Esteemed Contributor
  • 3932 Views
  • 1 replies
  • 2 kudos

Resolved! Databricks Spark Vs Spark on Yarn

I am moving my Spark workloads from an EMR/on-premises Spark cluster to Databricks. I understand Databricks Spark is different from YARN. How is the Databricks architecture different from YARN?

Latest Reply
brickster_2018
Esteemed Contributor
  • 2 kudos

Users often compare a Databricks cluster with a YARN cluster. It's not an apples-to-apples comparison. A Databricks cluster should be compared to a Spark application that is submitted on YARN. A Spark application on YARN will have a driver container and exe...

User16783854657
by New Contributor III
  • 999 Views
  • 1 replies
  • 1 kudos

Does running OPTIMIZE on a Delta table destroy the transaction history of the table?

If I run OPTIMIZE on a Delta Lake table, will it prevent me from time travelling to a version before OPTIMIZE was run?

Latest Reply
User16783854657
New Contributor III
  • 1 kudos

No, you will still be able to time travel to versions previous to the OPTIMIZE command. OPTIMIZE is just another transaction like MERGE, UPDATE, etc. Check out these docs to learn more about retention periods and the VACUUM command.

User16826987838
by Contributor
  • 1441 Views
  • 1 replies
  • 0 kudos

How do I find the users in workspaces?

I am looking to pull a list of all the users in our workspaces (including the ones who have never done anything). Is there a way to do that? This is for AWS.

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

You can use the SCIM APIs. Endpoint: https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#get-users Or you can use the Workspace API. The Workspace API does not have a direct list-users command, but you can use the Workspace API to l...
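
For the first option, a minimal sketch of building the request and pulling user names out of the response might look like this. It assumes the users endpoint lives at `/api/2.0/preview/scim/v2/Users` and returns a SCIM list with a `Resources` array. The workspace host, the token, the sample response, and the function names are placeholders, and nothing is actually sent over the network here.

```python
import urllib.request


def build_list_users_request(host: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the SCIM users endpoint."""
    url = f"{host.rstrip('/')}/api/2.0/preview/scim/v2/Users"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/scim+json",
        },
        method="GET",
    )


def user_names(scim_response: dict) -> list:
    """Extract userName values from a SCIM list response."""
    return [user["userName"] for user in scim_response.get("Resources", [])]


# Hypothetical response body, trimmed to the fields used above.
SAMPLE_RESPONSE = {
    "totalResults": 2,
    "Resources": [
        {"id": "100", "userName": "alice@example.com"},
        {"id": "101", "userName": "bob@example.com"},
    ],
}

request = build_list_users_request("https://example.cloud.databricks.com/", "dapi-placeholder")
names = user_names(SAMPLE_RESPONSE)
print(request.full_url)
print(names)
```

Sending it would be a matter of `urllib.request.urlopen(request)` and decoding the JSON body. The caller typically needs admin rights in the workspace for this endpoint.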

User16783853906
by Contributor III
  • 7201 Views
  • 2 replies
  • 1 kudos

Resolved! Max Columns for Delta table

Is there an upper limit or recommended maximum for the number of columns in a Delta table?

Latest Reply
User16783853906
Contributor III
  • 1 kudos

Original answer posted by @Gray Gwizdz: This was a fun question to try and find the answer to! Thank you for that. I reviewed some of the most recent issues/bugs reported with Delta Lake and was able to find a similar issue where a user was running i...

1 More Replies