- 3144 Views
- 1 replies
- 0 kudos
Trying to configure a new external metastore and running into the following exception during cluster initialization - Caused by: MetaException(message:Version information not found in metastore. )
at org.apache.hadoop.hive.metastore.RetryingHMSHandl...
Latest Reply
The above exception happens when the Hive schema is not available in the metastore instance. Please check your init scripts to make sure the following flag is enabled, so that the Hive schema and tables are created if not already present. datanucleus.autoCreateA...
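As a rough illustration, an init-script fragment along these lines could set the auto-create flags for an external metastore. The property names vary by Hive version (`datanucleus.autoCreateSchema` on Hive 1.x, `datanucleus.schema.autoCreateAll` on 2.x+), and the conf file path below is an assumption, not a fixed contract:

```shell
# Hypothetical init-script fragment: ask DataNucleus to create the Hive
# schema in the external metastore when it is missing.
# Property names depend on the Hive version:
#   Hive 1.x : datanucleus.autoCreateSchema
#   Hive 2.x+: datanucleus.schema.autoCreateAll
cat >> /databricks/driver/conf/00-hive-metastore.conf <<'EOF'
[driver] {
  "spark.hadoop.datanucleus.autoCreateSchema" = "true"
  "spark.hadoop.datanucleus.fixedDatastore" = "false"
}
EOF
```

Verify the exact property names against the external-metastore docs for your Hive version before relying on this.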
- 3477 Views
- 2 replies
- 0 kudos
What is the best way to capture a thread dump of the Spark driver process? Also, when should I capture the thread dump?
Latest Reply
For the Spark driver the process is the same: choose the driver from the Executors page and view the thread dump. A thread dump is a snapshot of the JVM's threads; thread dumps are very useful in debugging issues where the JVM process is stuck or making extremely slow p...
1 More Replies
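If you have shell access to the driver node (e.g. via the web terminal), a dump can also be captured with the standard JDK tools. This is a sketch; the `grep` pattern for the driver JVM is an assumption and may need adjusting:

```shell
# On the driver node: locate the driver JVM and dump its threads.
# jps/jstack ship with the JDK; the "driver" match is illustrative.
DRIVER_PID=$(jps -l | grep -i driver | awk '{print $1}')
jstack "$DRIVER_PID" > /tmp/driver-thread-dump-$(date +%s).txt
# Take 3-4 dumps a few seconds apart: if the same threads show the same
# stacks in every dump, the process is stuck rather than slow.
```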
- 990 Views
- 2 replies
- 0 kudos
With S3-SQS it was easier to identify the backlog (the messages that are fetched from SQS and not yet consumed by the streaming job). How can I find the same with Auto Loader?
Latest Reply
For DBR 8.2 or later, the backlog details are captured in the streaming metrics, e.g.:
1 More Replies
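For illustration, the backlog can be read off a `StreamingQuery.lastProgress` payload. The metric names below (`numFilesOutstanding`, `numBytesOutstanding`) are the ones DBR 8.2+ is said to report for the `cloudFiles` source; treat them as assumptions and check your own progress output:

```python
# Sketch: extract the Auto Loader backlog from a lastProgress-style dict.
# Metric names are assumptions based on the DBR 8.2+ cloudFiles source.
def autoloader_backlog(progress: dict) -> dict:
    """Return outstanding file/byte counts from a streaming progress payload."""
    backlog = {}
    for source in progress.get("sources", []):
        for key in ("numFilesOutstanding", "numBytesOutstanding"):
            if key in source.get("metrics", {}):
                backlog[key] = int(source["metrics"][key])
    return backlog

# Example shape of query.lastProgress for an Auto Loader source:
sample = {"sources": [{"description": "CloudFilesSource[...]",
                       "metrics": {"numFilesOutstanding": "123",
                                   "numBytesOutstanding": "456789"}}]}
print(autoloader_backlog(sample))
# {'numFilesOutstanding': 123, 'numBytesOutstanding': 456789}
```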
- 1021 Views
- 1 replies
- 0 kudos
I am running jobs on Databricks using the Run Submit API with Airflow. I have noticed that, rarely, a particular run is executed more than once at the same time. Why?
Latest Reply
Idempotency can be ensured by providing an idempotency token. It's easy to pass one through the REST API, as described in this doc: https://kb.databricks.com/jobs/jobs-idempotency.html The primary reason for multiple runs is that the client submits t...
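A minimal sketch of a Runs Submit payload carrying the token: the `idempotency_token` field is from the Jobs API, while the run name, cluster spec, and notebook path below are made-up placeholders:

```python
import json

# Sketch: a runs/submit payload with an idempotency token, so a client
# retry (e.g. from Airflow) cannot start a duplicate run. Cluster and
# notebook values are illustrative placeholders.
def build_submit_payload(token: str) -> dict:
    return {
        "run_name": "airflow-task",                 # placeholder
        "idempotency_token": token,                 # dedupe key for retries
        "new_cluster": {"spark_version": "9.1.x-scala2.12",
                        "node_type_id": "i3.xlarge",
                        "num_workers": 2},
        "notebook_task": {"notebook_path": "/Jobs/example"},
    }

payload = build_submit_payload("dag-run-2021-10-01T00:00:00")
print(json.dumps(payload, indent=2))
# POST this to /api/2.1/jobs/runs/submit; resubmitting with the same token
# returns the existing run instead of launching a second one.
```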
- 635 Views
- 1 replies
- 0 kudos
How often should I run VACUUM commands? Will running the VACUUM command on a Delta table improve my read/write performance, or is it just for the storage benefits?
Latest Reply
VACUUM removes uncommitted/stale files from storage. The primary benefit is saving storage cost. Ideally, running VACUUM should not show any performance improvement, as Delta does not list the storage directories but rather accesses the files di...
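A small sketch of composing a VACUUM statement with an explicit retention window. Delta's default retention is 7 days (168 hours), and going below that requires disabling a safety check; the table name here is illustrative:

```python
# Sketch: build a VACUUM statement with an explicit retention window.
# Delta's default retention is 7 days (168 hours); retaining less risks
# deleting files still needed by in-flight readers or time travel, so
# Delta refuses unless the retention-duration check is disabled.
DEFAULT_RETENTION_HOURS = 168  # 7 days

def vacuum_sql(table: str, retain_hours: int = DEFAULT_RETENTION_HOURS) -> str:
    if retain_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError("retention below 168h requires disabling the safety check")
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"

print(vacuum_sql("events"))  # VACUUM events RETAIN 168 HOURS
# On a cluster: spark.sql(vacuum_sql("events"))
```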
- 3373 Views
- 1 replies
- 2 kudos
I am moving my Spark workloads from an EMR/on-premises Spark cluster to Databricks. I understand Databricks Spark is different from YARN. How is the Databricks architecture different from YARN?
Latest Reply
Users often compare a Databricks cluster to a YARN cluster. It's not an apples-to-apples comparison. A Databricks cluster should be compared to a Spark application that is submitted on YARN. A Spark application on YARN will have a driver container and exe...
- 697 Views
- 1 replies
- 1 kudos
If I run OPTIMIZE on a Delta Lake table, will it prevent me from time travelling to a version before OPTIMIZE was run?
Latest Reply
No, you will still be able to time travel to versions previous to the OPTIMIZE command. OPTIMIZE is just another transaction like MERGE, UPDATE, etc. Check out these docs to learn more about retention periods and the VACUUM command.
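To illustrate with a mocked-up DESCRIBE HISTORY-style log (the entries below are invented): OPTIMIZE appears as one more commit, so every earlier version remains a valid time-travel target until VACUUM removes its files:

```python
# Sketch: OPTIMIZE is just another commit in the Delta log, so versions
# written before it stay addressable. History entries are made up.
history = [
    {"version": 3, "operation": "OPTIMIZE"},
    {"version": 2, "operation": "MERGE"},
    {"version": 1, "operation": "UPDATE"},
    {"version": 0, "operation": "WRITE"},
]

optimize_version = next(e["version"] for e in history
                        if e["operation"] == "OPTIMIZE")
pre_optimize = [e["version"] for e in history
                if e["version"] < optimize_version]
print(pre_optimize)  # [2, 1, 0] -- all still readable
# e.g. spark.read.format("delta").option("versionAsOf", 2).load(path)
```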
- 1125 Views
- 1 replies
- 0 kudos
Looking to pull a list of all the users in their workspaces (including the ones who have never done anything); is there a way to do that? This is for AWS.
Latest Reply
You can use the SCIM APIs. Endpoint: https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#get-users Or you can use the Workspace API. The Workspace API does not have a direct list-users command, but you can use the Workspace API to l...
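A minimal sketch of calling the SCIM Get Users endpoint with only the standard library. The endpoint path is from the SCIM docs linked above; the host and token are placeholders you must supply:

```python
import urllib.request

# Sketch: list workspace users via the SCIM API. HOST/TOKEN are
# placeholders; the /Users endpoint returns all users, active or not.
HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder

def scim_users_request(host: str, token: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"{host}/api/2.0/preview/scim/v2/Users",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/scim+json"},
    )

req = scim_users_request(HOST, TOKEN)
print(req.full_url)
# To execute on a real workspace:
# with urllib.request.urlopen(req) as resp:
#     users = json.load(resp)["Resources"]
```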
- 5835 Views
- 2 replies
- 1 kudos
Is there an upper limit or recommended maximum for the number of columns in a Delta table?
Latest Reply
Original answer posted by @Gray Gwizdz. This was a fun question to try and find the answer to, thank you for that! I reviewed some of the most recent issues/bugs reported with Delta Lake and was able to find a similar issue where a user was running i...
1 More Replies
- 1099 Views
- 0 replies
- 1 kudos
Databricks Auto Loader is a popular mechanism for ingesting data/files from cloud storage into Delta. For a very high-throughput source, what are the best practices to follow while scaling up an Auto Loader-based pipeline to the tune of millions ...
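This question has no reply yet, but as a hedged starting point, these are `cloudFiles` options commonly tuned at high file volumes. The option names are as I recall them from the Auto Loader docs and should be verified against your DBR version:

```python
# Sketch: options often tuned when scaling an Auto Loader stream to very
# high file counts. Option names are assumptions; check your DBR docs.
autoloader_options = {
    "cloudFiles.format": "json",              # source file format
    "cloudFiles.useNotifications": "true",    # file-notification mode:
                                              # avoids re-listing millions
                                              # of objects per micro-batch
    "cloudFiles.maxFilesPerTrigger": "1000",  # bound micro-batch size
}
print(sorted(autoloader_options))
# On a cluster:
# df = (spark.readStream.format("cloudFiles")
#         .options(**autoloader_options)
#         .load("s3://bucket/path"))
```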
- 960 Views
- 3 replies
- 0 kudos
I have single-threaded Pandas code that is both not yet supported by Koalas and not easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code. Is this possible?
Latest Reply
This is for a specific scenario where the code is not yet supported by Koalas. One approach to consider is using a Pandas UDF, and splitting up the work in a way that allows your processing to move forward. This notebook is a great example of taking ...
2 More Replies
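As a sketch of the pattern: wrap the existing single-threaded pandas logic in a per-group function, so Spark can run one copy per group via `groupBy(...).applyInPandas`. The column names and the stand-in "unsupported" logic below are invented for illustration:

```python
import pandas as pd

# Sketch: per-group pandas function usable both locally and with Spark's
# applyInPandas. The cumsum stands in for arbitrary pandas-only logic.
def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    out = pdf.copy()
    out["value"] = out["value"].cumsum()  # your single-threaded code here
    return out

# Locally it is plain pandas:
local = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 10]})
result = local.groupby("key", group_keys=False).apply(per_group)
print(result["value"].tolist())  # [1, 3, 10]

# On Databricks the same function distributes one group per task:
# sdf.groupBy("key").applyInPandas(per_group,
#                                  schema="key string, value long")
```

The key design constraint is that each group must fit in one executor's memory, since the whole group is handed to pandas at once.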
- 3693 Views
- 2 replies
- 0 kudos
When is it recommended to use Trigger.once mode compared to fixed processing intervals with micro batches?
Latest Reply
Also note that configurations like maxFilesPerTrigger and maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly lower throughput can switch to Trigger.Once to avoid the continuous execution of the job checking the availab...
1 More Replies
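A back-of-the-envelope sketch of why low-throughput streams benefit: a continuously running query with a 1-minute processing interval polls the source about 1440 times a day even when nothing arrives, while the same job on Trigger.Once scheduled hourly (an assumed schedule) runs 24 times and frees the cluster in between:

```python
# Arithmetic sketch: source polls per day, continuous vs Trigger.Once.
interval_s = 60                                   # 1-minute micro-batches
polls_per_day_continuous = 24 * 3600 // interval_s
runs_per_day_trigger_once = 24                    # e.g. hourly schedule

print(polls_per_day_continuous, runs_per_day_trigger_once)  # 1440 24
# .trigger(once=True) on the writeStream, plus a job schedule, replaces
# the always-on cluster for low-volume sources.
```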