Data Engineering

Forum Posts

User16869510359
by Esteemed Contributor
  • 634 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

When using a MERGE INTO statement, if the source data to be merged into the target Delta table is small enough to fit in the memory of the worker nodes, it makes sense to broadcast the source data. By doing so, the execution can avoid the...

User16869510359
by Esteemed Contributor
  • 1889 Views
  • 1 replies
  • 0 kudos

Resolved! Can Spark JDBC create duplicate records

Is it transaction safe? Does it ensure atomicity?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Atomicity is ensured at the task level, not at the stage level. If the stage is retried for any reason, the tasks that already completed the write operation will re-run and cause duplicate records. This is expected by design. When Apache Spa...
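
The retry behavior described above can be illustrated with a toy simulation. This is not Spark code: the list `sink` stands in for a database table written without transactions, and the `write_partition` and `run_stage` helpers are hypothetical names made up for this sketch.

```python
# Toy simulation (not Spark code) of a stage retry against a
# sink that appends rows without any transaction spanning the stage.

def write_partition(sink, partition):
    """One 'task': append every row of its partition to the sink."""
    sink.extend(partition)


def run_stage(sink, partitions, fail_after=None):
    """Run one task per partition; fail once `fail_after` tasks have finished."""
    for index, partition in enumerate(partitions):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("stage failed")
        write_partition(sink, partition)


partitions = [[1, 2], [3, 4], [5, 6]]
sink = []

try:
    # First attempt: tasks 0 and 1 write their rows, then the stage fails.
    run_stage(sink, partitions, fail_after=2)
except RuntimeError:
    # Retry: every task runs again, including the two that already wrote.
    run_stage(sink, partitions)

duplicates = sorted({row for row in sink if sink.count(row) > 1})
print(sink)        # rows 1 to 4 appear twice
print(duplicates)
```

Each individual task either writes its rows or does not, but nothing rolls back the rows written by tasks that finished before the failure, so the retry writes them a second time.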

User16869510359
by Esteemed Contributor
  • 956 Views
  • 1 replies
  • 1 kudos

Resolved! What is the best practice for deleting all the data from a Delta table

I have a use case where I need to delete the data completely and load new data to the existing Delta table. 

Latest Reply
User16869510359
Esteemed Contributor
  • 1 kudos

It's recommended to use the overwrite option. Overwrite the table data and run a VACUUM command. To delete the data from a managed Delta table, the DROP TABLE command can be used. If it's an external table, then run a DELETE query on the table and th...

User16765131552
by Contributor III
  • 758 Views
  • 1 replies
  • 0 kudos

Resolved! Cluster Log Partitioning

Customer wants to understand our strategy for breaking cluster logs into different partitions and files. They want to be able to ingest these logs into a tool that needs to understand this. They have indicated that the logs used to all be in one file...

Latest Reply
User16765131552
Contributor III
  • 0 kudos

Log files are rolled over by time/size criteria.

User16765131552
by Contributor III
  • 276 Views
  • 0 replies
  • 0 kudos

docs.databricks.com

Best practices for Databricks pools - Databricks Documentation
Learn best practices for configuring and using Databricks pools.
https://docs.databricks.com/clusters/instance-pools/pool-best-practices.html
Best practices for Azure Databricks pools - Azur...

User16869510359
by Esteemed Contributor
  • 769 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Pre-emption is turned on by default on Databricks clusters. Turning pre-emption on or off makes more sense on a high concurrency cluster. Pre-emption ensures that a job starving for resources gets a fair share of the resources available on ...

User16869510359
by Esteemed Contributor
  • 1265 Views
  • 1 replies
  • 0 kudos

Resolved! How to uninstall libraries that are set to auto-install on all clusters using the REST API

I have a bunch of libraries that I want to uninstall. All of them are marked as auto-install.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

1) Find the corresponding library definition from an existing cluster using "libraries/cluster-status?cluster_id=<cluster_id>".
$ curl -X GET 'https://cust-success.cloud.databricks.com/api/2.0/libraries/cluster-status?cluster_id=1226-232931-cuffs129' ...

User16765131552
by Contributor III
  • 889 Views
  • 1 replies
  • 1 kudos

Resolved! Saving Files Location

If someone saves a flat file from a cell without specifying any location, where does it save?

Latest Reply
User16765131552
Contributor III
  • 1 kudos

In this case they are writing to a directory on the driver.
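
The screenshot from the question is not available, so the following is only a sketch under the assumption that the cell used a plain Python file write with a relative path. In that case the path is resolved against the working directory of the process running the cell, which is the driver. The file name is hypothetical.

```python
import os

# A relative path is resolved against the current working directory
# of the process that executes the cell (the driver process).
relative_name = "example_output.txt"  # hypothetical file name
with open(relative_name, "w") as handle:
    handle.write("hello\n")

resolved_path = os.path.abspath(relative_name)
print(os.getcwd())      # the directory the file was written to
print(resolved_path)    # the full path of the file on the driver's local disk

os.remove(relative_name)  # clean up the example file
```

Printing `os.getcwd()` in a notebook cell is a quick way to check where such files land before looking for them.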

User16869510359
by Esteemed Contributor
  • 2065 Views
  • 1 replies
  • 0 kudos

Resolved! Super slow SQL queries on an HC cluster

I have a high concurrency cluster that multiple users are running queries on. However, the queries are running very slowly. I debugged the logs and see that more time is spent on the Spark driver. In the Spark UI, I do not see any slowness.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

It's possible that connectivity to the Hive metastore is causing the delay here, particularly when there is a high degree of concurrency and contention for metastore access. Interactive clusters in DBR are configured to use up to 5 (spark.databricks.hive.metastore.cli...

User16826994223
by Honored Contributor III
  • 646 Views
  • 1 replies
  • 0 kudos

Resolved! Versioning of a Delta table while writing from a Structured Streaming job

Does writing to a Delta table create a version for every micro-batch of the stream?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

Yes, that is correct. Every commit to a Delta table creates a version, so each micro-batch creates a version. More info: https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
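
One way to see this is to look at the table's transaction log, where each commit is stored as a zero-padded, 20-digit JSON file under `_delta_log`. The sketch below is a minimal illustration that assumes the table directory is reachable as a local path; `list_versions` is a hypothetical helper, and the demo builds a stand-in directory with empty files instead of a real table. It only sees commit files that are still present in the log.

```python
import os
import re
import tempfile

# Commit files in a Delta log are named <20-digit version>.json.
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_versions(table_path):
    """Return the sorted version numbers found in <table_path>/_delta_log."""
    log_dir = os.path.join(table_path, "_delta_log")
    versions = []
    for name in os.listdir(log_dir):
        match = COMMIT_FILE.match(name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


# Demo on a stand-in directory: three empty commit files, as if
# three micro-batches had each committed once.
with tempfile.TemporaryDirectory() as table_path:
    os.makedirs(os.path.join(table_path, "_delta_log"))
    for version in range(3):
        commit_name = f"{version:020d}.json"
        open(os.path.join(table_path, "_delta_log", commit_name), "w").close()
    versions = list_versions(table_path)

print(versions)
```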

User16826994223
by Honored Contributor III
  • 1336 Views
  • 1 replies
  • 1 kudos

Spark DataFrame Parquet vs. Delta: row counts don't match

I have data written in Delta format on ADLS. As I understand it, Delta also stores its files internally in Parquet format, but when I read the data in the two formats I get different record counts with spark.read.parquet() and spark.read.format('delta').load().
df = spark.read.for...

Latest Reply
User16826994223
Honored Contributor III
  • 1 kudos

I think you have written to the Delta table twice using overwrite mode. Delta is a versioned data format: when you use overwrite, it doesn't delete the previous data; it just writes new files. The old files are not deleted immediately; they are just marked as delete...
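
The mismatch can be sketched by replaying the transaction log. Each JSON commit under `_delta_log` holds one action per line, and `add` and `remove` actions carry the `path` of a data file. The example below is a simplified illustration, not the Delta reader: it assumes a locally reachable table path, ignores checkpoints, and uses a hypothetical `active_files` helper on a stand-in table built from empty files.

```python
import json
import os
import tempfile


def active_files(table_path):
    """Replay add/remove actions from the JSON commits; return the live data files."""
    log_dir = os.path.join(table_path, "_delta_log")
    live = set()
    # Zero-padded names mean lexical order equals commit order.
    for name in sorted(n for n in os.listdir(log_dir) if n.endswith(".json")):
        with open(os.path.join(log_dir, name)) as handle:
            for line in handle:
                if not line.strip():
                    continue
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


# Stand-in table: version 0 writes part-0, version 1 overwrites it with part-1.
with tempfile.TemporaryDirectory() as table_path:
    log_dir = os.path.join(table_path, "_delta_log")
    os.makedirs(log_dir)
    commits = [
        [{"add": {"path": "part-0.parquet"}}],
        [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-1.parquet"}}],
    ]
    for version, actions in enumerate(commits):
        with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as handle:
            handle.write("\n".join(json.dumps(action) for action in actions))
    for name in ("part-0.parquet", "part-1.parquet"):
        open(os.path.join(table_path, name), "w").close()

    on_disk = {n for n in os.listdir(table_path) if n.endswith(".parquet")}
    live = active_files(table_path)

print(sorted(on_disk))  # what a plain Parquet read of the directory would pick up
print(sorted(live))     # what the current table version refers to
```

A plain Parquet read sees every file still on disk, including the ones the log has marked as removed, which is why its count can be higher than the Delta read.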

MoJaMa
by Valued Contributor II
  • 1593 Views
  • 3 replies
  • 1 kudos
Latest Reply
User16783853906
Contributor III
  • 1 kudos

Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply. Please refer here for more information: https://docs.databricks.com/clusters/instance-pools/index.html

2 More Replies