Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

User16826992666
by Valued Contributor
  • 1818 Views
  • 1 replies
  • 0 kudos

Resolved! If I create a clone of a Delta table, does it stay in sync with the original table?

Basically wondering what happens to the clone when updates are made to the original Delta table. Will the changes apply to the cloned table as well?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

The clone is not a replica, so updates made to the original Delta table would not be applied to the clone. However, shallow clones reference data files in the source directory. If you run VACUUM on the source table, clients will no longer be able t...
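
Since a clone does not follow the source, one way to bring it up to date is to run the clone command again. Below is a minimal sketch, assuming the Databricks SQL `CLONE` syntax; the helper `clone_statement` and the table names are hypothetical and only for illustration.

```python
def clone_statement(source: str, target: str, shallow: bool = True) -> str:
    """Build a Databricks SQL CLONE statement (hypothetical helper)."""
    kind = "SHALLOW" if shallow else "DEEP"
    # CREATE OR REPLACE lets the same statement be re-run to refresh the clone.
    return f"CREATE OR REPLACE TABLE {target} {kind} CLONE {source}"


# Table names below are made up for the example.
stmt = clone_statement("sales.orders", "sales.orders_dev")
print(stmt)

# On a Databricks cluster the statement would be executed with:
#   spark.sql(stmt)
```

Re-running the statement is an explicit refresh; nothing propagates from the source to the clone between runs.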

  • 0 kudos
User16826992666
by Valued Contributor
  • 1284 Views
  • 1 replies
  • 0 kudos

Resolved! I know my partitions are skewed, is there anything I can do to help my performance?

I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above, where AQE is enabled. AQE dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into ...
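
For reference, these are the Spark SQL settings that control AQE skew-join handling. This is a minimal sketch: the values shown are illustrative rather than tuned recommendations, and `apply_settings` is a hypothetical helper that expects an existing `SparkSession`.

```python
# Spark SQL configuration keys related to AQE skew-join handling.
# The values are illustrative; check the defaults of your runtime before changing them.
AQE_SKEW_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # A partition counts as skewed if it is this many times larger than the median...
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    # ...and also larger than this size threshold.
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def apply_settings(spark, settings=AQE_SKEW_SETTINGS):
    """Apply the settings to an existing SparkSession (hypothetical helper)."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

In a notebook, `apply_settings(spark)` would set these for the current session only.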

  • 0 kudos
User16826992666
by Valued Contributor
  • 1124 Views
  • 1 replies
  • 0 kudos

Resolved! Do I still need to use skew join hints if I have Adaptive Query Execution enabled?

From what I have read about AQE it seems to do a lot of what skew join hints did automatically. So should I still be using skew hints in my queries? Is there harm in using them?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

With AQE, Databricks has the most up-to-date, accurate statistics at the end of a query stage and can opt for a better physical strategy and/or do optimizations that used to require hints. In the case of skew join hints, it is recommended to rely on AQE...

  • 0 kudos
User15787040559
by Databricks Employee
  • 2088 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

In addition to subscription limits, the total capacity of clusters in each workspace is a function of the masks used for the workspace's enclosing Vnet and the pair of subnets associated with each cluster in the workspace. The masks can be changed if...

  • 0 kudos
User16826992666
by Valued Contributor
  • 4250 Views
  • 2 replies
  • 0 kudos

Resolved! Can multiple streams write to a Delta table at the same time?

Wondering if there are any dangers to doing this, and if it's a best practice. I'm concerned there could be conflicts but I'm not sure how Delta would handle it.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

> Can multiple streams write to a Delta table at the same time?
Yes, Delta uses optimistic concurrency control and configurable isolation levels.
> I'm concerned there could be conflicts but I'm not sure how Delta would handle it.
Write operations can resul...
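
To make the idea of optimistic concurrency control concrete, here is a toy model in plain Python. It is not Delta's implementation: `ToyLog` and `write_with_retry` are hypothetical names, and the toy treats any intervening commit as a conflict, whereas Delta's actual conflict rules depend on the operation and the isolation level.

```python
class ToyLog:
    """In-memory stand-in for a table's commit log (toy model only)."""

    def __init__(self):
        self.commits = []  # commit N lives at index N

    @property
    def version(self):
        return len(self.commits)

    def try_commit(self, read_version, payload):
        """Commit only if no one else has committed since read_version."""
        if read_version != self.version:
            return False  # conflict: the writer's snapshot is stale
        self.commits.append(payload)
        return True


def write_with_retry(log, payload, max_attempts=3):
    """Re-read the latest version and retry when a commit is rejected."""
    for _ in range(max_attempts):
        snapshot = log.version
        if log.try_commit(snapshot, payload):
            return True
    return False


log = ToyLog()
stale = log.version                       # writer A reads the current version
log.try_commit(log.version, "B-1")        # writer B commits first
print(log.try_commit(stale, "A-1"))       # A's commit against the stale version is rejected
print(write_with_retry(log, "A-1"))       # A retries against the latest version
```

The point of the sketch is only the shape of the protocol: writers do not take locks up front, they validate at commit time and retry on conflict.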

  • 0 kudos
1 More Replies
User16790091296
by Contributor II
  • 1271 Views
  • 1 replies
  • 0 kudos

What’s the best instance type to run OPTIMIZE (bin-packing and Z-Ordering) on?

I've been doing some research on optimizing data storage while implementing Delta; however, I'm not sure which instance type would be best for this.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

OPTIMIZE, as you alluded, has two operations: bin-packing and multi-dimensional clustering (Z-Ordering). Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect. Z-Ordering is not idempotent b...
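
For readers who have not run either operation, the two forms of the command differ only in the `ZORDER BY` clause. A minimal sketch, assuming the Databricks SQL `OPTIMIZE` syntax; `optimize_statement`, the table name, and the column names are hypothetical.

```python
def optimize_statement(table, zorder_columns=None):
    """Build an OPTIMIZE statement (hypothetical helper).

    Without columns this is plain bin-packing; with columns it adds Z-Ordering.
    """
    stmt = f"OPTIMIZE {table}"
    if zorder_columns:
        stmt += f" ZORDER BY ({', '.join(zorder_columns)})"
    return stmt


# Table and column names are made up for the example.
print(optimize_statement("events"))
print(optimize_statement("events", ["event_date", "user_id"]))

# On a Databricks cluster: spark.sql(optimize_statement("events"))
```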

  • 0 kudos
User16826994223
by Honored Contributor III
  • 1244 Views
  • 3 replies
  • 0 kudos

What is Auto Loader in Databricks?

I want to know what Auto Loader is and what its advantages are.

Latest Reply
MoJaMa
Databricks Employee
  • 0 kudos

The biggest advantage is the ease with which you can start ingesting data from your cloud storage directly into a Delta table. You can choose Directory Listing mode or File Notification mode, depending on what fits your use case best.
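
A minimal sketch of what such an ingest can look like, assuming the `cloudFiles` source and its `cloudFiles.format`, `cloudFiles.schemaLocation`, and `cloudFiles.useNotifications` options. The helper names and all paths are hypothetical, and `start_ingest` expects a `SparkSession` on a Databricks cluster.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build Auto Loader reader options (hypothetical helper).

    use_notifications=False selects Directory Listing mode,
    True selects File Notification mode.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def start_ingest(spark, source_path, target_path, checkpoint_path, options):
    """Stream new files from cloud storage into a Delta table."""
    stream = spark.readStream.format("cloudFiles").options(**options).load(source_path)
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(target_path)
    )


# Paths are made up for the example.
opts = auto_loader_options("json", "/mnt/schemas/events")
print(opts)
```

Switching modes is then a one-argument change, which keeps the rest of the pipeline identical while you decide which mode fits.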

  • 0 kudos
2 More Replies
MoJaMa
by Databricks Employee
  • 905 Views
  • 1 replies
  • 0 kudos
Latest Reply
MoJaMa
Databricks Employee
  • 0 kudos

Most possibly in the future, as we progress down our roadmap. Currently it is per-workspace, and only accessible in Databricks notebooks/jobs. Please refer to our docs: https://docs.databricks.com/applications/machine-learning/feature-store.html#known-limita...

  • 0 kudos
MoJaMa
by Databricks Employee
  • 972 Views
  • 1 replies
  • 0 kudos
Latest Reply
MoJaMa
Databricks Employee
  • 0 kudos

It's our new high-performance runtime, using a native vectorized engine developed in C++. Please see our blog for a great overview: https://databricks.com/blog/2021/06/17/announcing-photon-public-preview-the-next-generation-query-engine-on-the-databri...

  • 0 kudos
MoJaMa
by Databricks Employee
  • 834 Views
  • 1 replies
  • 0 kudos
Latest Reply
MoJaMa
Databricks Employee
  • 0 kudos

We used to require this, but starting June 9, 2021 we no longer do, and have improved our E2 security posture. See https://docs.databricks.com/administration-guide/account-api/iam-role.html for the current permissions required.

  • 0 kudos
MoJaMa
by Databricks Employee
  • 1042 Views
  • 1 replies
  • 1 kudos
Latest Reply
MoJaMa
Databricks Employee
  • 1 kudos

Use the calculator here: https://databricks.com/product/aws-pricing/instance-types
Open two windows side by side, pick Photon and non-Photon instances of the same type, and compare.

  • 1 kudos
sajith_appukutt
by Honored Contributor II
  • 1254 Views
  • 1 replies
  • 1 kudos

Resolved! Are there any ways to automatically clean up temporary files created in S3 by the Amazon Redshift connector?

The Amazon Redshift data source in Databricks seems to use S3 for storing intermediate results. Are there any ways to automatically clean up temporary files created in S3?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

You could use a storage lifecycle policy on the S3 bucket used for storing intermediate results and configure expiration actions. This way, temporary/intermediate results are cleaned up automatically.
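
A minimal sketch of such an expiration rule, assuming boto3 and its `put_bucket_lifecycle_configuration` call. The bucket name, prefix, rule ID, and retention period are hypothetical; pick a prefix that matches the temporary directory your Redshift connector writes to.

```python
def expiration_rule(prefix, days, rule_id="expire-redshift-tempdir"):
    """Build an S3 lifecycle configuration that expires objects under a prefix."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_rule(bucket, config):
    """Apply the configuration to a bucket.

    Note: this call replaces the bucket's existing lifecycle configuration.
    """
    import boto3  # imported lazily so the dict builder works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


# Prefix and retention are made up for the example.
config = expiration_rule("redshift-temp/", days=1)
print(config)
# apply_rule("my-temp-bucket", config)  # requires AWS credentials
```

Scoping the rule to a dedicated prefix keeps the expiration from touching anything else stored in the same bucket.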

  • 1 kudos
User16752246553
by New Contributor
  • 1119 Views
  • 1 replies
  • 1 kudos

How does Vectorized Pandas UDF work?

Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel? And is there a way to set the batch size?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

> How does Vectorized Pandas UDF work?
Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs): https://youtu.be/UZl0pHG-2HA?t=123. They use Apache Arrow to exchange data directly between JVM and Python driver/executors wit...
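
On the batch-size part of the question: a minimal sketch, assuming the Spark setting `spark.sql.execution.arrow.maxRecordsPerBatch` caps the number of rows per Arrow batch handed to the UDF. As far as I understand, the batches of one partition are processed one after another inside a task, while different partitions run in parallel across tasks. The helper names and the batch size are hypothetical.

```python
import math

ARROW_BATCH_CONF = "spark.sql.execution.arrow.maxRecordsPerBatch"


def batches_per_partition(rows, batch_size):
    """Rough count of Arrow batches a partition of `rows` rows is split into."""
    return math.ceil(rows / batch_size)


def make_plus_one_udf(spark, batch_size=5000):
    """Set the Arrow batch size and return a scalar Pandas UDF.

    Imports are local so the arithmetic helper above works without pyspark.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    spark.conf.set(ARROW_BATCH_CONF, str(batch_size))

    @pandas_udf("long")
    def plus_one(values: pd.Series) -> pd.Series:
        # Called once per Arrow batch, on a whole Series at a time.
        return values + 1

    return plus_one


print(batches_per_partition(12000, 5000))
```

With a `SparkSession` available, usage would look like `df.select(make_plus_one_udf(spark)("id"))`, where `df` and the `id` column are placeholders.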

  • 1 kudos
User16826992666
by Valued Contributor
  • 2310 Views
  • 1 replies
  • 0 kudos

Resolved! What is the difference between a trigger once stream and a normal one time write?

It seems to me like both of these would accomplish the same thing in the end. Do they use different mechanisms to accomplish it though? Are there any hidden costs to streaming to consider?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

The biggest reason to use the streaming API over the non-stream API would be to enable the checkpoint log to maintain a processing log. It is most common for people to use the trigger once when they want to only process the changes between executions...
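
A minimal sketch of the trigger-once pattern, assuming a Delta source and sink. The function name and all paths are hypothetical; the checkpoint location is what lets each run pick up only the data that arrived since the previous run.

```python
def run_incremental_batch(spark, source_path, target_path, checkpoint_path):
    """Process whatever is new since the last run, then stop."""
    stream = spark.readStream.format("delta").load(source_path)
    query = (
        stream.writeStream.format("delta")
        # The checkpoint records what has already been processed.
        .option("checkpointLocation", checkpoint_path)
        # Run once over the available data and shut down.
        # Newer Spark versions also offer trigger(availableNow=True).
        .trigger(once=True)
        .start(target_path)
    )
    query.awaitTermination()
    return query
```

Scheduling this function as a periodic job gives batch-style cost with streaming-style bookkeeping: a plain `df.write` would need its own logic to work out which rows are new.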

  • 0 kudos
