Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826994223
by Honored Contributor III
  • 973 Views
  • 1 replies
  • 0 kudos

Delta Table to Spark Streaming to Synapse Table in Azure Databricks

Is there a way to keep my Synapse database always in sync with the latest data from a Delta table? I believe my Synapse database doesn't support streaming as a sink. Is there a workaround?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

You could try to keep the data in sync by appending the new data DataFrame in a foreachBatch on your write stream. This method allows for arbitrary ways to write data; you can connect to the data warehouse with JDBC if necessary. With your batch functi...
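A minimal sketch of that approach, assuming a JDBC connection to Synapse (the table names, JDBC URL, credentials, and checkpoint path below are placeholders, not details from this thread):

import org.apache.spark.sql.DataFrame

// Stream from the Delta table and push each micro-batch to Synapse over JDBC.
spark.readStream
  .format("delta")
  .table("source_delta_table")               // placeholder table name
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .format("jdbc")                        // a dedicated Synapse connector could be used instead
      .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<db>")  // placeholder
      .option("dbtable", "dbo.target_table") // placeholder
      .option("user", "<user>")
      .option("password", "<password>")
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/synapse_sync")  // placeholder path
  .start()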

brickster_2018
by Esteemed Contributor
  • 1029 Views
  • 1 replies
  • 0 kudos

Resolved! Is there any way to control the autoOptimize interval?

I can see my streaming jobs running OPTIMIZE jobs frequently. Is there any property I can use to control the autoOptimize interval?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

autoOptimize is not performed on a time basis; it's an event-based trigger. Once the Delta table/partition has 50 files (the default value of spark.databricks.delta.autoCompact.minNumFiles), auto-compaction is triggered. To reduce the frequency, inc...
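For example, raising that threshold so compaction kicks in less often (a sketch; the exact value is arbitrary and config behaviour can vary by DBR version):

// Raise the file-count threshold that triggers auto-compaction (default is 50).
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "200")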

User16826992666
by Valued Contributor
  • 1370 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

To time travel to a particular version, it's necessary to have the JSON file for that particular version. The JSON files in the _delta_log have a default retention of 30 days, so by default we can time travel only up to 30 days back. The retention of the D...
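A short illustration of both points, reading an older version and extending how long the log JSON files are kept (the path, version number, and retention value are placeholders):

// Time travel works only while that version's log entry still exists.
val oldDf = spark.read.format("delta")
  .option("versionAsOf", "5")          // placeholder version
  .load("/mnt/delta/events")           // placeholder path

// Keep the _delta_log JSON files longer than the 30-day default.
spark.sql("ALTER TABLE delta.`/mnt/delta/events` SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 60 days')")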

brickster_2018
by Esteemed Contributor
  • 3272 Views
  • 1 replies
  • 0 kudos

Resolved! Unable to overwrite the schema of a Delta table

As per the docs, I can overwrite the schema of a Delta table using the "overwriteSchema" option. But I am unable to overwrite the schema of my Delta table.

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

When Table ACLs are enabled, we can't change the schema of a table through a write operation: a write requires only MODIFY permissions, whereas schema changes require OWN permissions. Hence overwriting the schema is not supported when Table ACLs are enabled for the D...
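For reference, a sketch of the usual schema overwrite when ACLs don't block it (the table name is a placeholder):

// A toy DataFrame whose schema differs from the existing table's schema.
val df = spark.range(10).toDF("new_id")

df.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")     // needs OWN permission when Table ACLs are on
  .saveAsTable("my_database.my_table")   // placeholder name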

brickster_2018
by Esteemed Contributor
  • 4263 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

The below code can be used to get the number of records in a Delta table without querying it:

%scala
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql...
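A hedged sketch of where that truncated snippet appears to be going, summing the per-file statistics from the transaction log. The DeltaLog class is a Databricks-internal API, so package, method names, and signatures may differ across DBR versions; the path is a placeholder:

import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions._

// Sum the numRecords statistic recorded for each data file instead of scanning the data.
val deltaLog = DeltaLog.forTable(spark, new Path("dbfs:/mnt/delta/events"))   // placeholder path
val numRecords = deltaLog.snapshot.allFiles
  .select(get_json_object(col("stats"), "$.numRecords").cast("long").as("numRecords"))
  .agg(sum("numRecords"))
  .first()
  .getLong(0)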

aladda
by Honored Contributor II
  • 1133 Views
  • 0 replies
  • 0 kudos

What are the recommendations around collecting stats on long strings in a Delta Table

It is best to avoid collecting stats on long strings. You typically want to collect stats on columns that are used in filters, WHERE clauses, and joins, and on which you tend to perform aggregations (typically numerical values). You can avoid collecting s...
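One way to act on that (a sketch only; the table and column names are placeholders, and the column-reordering syntax assumes a DBR version that supports it):

// Only collect stats on the first few columns...
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')")

// ...and move a long free-text column past that index range so no stats are gathered on it.
spark.sql("ALTER TABLE events CHANGE COLUMN long_description long_description STRING AFTER amount")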

aladda
by Honored Contributor II
  • 4337 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

By default a Delta table has stats collected on the first 32 columns. This setting can be configured using the following:

SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3

However, there's a time trade-off to having a large n...
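For instance, set as a session-level default for newly created tables (the value shown is just the one from the reply above):

// New Delta tables created in this session will collect stats on only the first 3 columns.
spark.sql("SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3")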

aladda
by Honored Contributor II
  • 840 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

It's typically a good idea to run OPTIMIZE aligned with the frequency of updates to the Delta table. However, you also don't want to overdo it, as there's a cost/performance trade-off. Unless there are very frequent updates to the table that can cause sma...
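For example, a job scheduled to run after each periodic load could simply do the following (the table name is a placeholder):

// Compact small files once the periodic load completes.
spark.sql("OPTIMIZE events")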

aladda
by Honored Contributor II
  • 1045 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

OPTIMIZE merges small files into larger ones and can involve shuffling and the creation of large in-memory partitions. Thus it's recommended to use a memory-optimized executor configuration to prevent spilling to disk. In addition, use of autoscaling wil...
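One common way to bound the memory footprint is to optimize only the most recently written partitions rather than the whole table (the table name, partition column, and date are placeholders):

// Restrict OPTIMIZE to the newest partition(s) so each run touches less data.
spark.sql("OPTIMIZE events WHERE date >= '2021-06-01'")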

User16765131552
by Contributor III
  • 3116 Views
  • 3 replies
  • 0 kudos

COPY INTO: How to add partitioning?

The COPY INTO command from Databricks provides idempotent file ingestion into a Delta table (see here). From the docs, an example command looks like this:

COPY INTO delta.`target_path` FROM (SELECT key, index, textData, 'constant_value' FROM 'sour...

Latest Reply
Mooune_DBU
Valued Contributor
  • 0 kudos

If you're looking to partition your `target_path` table, then it's recommended to define the partition keys prior to the COPY INTO command (at the DDL level). E.g.:

// Drop the table if it already exists without the partition key defined (OPTIONAL)
DROP TAB...
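A sketch of that ordering using the columns from the question above (the paths, file format, and the choice of partition column are placeholders, not from the hidden replies):

// 1. Create the target table with the partition key declared up front.
spark.sql("""
  CREATE TABLE IF NOT EXISTS target (
    key INT,
    index INT,
    textData STRING,
    constant_value STRING
  )
  USING DELTA
  PARTITIONED BY (constant_value)
  LOCATION 'target_path'
""")

// 2. COPY INTO then loads into the already-partitioned table.
spark.sql("""
  COPY INTO delta.`target_path`
  FROM (SELECT key, index, textData, 'constant_value' AS constant_value FROM 'source_path')
  FILEFORMAT = PARQUET
""")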

2 More Replies
Anonymous
by Not applicable
  • 1231 Views
  • 2 replies
  • 0 kudos

Changing default Delta behavior in DBR 8.x for writes

Is there any way to add a Spark config that reverts the default behavior for table writes from Delta back to Parquet in DBR 8.0+? I know you can simply specify .format("parquet"), but that could involve a decent amount of code change for some client...
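One possibility (an assumption on my part, not quoted from the replies below) is the session config that controls the default data source when no format is specified:

// Assumption: revert the default data source to parquet; on DBR 8.x it defaults to "delta".
// Setting this at the cluster level avoids per-write code changes.
spark.conf.set("spark.sql.sources.default", "parquet")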

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Thanks @Ryan Chynoweth!

1 More Replies
User16826987838
by Contributor
  • 1149 Views
  • 2 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

def getVaccumSize(table: String): Long = {
  // VACUUM ... DRY RUN lists the files that would be removed, without deleting them.
  val listFiles = spark.sql(s"VACUUM $table DRY RUN").select("path").collect().map(_(0)).toList
  var sum = 0L
  // Add up the size of each listed file.
  listFiles.foreach(x => sum += dbutils.fs.ls(x.toString)(0).size)
  sum
}

getVaccumSize("<yo...

1 More Replies
Anonymous
by Not applicable
  • 1407 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes. Under this mechanism, writes operate in three stages. Read: reads (if needed) the latest available version of the table to identify which files need to ...
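As a hedged illustration of what that means for writers: a conflicting concurrent commit surfaces as an exception that callers can catch and retry. The exception class shown is the open-source Delta Lake one and may live in a different package on Databricks runtimes; the table and statement are placeholders:

import io.delta.exceptions.ConcurrentAppendException

// Retry the write if validation fails because another writer committed conflicting files.
def withRetry(attempts: Int)(body: => Unit): Unit = {
  try body
  catch {
    case _: ConcurrentAppendException if attempts > 1 => withRetry(attempts - 1)(body)
  }
}

withRetry(3) {
  spark.sql("DELETE FROM events WHERE event_date < '2020-01-01'")  // placeholder statement
}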
