Data Engineering

Forum Posts

User16752239289
by Valued Contributor
  • 1762 Views
  • 1 replies
  • 0 kudos

The cluster with the instance profile cannot access the S3 bucket. 403 permission denied is thrown

The documentation has been followed to configure the instance profile. The EC2 instance is able to access the S3 bucket when configured with the same instance profile. However, the cluster configured to use the same instance profile failed to access the S3 buc...

Latest Reply
User16752239289
Valued Contributor
  • 0 kudos

I suspect this is because AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY have been added to the Spark environment variables. You can run %sh env | grep -i aws on your cluster and make sure AWS_ACCESS_KEY_ID is not present. If it is, then please remove it e...
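A Python equivalent of that check, as a quick sketch:

# Quick sketch: list whether AWS credential variables are visible to the notebook's
# Python process; static keys here would override the instance profile's role.
import os

for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"):
    print(f"{name} set: {name in os.environ}")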

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 1264 Views
  • 1 replies
  • 0 kudos

Resolved! Re-optimize in Delta not splitting large files into smaller files.

I am trying to re-optimize a Delta table with a max file size of 32 MB. But after changing spark.databricks.delta.optimize.maxFileSize and trying to optimize a partition, it doesn't split larger files into smaller ones. How can I get it to work?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

spark.databricks.delta.optimize.maxFileSize controls the target size to bin-pack files when you run the OPTIMIZE command. But it will not split larger files into smaller ones today. File splitting happens when ZORDER is run, however.
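A sketch of that workflow in a notebook (the table, partition filter, and ZORDER column below are placeholders):

# Lower the OPTIMIZE target file size (~32 MB here), then run OPTIMIZE with
# ZORDER, which rewrites (and can split) the existing files.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(32 * 1024 * 1024))

spark.sql("""
  OPTIMIZE my_db.my_table
  WHERE date = '2021-07-01'   -- assumed partition column
  ZORDER BY (event_id)
""")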

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 724 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could leverage SHOW GRANT, which displays the privileges:

SHOW GRANT [<user>] ON [CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name> | ANONYMOUS FUNCTION | ANY FILE]

You could use this code snippet ...
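For example, a minimal sketch from a notebook (the table and principal names are placeholders):

# Everything granted on a table:
spark.sql("SHOW GRANT ON TABLE my_db.my_table").show(truncate=False)

# Or scope the listing to one principal, matching the syntax quoted above:
spark.sql("SHOW GRANT `someone@example.com` ON TABLE my_db.my_table").show(truncate=False)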

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 766 Views
  • 1 replies
  • 0 kudos

Resolved! MERGE operation on PI data getting slower. How can I debug?

We have a structured streaming job configured to read from Event Hubs and persist to the Delta raw/bronze layer via MERGE inside a foreachBatch. However, of late, the merge process is taking a longer time. How can I optimize this pipeline?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta Lake completes a MERGE in two steps:
  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the update...
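As an illustration only (all table, view, and column names are placeholders; "updates" stands for the incoming micro-batch), a typical upsert where the ON clause also pins the partition column, which helps the inner-join step prune files instead of scanning the whole target table:

# Sketch of the MERGE described above, run via spark.sql in a notebook.
spark.sql("""
    MERGE INTO bronze.events AS t
    USING updates AS s
    ON  t.event_id   = s.event_id
    AND t.event_date = s.event_date   -- assumed partition column, enables file pruning
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")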

  • 0 kudos
Anonymous
by Not applicable
  • 477 Views
  • 0 replies
  • 0 kudos

What is Auto autologging?

How is it different from regular autologging? When should I consider enabling Auto autologging? How can I switch the feature on?

Anonymous
by Not applicable
  • 737 Views
  • 1 replies
  • 1 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

MLflow is an open source framework and you could pip install mlflow on your laptop, for example. https://mlflow.org/docs/latest/quickstart.html
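For instance, a minimal local run roughly along the lines of that quickstart; with no tracking server configured, results land in a local ./mlruns folder:

import mlflow

# Log a parameter and a metric to a local run
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)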

  • 1 kudos
User16826987838
by Contributor
  • 778 Views
  • 2 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

def getVaccumSize(table: String): Long = {
  // VACUUM ... DRY RUN lists the files that would be removed
  val listFiles = spark.sql(s"VACUUM $table DRY RUN").select("path").collect().map(_(0)).toList
  var sum = 0L
  // Add up the size of each file returned by the dry run
  listFiles.foreach(x => sum += dbutils.fs.ls(x.toString)(0).size)
  sum
}

getVaccumSize("<yo...

  • 0 kudos
1 More Replies
r_van_niekerk
by New Contributor II
  • 1497 Views
  • 2 replies
  • 1 kudos

I have a multi-part question around Databricks integration with Splunk?

Use Case Background: We have an ongoing SecOps project going live here in 4 weeks. We have set up Splunk to monitor syslog logs and want to integrate this with Delta. Our forwarder collects the data from remote machines then forwards data to the inde...

Latest Reply
aladda
Honored Contributor II
  • 1 kudos

The Databricks Add-on for Splunk, built as part of Databricks Labs, can be leveraged for Splunk integration. It's a bi-directional framework that allows for in-place querying of data in Databricks from within Splunk by running queries, notebooks or jobs ...

  • 1 kudos
1 More Replies
User16826994223
by Honored Contributor III
  • 494 Views
  • 1 replies
  • 0 kudos

Delta Live Tables Cluster

I want to understand more about the Delta Live Tables cluster. We do not have visibility into the cluster when it starts. I also heard that operational tasks like OPTIMIZE can happen on another cluster, leaving the original cluster for only the main work of data proces...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

The Delta Live Tables pipeline definition has a place to define the cluster configuration. DLT execution is encapsulated in the pipeline, and you monitor the overall pipeline, which is the higher-order function, vs. having to monitor the cluster itself.
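As a rough sketch of where that cluster configuration lives (field names follow the pipeline settings JSON; the labels and sizes here are illustrative, not a definitive spec):

# Hypothetical DLT pipeline settings, expressed as a Python dict
pipeline_settings = {
    "name": "my_dlt_pipeline",
    "clusters": [
        {"label": "default", "num_workers": 4},      # main data-processing work
        {"label": "maintenance", "num_workers": 1},  # maintenance tasks such as OPTIMIZE/VACUUM
    ],
}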

  • 0 kudos
User16826987838
by Contributor
  • 934 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Databricks recommends launching the cluster so that the Spark driver is on an on-demand instance, which allows saving the state of the cluster even after losing spot instance nodes. If you choose to use all spot instances including the driver, any ca...
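A sketch of the relevant part of a cluster spec for the Clusters API, assuming the standard aws_attributes fields; names and values are illustrative:

# Keep the driver on-demand while workers use spot capacity
cluster_spec = {
    "cluster_name": "spot-workers-on-demand-driver",
    "num_workers": 8,
    "aws_attributes": {
        "first_on_demand": 1,                  # first node (the driver) stays on-demand
        "availability": "SPOT_WITH_FALLBACK",  # remaining nodes use spot, fall back if unavailable
    },
}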

  • 0 kudos
Anonymous
by Not applicable
  • 703 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

For Delta in general, the Delta cache accelerates data reads by creating copies of remote files in nodes' local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote locatio...
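For reference, the cache is governed by a Spark conf (on worker types with local SSDs it is typically enabled by default); a minimal sketch of enabling it explicitly:

# Turn on the Delta cache for this cluster/session
spark.conf.set("spark.databricks.io.cache.enabled", "true")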

  • 0 kudos
aladda
by Honored Contributor II
  • 959 Views
  • 1 replies
  • 0 kudos

Resolved! I read that Delta supports concurrent writes to separate partitions of the table but I'm getting an error when doing so

I’m running 3 separate dbt processes in parallel. All of them are reading data from different Databricks databases, creating different staging tables by using a dbt alias, but they all at the end update/insert into the same target table. The 3 processes r...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

You’re likely running into the issue described here, and a solution to it as well. While Delta does support concurrent writers to separate partitions of a table, depending on your query structure (join/filter/where in particular), there may still be a n...
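One illustrative way to make the disjointness explicit (all names and literals below are hypothetical): pin the partition each process writes to directly in the MERGE condition, so Delta's conflict detection can see that the concurrent writers touch different partitions.

# Sketch: each of the 3 processes runs its own MERGE with its partition pinned
spark.sql("""
    MERGE INTO analytics.target AS t
    USING staging_a AS s
    ON  t.id = s.id
    AND t.source_system = 'system_a'   -- assumed partition column, pinned per process
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")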

  • 0 kudos
aladda
by Honored Contributor II
  • 3876 Views
  • 1 replies
  • 1 kudos
Latest Reply
aladda
Honored Contributor II
  • 1 kudos

The Databricks Add-on for Splunk, built as part of Databricks Labs, can be leveraged for Splunk integration. It's a bi-directional framework that allows for in-place querying of data in Databricks from within Splunk by running queries, notebooks or jobs ...

  • 1 kudos
Anonymous
by Not applicable
  • 4331 Views
  • 1 replies
  • 1 kudos

Resolved! Jobs - Delta Live tables difference

Can you please explain the difference between Jobs and Delta Live tables?

Latest Reply
aladda
Honored Contributor II
  • 1 kudos

Jobs are designed for automated execution (scheduled or manual) of Databricks notebooks, JARs, spark-submit jobs, etc. It's essentially a generic framework to run any kind of Data Engineering, Data Analysis or Data Science workload. Delta Live Tables, on the...

  • 1 kudos