Data Engineering

Forum Posts

User16752239289
by Valued Contributor
  • 1762 Views
  • 1 replies
  • 0 kudos

The cluster with the instance profile cannot access the S3 bucket. 403 permission denied is thrown

The documentation has been followed to configure the instance profile. The EC2 instance is able to access the S3 bucket when configured with the same instance profile. However, the cluster configured to use the same instance profile failed to access the S3 buc...

Latest Reply
User16752239289
Valued Contributor
  • 0 kudos

I suspect this is because AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY have been added to the Spark environment variables. You can run %sh env | grep -i aws on your cluster and make sure AWS_ACCESS_KEY_ID is not present. If it is, then please remove it e...
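A Python equivalent of that check, as a quick sketch:

# Quick sketch: list whether AWS credential variables are visible to the notebook's
# Python process; static keys here would override the instance profile's role.
import os

for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"):
    print(f"{name} set: {name in os.environ}")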

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 1264 Views
  • 1 replies
  • 0 kudos

Resolved! Re-optimize in Delta not splitting large files into smaller files.

I am trying to re-optimize a Delta table with a max file size of 32 MB. But after changing spark.databricks.delta.optimize.maxFileSize and trying to optimize a partition, it doesn't split larger files into smaller ones. How can I get it to work?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

spark.databricks.delta.optimize.maxFileSize controls the target size to bin-pack files when you run the OPTIMIZE command. But it will not split larger files into smaller ones today. File splitting happens when ZORDER is run, however.
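A sketch of that workflow in a notebook (the table, partition filter, and ZORDER column below are placeholders):

# Lower the OPTIMIZE target file size (~32 MB here), then run OPTIMIZE with
# ZORDER, which rewrites (and can split) the existing files.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(32 * 1024 * 1024))

spark.sql("""
  OPTIMIZE my_db.my_table
  WHERE date = '2021-07-01'   -- assumed partition column
  ZORDER BY (event_id)
""")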

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 724 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could leverage SHOW GRANT, which displays the privileges:

SHOW GRANT [<user>] ON [CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name> | ANONYMOUS FUNCTION | ANY FILE]

You could use this code snippet ...
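For example, a minimal sketch from a notebook (the table and principal names are placeholders):

# Everything granted on a table:
spark.sql("SHOW GRANT ON TABLE my_db.my_table").show(truncate=False)

# Or scope the listing to one principal, matching the syntax quoted above:
spark.sql("SHOW GRANT `someone@example.com` ON TABLE my_db.my_table").show(truncate=False)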

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 766 Views
  • 1 replies
  • 0 kudos

Resolved! MERGE operation on PI data getting slower. How can I debug?

We have a structured streaming job configured to read from Event Hubs and persist to the Delta raw/bronze layer via MERGE inside a foreachBatch. However, of late, the merge process is taking a longer time. How can I optimize this pipeline?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta Lake completes a MERGE in two steps:
  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the update...
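As an illustration only (all table, view, and column names are placeholders; "updates" stands for the incoming micro-batch), a typical upsert where the ON clause also pins the partition column, which helps the inner-join step prune files instead of scanning the whole target table:

# Sketch of the MERGE described above, run via spark.sql in a notebook.
spark.sql("""
    MERGE INTO bronze.events AS t
    USING updates AS s
    ON  t.event_id   = s.event_id
    AND t.event_date = s.event_date   -- assumed partition column, enables file pruning
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")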

  • 0 kudos
Anonymous
by Not applicable
  • 477 Views
  • 0 replies
  • 0 kudos

What is Auto autologging?

How is it different from regular autologging? When should I consider enabling Auto autologging? How can I switch the feature on?

Anonymous
by Not applicable
  • 737 Views
  • 1 replies
  • 1 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

MLflow is an open source framework and you could pip install mlflow on your laptop, for example. https://mlflow.org/docs/latest/quickstart.html
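For instance, a minimal local run roughly along the lines of that quickstart; with no tracking server configured, results land in a local ./mlruns folder:

import mlflow

# Log a parameter and a metric to a local run
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)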

  • 1 kudos
User16826987838
by Contributor
  • 778 Views
  • 2 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

def getVaccumSize(table: String): Long = {
  // VACUUM ... DRY RUN lists the files that would be removed
  val listFiles = spark.sql(s"VACUUM $table DRY RUN").select("path").collect().map(_(0)).toList
  var sum = 0L
  // Add up the size of each file returned by the dry run
  listFiles.foreach(x => sum += dbutils.fs.ls(x.toString)(0).size)
  sum
}

getVaccumSize("<yo...

  • 0 kudos
1 More Replies
r_van_niekerk
by New Contributor II
  • 1497 Views
  • 2 replies
  • 1 kudos

I have a multi-part question around Databricks integration with Splunk?

Use Case Background: We have an ongoing SecOps project going live here in 4 weeks. We have set up Splunk to monitor syslog logs and want to integrate this with Delta. Our forwarder collects the data from remote machines then forwards data to the inde...

Latest Reply
aladda
Honored Contributor II
  • 1 kudos

The Databricks Add-on for Splunk, built as part of Databricks Labs, can be leveraged for Splunk integration. It's a bi-directional framework that allows for in-place querying of data in Databricks from within Splunk by running queries, notebooks or jobs ...

  • 1 kudos
1 More Replies
User16826994223
by Honored Contributor III
  • 494 Views
  • 1 replies
  • 0 kudos

Delta Live Tables Cluster

I want to understand more about the Delta Live Tables cluster. We do not have visibility into the cluster when it starts. I also heard that operational tasks like OPTIMIZE can happen on another cluster, leaving the original cluster for only the main work of data proces...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

The Delta Live Tables pipeline definition has a place to define the cluster configuration. DLT execution is encapsulated in the pipeline, and you monitor the overall pipeline, which is the higher-order function, vs. having to monitor the cluster itself.
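As a rough sketch of where that cluster configuration lives (field names follow the pipeline settings JSON; the labels and sizes here are illustrative, not a definitive spec):

# Hypothetical DLT pipeline settings, expressed as a Python dict
pipeline_settings = {
    "name": "my_dlt_pipeline",
    "clusters": [
        {"label": "default", "num_workers": 4},      # main data-processing work
        {"label": "maintenance", "num_workers": 1},  # maintenance tasks such as OPTIMIZE/VACUUM
    ],
}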

  • 0 kudos
User16826987838
by Contributor
  • 934 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Databricks recommends launching the cluster so that the Spark driver is on an on-demand instance, which allows saving the state of the cluster even after losing spot instance nodes. If you choose to use all spot instances including the driver, any ca...
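A sketch of the relevant part of a cluster spec for the Clusters API, assuming the standard aws_attributes fields; names and values are illustrative:

# Keep the driver on-demand while workers use spot capacity
cluster_spec = {
    "cluster_name": "spot-workers-on-demand-driver",
    "num_workers": 8,
    "aws_attributes": {
        "first_on_demand": 1,                  # first node (the driver) stays on-demand
        "availability": "SPOT_WITH_FALLBACK",  # remaining nodes use spot, fall back if unavailable
    },
}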

  • 0 kudos
Anonymous
by Not applicable
  • 703 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

For Delta in general, the Delta cache accelerates data reads by creating copies of remote files in nodes' local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote locatio...
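For reference, the cache is governed by a Spark conf (on worker types with local SSDs it is typically enabled by default); a minimal sketch of enabling it explicitly:

# Turn on the Delta cache for this cluster/session
spark.conf.set("spark.databricks.io.cache.enabled", "true")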

  • 0 kudos
aladda
by Honored Contributor II
  • 959 Views
  • 1 replies
  • 0 kudos

Resolved! I read that Delta supports concurrent writes to separate partitions of the table but I'm getting an error when doing so

I’m running 3 separate dbt processes in parallel. All of them are reading data from different Databricks databases, creating different staging tables by using a dbt alias, but they all at the end update/insert into the same target table. The 3 processes r...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

You’re likely running into the issue described here, and a solution to it as well. While Delta does support concurrent writers to separate partitions of a table, depending on your query structure (join/filter/where in particular), there may still be a n...
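One illustrative way to make the disjointness explicit (all names and literals below are hypothetical): pin the partition each process writes to directly in the MERGE condition, so Delta's conflict detection can see that the concurrent writers touch different partitions.

# Sketch: each of the 3 processes runs its own MERGE with its partition pinned
spark.sql("""
    MERGE INTO analytics.target AS t
    USING staging_a AS s
    ON  t.id = s.id
    AND t.source_system = 'system_a'   -- assumed partition column, pinned per process
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")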

  • 0 kudos
aladda
by Honored Contributor II
  • 3876 Views
  • 1 replies
  • 1 kudos
Latest Reply
aladda
Honored Contributor II
  • 1 kudos

The Databricks Add-on for Splunk, built as part of Databricks Labs, can be leveraged for Splunk integration. It's a bi-directional framework that allows for in-place querying of data in Databricks from within Splunk by running queries, notebooks or jobs ...

  • 1 kudos
Anonymous
by Not applicable
  • 4331 Views
  • 1 replies
  • 1 kudos

Resolved! Jobs - Delta Live tables difference

Can you please explain the difference between Jobs and Delta Live tables?

Latest Reply
aladda
Honored Contributor II
  • 1 kudos

Jobs are designed for automated execution (scheduled or manual) of Databricks notebooks, JARs, spark-submit jobs, etc. It's essentially a generic framework to run any kind of Data Engineering, Data Analysis or Data Science workload. Delta Live Tables, on the...

  • 1 kudos