Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16826994223
by Honored Contributor III
  • 6942 Views
  • 1 reply
  • 0 kudos

Resolved! How Azure Databricks manages network security group rules

How Azure Databricks manages network security group rules

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

The NSG rules listed in the following sections represent those that Azure Databricks auto-provisions and manages in your NSG, by virtue of the delegation of your VNet’s host and container subnets to the Microsoft.Databricks/workspaces service. You do...

User16826994223
by Honored Contributor III
  • 4740 Views
  • 0 replies
  • 0 kudos

Virtual network requirements in Azure (VNet injection). The VNet that you deploy your Azure Databricks workspace to must meet the following requirement...

Virtual network requirements in Azure (VNet injection): The VNet that you deploy your Azure Databricks workspace to must meet the following requirements: Region: The VNet must reside in the same region as the Azure Databricks workspace. Subscription: The...

User16826994223
by Honored Contributor III
  • 1934 Views
  • 0 replies
  • 0 kudos

Benefits of using VNet injection in Azure Databricks: Connect Azure Databricks to other Azure services (such as Azure Storage) in a more secure manne...

Benefits of using VNet injection in Azure Databricks: Connect Azure Databricks to other Azure services (such as Azure Storage) in a more secure manner using service endpoints or private endpoints. Connect to on-premises data sources for use with Azure...

sajith_appukutt
by Honored Contributor II
  • 2729 Views
  • 1 reply
  • 0 kudos

Resolved! I'm using the Redshift data source to load data into Spark SQL DataFrames. However, I'm not seeing predicate pushdown for my queries run on Redshift - is that expected?

I was expecting filter operations to be pushed down to Redshift by the optimizer. However, the entire dataset is getting loaded from Redshift.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

The Spark driver for Redshift pushes the following operators down into Redshift: Filter, Project, Sort, Limit, Aggregation, and Join. However, it does not support expressions operating on dates and timestamps today. If you have a similar requirement, please add a fea...
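As a rough illustration, here is a minimal PySpark sketch of this behaviour; the connection options are placeholders and the table/column names (public.events, country, event_ts) are made up for the example:

# Minimal sketch of reading from Redshift with the spark-redshift data source.
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<pw>")
      .option("dbtable", "public.events")                  # hypothetical table
      .option("tempdir", "s3a://<bucket>/redshift-temp/")
      .option("forward_spark_s3_credentials", "true")
      .load())

# A plain comparison filter like this is eligible for pushdown into Redshift...
pushed = df.filter("country = 'US'")

# ...whereas a filter built from a date/timestamp expression is evaluated in Spark,
# so the full dataset is pulled from Redshift first.
not_pushed = df.filter("year(event_ts) = 2021")

# Inspect the physical plan (e.g. not_pushed.explain()) to see which filters were pushed.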

User16752239289
by Databricks Employee
  • 3005 Views
  • 1 reply
  • 0 kudos

The cluster with the instance profile cannot access the S3 bucket. 403 permission denied is thrown

The documentation has been followed to configure the instance profile. The EC2 instance is able to access the S3 bucket when configured with the same instance profile. However, the cluster configured to use the same instance profile failed to access the S3 buc...

Latest Reply
User16752239289
Databricks Employee
  • 0 kudos

I suspect this is due to AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY having been added to the Spark environment variables. You can run %sh env | grep -i aws on your cluster and make sure AWS_ACCESS_KEY_ID is not present. If it is, then please remove it e...
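As a quick sanity check alongside the %sh command above, here is a small driver-side Python sketch for spotting those variables (nothing here is Databricks-specific):

import os

# Static credentials in the environment take precedence over the instance profile
# and can cause 403 errors against buckets they don't cover.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    status = "SET - remove it from the cluster's environment variables" if var in os.environ else "not set"
    print(f"{var}: {status}")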

sajith_appukutt
by Honored Contributor II
  • 2750 Views
  • 1 reply
  • 0 kudos

Resolved! Re-optimize in Delta not splitting large files into smaller files.

I am trying to re-optimize a Delta table with a max file size of 32 MB. But after changing spark.databricks.delta.optimize.maxFileSize and trying to optimize a partition, it doesn't split larger files into smaller ones. How can I get it to work?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

spark.databricks.delta.optimize.maxFileSize controls the target size used to bin-pack files when you run the OPTIMIZE command, but it will not split larger files into smaller ones today. File splitting happens when ZORDER is run, however.
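To make the distinction concrete, here is a minimal sketch; the table, partition column, and Z-order column (events, event_date, user_id) are placeholders:

# Target size used when bin-packing files during OPTIMIZE (32 MB here).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 32 * 1024 * 1024)

# Plain OPTIMIZE compacts small files up toward the target but leaves oversized files alone.
spark.sql("OPTIMIZE events WHERE event_date = '2021-06-01'")

# ZORDER rewrites the selected files, so oversized files get rewritten (and split) as well.
spark.sql("OPTIMIZE events WHERE event_date = '2021-06-01' ZORDER BY (user_id)")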

sajith_appukutt
by Honored Contributor II
  • 1868 Views
  • 1 reply
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could leverage SHOW GRANT, which displays the privileges: SHOW GRANT [<user>] ON [CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name> | ANONYMOUS FUNCTION | ANY FILE]. You could use this code snippet ...
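The snippet mentioned above is cut off in the preview; as a stand-in, here is one hedged way to run SHOW GRANT from a notebook and filter the result (the table name and principal are placeholders, and the Principal output column is assumed):

# List privileges on a (placeholder) table and narrow the result to one principal.
grants = spark.sql("SHOW GRANT ON TABLE my_db.my_table")
grants.filter("Principal = 'someone@example.com'").show(truncate=False)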

sajith_appukutt
by Honored Contributor II
  • 1977 Views
  • 1 reply
  • 0 kudos

Resolved! MERGE operation on PI data getting slower. How can I debug?

We have a structured streaming job configured to read from Event Hubs and persist to the Delta raw/bronze layer via MERGE inside a foreachBatch. However, of late, the merge process is taking longer. How can I optimize this pipeline?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta Lake completes a MERGE in two steps:
1. Perform an inner join between the target table and the source table to select all files that have matches.
2. Perform an outer join between the selected files in the target and source tables and write out the update...
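For reference, here is a hedged sketch of the MERGE-inside-foreachBatch pattern from the question; the table name, key, and partition column (bronze.pi_events, tag_id, event_date) are made up. Adding a known partition constraint to the match condition narrows the inner-join step above, which is a common way to speed this pattern up:

from delta.tables import DeltaTable

def upsert_to_bronze(micro_batch_df, batch_id):
    bronze = DeltaTable.forName(spark, "bronze.pi_events")   # hypothetical target table
    (bronze.alias("t")
           .merge(
               micro_batch_df.alias("s"),
               # Constraining the match to the affected key and partition prunes files
               # during the inner-join step of MERGE.
               "t.tag_id = s.tag_id AND t.event_date = s.event_date")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

# Wired into the stream roughly as: df.writeStream.foreachBatch(upsert_to_bronze).start()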

Anonymous
by Not applicable
  • 1234 Views
  • 0 replies
  • 0 kudos

What is Auto auto-logging?

How is it different from regular autologging? When should I consider enabling Auto autologging? How can I switch the feature on?

Anonymous
by Not applicable
  • 1851 Views
  • 1 reply
  • 1 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

MLflow is an open-source framework, and you could pip install mlflow on your laptop, for example. See https://mlflow.org/docs/latest/quickstart.html
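To illustrate, here is a minimal local-run sketch once MLflow is installed; the parameter and metric values are arbitrary:

import mlflow

# With no tracking server configured, runs are logged locally under ./mlruns.
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)

# Browse the logged run afterwards with the local UI: mlflow ui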

User16826987838
by Contributor
  • 1906 Views
  • 2 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

def getVaccumSize(table: String): Long = {
  // VACUUM ... DRY RUN lists the files that would be deleted, without actually removing them
  val listFiles = spark.sql(s"VACUUM $table DRY RUN").select("path").collect().map(_(0)).toList
  var sum = 0L
  // Add up the size of each file the vacuum would remove
  listFiles.foreach(x => sum += dbutils.fs.ls(x.toString)(0).size)
  sum
}

getVaccumSize("<your-table-name>")

1 More Reply
r_van_niekerk
by Databricks Employee
  • 2917 Views
  • 2 replies
  • 1 kudos

I have a multi-part question around Databricks integration with Splunk.

Use Case Background: We have an ongoing SecOps project going live here in 4 weeks. We have set up Splunk to monitor syslog logs and want to integrate this with Delta. Our forwarder collects the data from remote machines, then forwards data to the inde...

Latest Reply
aladda
Databricks Employee
  • 1 kudos

The Databricks Add-on for Splunk, built as part of Databricks Labs, can be leveraged for Splunk integration. It's a bi-directional framework that allows for in-place querying of data in Databricks from within Splunk by running queries, notebooks, or jobs ...

1 More Reply
User16826994223
by Honored Contributor III
  • 1026 Views
  • 1 reply
  • 0 kudos

Delta Live Table Cluster

I want to understand more about the Delta Live Tables cluster. When the cluster starts, we do not have visibility into it. I also heard that operational tasks like OPTIMIZE can happen on another cluster, leaving the original cluster for only the main work of data proces...

Latest Reply
aladda
Databricks Employee
  • 0 kudos

The Delta Live Tables pipeline definition has a place to define the cluster configuration. DLT execution is encapsulated in the pipeline, and you monitor the overall pipeline, which is the higher-order construct, rather than having to monitor the cluster itself.
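For orientation, here is a hedged sketch of the cluster section of a DLT pipeline settings payload, written as a Python dict for readability; the names and sizes are placeholders, and the separate maintenance cluster reflects the point in the question about maintenance tasks such as OPTIMIZE running outside the main cluster:

# Illustrative pipeline settings fragment: the "clusters" block describes the compute.
pipeline_settings = {
    "name": "bronze-ingest",                 # hypothetical pipeline name
    "clusters": [
        {
            "label": "default",              # cluster that does the main data processing
            "node_type_id": "i3.xlarge",     # placeholder node type
            "autoscale": {"min_workers": 1, "max_workers": 4},
        },
        {
            "label": "maintenance",          # cluster used for maintenance tasks such as OPTIMIZE/VACUUM
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    ],
}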

User16826987838
by Contributor
  • 2788 Views
  • 1 reply
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Databricks recommends launching the cluster so that the Spark driver is on an on-demand instance, which allows saving the state of the cluster even after losing spot instance nodes. If you choose to use all spot instances including the driver, any ca...
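As a concrete illustration of that recommendation, here is a hedged sketch of the aws_attributes portion of a cluster spec, shown as a Python dict mirroring the JSON; the sizes and bid setting are placeholder values:

# first_on_demand keeps the first node (the driver) on an on-demand instance,
# while the remaining workers can run on spot capacity.
cluster_spec = {
    "num_workers": 8,
    "aws_attributes": {
        "first_on_demand": 1,                  # driver stays on-demand
        "availability": "SPOT_WITH_FALLBACK",  # workers use spot, falling back to on-demand
        "spot_bid_price_percent": 100,         # placeholder bid setting
    },
}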

