Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

smoortema
by Contributor
  • 8 Views
  • 1 reply
  • 0 kudos

How can I tell which join type was used (broadcast, shuffle hash, or sort-merge join) for a query?

What is the best way to tell which kind of join was used for a SQL query: broadcast, shuffle hash, or sort-merge? And how should the Spark UI or the query plan be interpreted?

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Hello @smoortema, here are some helpful tips and tricks. Here’s how to quickly determine which join strategy Spark used—between broadcast hash join, shuffle hash join, and sort-merge join—and how to read both the query plan and the Spark UI to ver...
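For a quick self-check along those lines, the physical plan names the join node directly. A minimal sketch; the orders/customers tables are hypothetical placeholders:

```python
# Minimal sketch: read the physical plan to see which join strategy Spark chose.
# `spark` is the ambient SparkSession in a Databricks notebook; the
# orders/customers tables are hypothetical placeholders.
df = spark.sql("""
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")

# The plan names the join node directly: look for BroadcastHashJoin,
# ShuffledHashJoin, or SortMergeJoin in the output.
df.explain(mode="formatted")

# The same node names appear in the Spark UI: open the query under the
# SQL / DataFrame tab and inspect its execution graph.
```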

Shivaprasad
by New Contributor III
  • 20 Views
  • 0 replies
  • 0 kudos

Error while creating databricks custom app

I am trying to create a simple Databricks custom app but I am getting an "Error: Could not import 'app'." error.
app.yaml file:
env:
  - name: FLASK_APP
    value: '/Workspace/Users/sam@xxx.com/databricks_apps/hello-world_2025_11_13-16_19/Gaap_commentry/app'
comm...
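For what it's worth, Flask's "Could not import 'app'" error means it could not import the module named by FLASK_APP, which it expects to be an importable module name rather than a full workspace path. A minimal sketch, assuming the app lives in app.py next to app.yaml and FLASK_APP is set to just 'app':

```python
# Minimal sketch of app.py, assuming FLASK_APP is set to just 'app' (the module
# name) rather than the full /Workspace/... path; Flask resolves FLASK_APP='app'
# to the module-level object named `app` below.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from a Databricks app!"

if __name__ == "__main__":
    # The port is an assumption here; match whatever your app.yaml command expects.
    app.run(host="0.0.0.0", port=8080)
```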

DataRabbit
by New Contributor II
  • 21874 Views
  • 5 replies
  • 0 kudos

Resolved! py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.feature.VectorAssembler(java.lang.String) is not whitelisted.

Hello, I have a problem. When I try to use the MLlib VectorAssembler (from pyspark.ml.feature import VectorAssembler) I get this error and I don't know what to do anymore. Please help.

Latest Reply
VenuG
New Contributor III
  • 0 kudos

Do you plan to support this in the Serverless Free Edition? Migration from Community Edition to Serverless has been fraught with these limitations.

4 More Replies
Pratikmsbsvm
by Contributor
  • 24 Views
  • 2 replies
  • 1 kudos

How to Design a Data Quality Framework for a Medallion Architecture Data Pipeline

Hello, I am building a data pipeline which extracts data from Oracle Fusion and pushes it to the Databricks Delta Lake. I am using the Bronze, Silver, and Gold approach. Could someone please help me with how to control all three segments, that is Bronze, Silver, and Gold, wit...

Latest Reply
nayan_wylde
Esteemed Contributor
  • 1 kudos

Here’s how you can implement DQ at each stage:

Bronze Layer
Checks:
  • File format validation (CSV, JSON, etc.)
  • Schema validation (column names, types)
  • Row count vs. source system
Tools:
  • Use Databricks Autoloader with schema evolution and badRecordsPath
Impl...
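To make the Bronze bullet concrete, here is a hedged sketch of an Auto Loader ingest with schema evolution and badRecordsPath; all paths and table names are hypothetical placeholders:

```python
# Hedged sketch of a Bronze ingest with Auto Loader, schema evolution, and
# badRecordsPath. All paths and table names are hypothetical placeholders.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/dq/bronze/_schemas/src")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("badRecordsPath", "/Volumes/dq/bronze/_bad_records/src")
    .load("/Volumes/dq/landing/src/")
)

(
    raw.writeStream
    .option("checkpointLocation", "/Volumes/dq/bronze/_checkpoints/src")
    .trigger(availableNow=True)
    .toTable("main.bronze.src_raw")  # row counts vs. source can be checked downstream
)
```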

1 More Replies
Shalabh007
by Honored Contributor
  • 8884 Views
  • 6 replies
  • 19 kudos

Practice Exams for Databricks Certified Data Engineer Professional exam

Can anyone help with an official practice exam set for the Databricks Certified Data Engineer Professional exam, like the one we have for the Associate: Practice exam for the Databricks Certified Data Engineer Associate exam

Latest Reply
JOHNBOSCOW23
  • 19 kudos

I passed my exam today, thanks!

5 More Replies
anhnnguyen
by Visitor
  • 37 Views
  • 2 replies
  • 1 kudos

Adding a Maven dependency to an ETL pipeline

Hello guys, I'm building an ETL pipeline and need to access the HANA data lake file system. To do that I need the sap-hdlfs library in the compute environment; the library is available in the Maven repository. My job will have multiple notebook tasks and ETL ...

[Attachment: anhnnguyen_0-1763437214864.png]
Latest Reply
nayan_wylde
Esteemed Contributor
  • 1 kudos

DLT doesn’t have a UI for library installation, but you can:
Use the libraries configuration in the pipeline JSON or YAML spec:
{ "libraries": [ { "maven": { "coordinates": "com.sap.hana.hadoop:sap-hdlfs:<version>" } } ] }
Or...
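Restating the fragment from the reply as a hedged sketch, a Python dict mirroring the pipeline's JSON spec; the <version> placeholder is left unresolved:

```python
# A Python mirror of the libraries fragment from the reply above, suitable for
# merging into the pipeline's JSON settings (e.g., when editing the spec or
# calling the Pipelines API). The <version> placeholder must be filled in.
pipeline_settings_fragment = {
    "libraries": [
        {"maven": {"coordinates": "com.sap.hana.hadoop:sap-hdlfs:<version>"}}
    ]
}
```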

1 More Replies
Andolina1
by New Contributor III
  • 2889 Views
  • 6 replies
  • 1 kudos

How to trigger an Azure Data Factory pipeline through API using parameters

Hello all, I have a use case where I want to trigger an Azure Data Factory pipeline through an API. Right now I am calling the API in Databricks and using a Service Principal (token based) to connect to ADF from Databricks. The ADF pipeline has some paramete...

Latest Reply
rfranco
Visitor
  • 1 kudos

Hello @Andolina1, try to send your payload like:
body = {'curr_working_user': f'{parameters}'}
response = requests.post(url, headers=headers, json=body)
The pipeline's parameter should be named curr_working_user. With these changes your setup should work...
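For context, the usual end-to-end shape of this call against ADF's createRun REST endpoint looks roughly like the sketch below; the subscription, resource group, factory, pipeline names, and parameter value are placeholders, and the token comes from whatever Service Principal flow is already in use:

```python
# Hedged sketch: trigger an ADF pipeline run via the createRun REST endpoint
# with a Bearer token from a Service Principal. All IDs/names are placeholders.
import requests

token = "<AAD access token for https://management.azure.com/>"  # from your SP flow

url = (
    "https://management.azure.com/subscriptions/<sub-id>"
    "/resourceGroups/<rg>/providers/Microsoft.DataFactory"
    "/factories/<factory>/pipelines/<pipeline>/createRun"
    "?api-version=2018-06-01"
)
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

# The request body is a flat JSON object mapping pipeline parameter names to
# values, matching the parameter defined on the ADF pipeline (curr_working_user).
body = {"curr_working_user": "some_user@example.com"}

response = requests.post(url, headers=headers, json=body)
response.raise_for_status()
print(response.json())  # contains the runId of the triggered pipeline run
```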

5 More Replies
Zbyszek
by New Contributor
  • 18 Views
  • 1 reply
  • 0 kudos

Create a Hudi table with Databricks 17

Hi, I'm trying to run my existing code, which worked on an older DBR version:
CREATE TABLE IF NOT EXISTS catalog.demo.ABTHudi USING org.apache.hudi.Spark3DefaultSource OPTIONS ('primaryKey' = 'ID', 'hoodie.table.name' = 'ABTHudi') AS SELECT * FROM pa...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Support for Spark 4.0 is still an open issue: https://issues.apache.org/jira/browse/HUDI-7915. Please use the preview for testing, but you may have to think about the design for production with supported releases.

thewfhengineer
by New Contributor III
  • 9 Views
  • 0 replies
  • 0 kudos

AWS SageMaker to Azure Databricks

I'm starting a project to migrate our Compliance model, Python code (Pandas-based), from AWS SageMaker to the Azure ecosystem.
Source: AWS (SageMaker, Airflow)
Target: Azure (Databricks, ADLS)
I'm evaluating the high-level approach and would appreciate ...

BipinDatabricks
by New Contributor
  • 34 Views
  • 3 replies
  • 0 kudos

Using the Databricks SQL Statement Execution API

Team, we have an internal chatbot service that sends queries to the Databricks SQL Statement Execution API. The number of queries varies from 50 to 100 per minute, and we are trying to limit the response size by applying LIMIT 10. Basically, we are trying hard to use all o...

Latest Reply
Coffee77
Contributor III
  • 0 kudos

As a suggestion, you can also consider creating your own API to query your tables directly via JDBC/ODBC connections over a SQL Warehouse. In this case, the limitations would be only those associated with SQL Warehouses and your API, not the Databricks API...
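A minimal sketch of that suggestion using the databricks-sql-connector package against a SQL Warehouse; the hostname, HTTP path, token, and table name are placeholders:

```python
# Hedged sketch: query a SQL Warehouse directly with databricks-sql-connector
# instead of the Statement Execution API. Connection details are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token-or-sp-token>",
) as conn:
    with conn.cursor() as cursor:
        # Keep the response small, as the poster does with LIMIT 10.
        cursor.execute("SELECT * FROM catalog.schema.some_table LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```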

2 More Replies
ShanQiwei
by New Contributor
  • 40 Views
  • 2 replies
  • 0 kudos

I/F security when using a medallion architecture

I’m new to writing requirement definitions, and I’d like to ask a question about interface (I/F) security. My question is: do I need to define the authentication and security mechanisms (such as OAuth2, Managed Identity, Service Principals, etc.) betwe...

Latest Reply
Coffee77
Contributor III
  • 0 kudos

I'll try to summarize and go directly to the key points as I see this:
  • Client to S3: SAS Token or OAuth 2.0 with service-to-service authentication (preferred)
  • Databricks to S3: use a Service Principal or Managed Identities (preferred)
  • Bronze/Silver/...

1 More Replies
Techtic_kush
by New Contributor
  • 59 Views
  • 2 replies
  • 2 kudos

Can’t save results to target table – out-of-memory error

Hi team, I’m processing ~5,000 EMR notes with a Databricks notebook. The job reads from `crc_lakehouse.bronze.emr_notes`, runs SciSpaCy UMLS entity extraction plus a fine-tuned BERT sentiment model per partition, and builds a DataFrame (`df_entities`...

Latest Reply
bianca_unifeye
New Contributor III
  • 2 kudos

You’re right that the behaviour is weird at first glance (“5k rows on a 64 GB cluster and I blow up on write”), but your stack trace is actually very revealing: this isn’t a classic Delta write / shuffle OOM – it’s SciSpaCy/UMLS falling over when loa...
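The usual fix implied here is to load the model once per executor process rather than once per task or per row; a hedged sketch, where the model name, column names, and output schema are assumptions:

```python
# Hedged sketch: cache the heavy NLP model in a module-level global so each
# executor Python process loads it at most once, then use it from mapPartitions.
# The model name, column names, and output schema are assumptions.
_NLP = None  # cached per executor Python process

def _get_model():
    global _NLP
    if _NLP is None:
        import spacy
        _NLP = spacy.load("en_core_sci_sm")  # assumed SciSpaCy model name
    return _NLP

def extract_entities(rows):
    nlp = _get_model()  # loaded at most once per process, not once per row
    for row in rows:
        doc = nlp(row["note_text"])  # placeholder column name
        for ent in doc.ents:
            yield (row["note_id"], ent.text, ent.label_)

df = spark.table("crc_lakehouse.bronze.emr_notes")
df_entities = df.rdd.mapPartitions(extract_entities).toDF(["note_id", "entity", "label"])
```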

1 More Replies
mplang
by New Contributor
  • 4064 Views
  • 3 replies
  • 2 kudos

DLT x UC x Auto Loader

Now that the Directory Listing Mode of Auto Loader is officially deprecated, is there a solution for using File Notification Mode in a DLT pipeline writing to a UC-managed table? My understanding is that File Notification Mode is only available on si...

Data Engineering
autoloader
dlt
UC
Latest Reply
Raman_Unifeye
Contributor III
  • 2 kudos

Databricks introduced Managed File Events, which completely bypasses the need for the cluster's identity to provision cloud resources, resolving the conflict with the Shared cluster mode.
Steps to implement in DLT:
  • Enable File Events on the External Loca...

2 More Replies
Sainath368
by New Contributor III
  • 56 Views
  • 3 replies
  • 2 kudos

Migrating from directory listing to Auto Loader managed file events

Hi everyone, we are currently migrating from a directory listing-based streaming approach to managed file events in Databricks Auto Loader for processing our data in structured streaming. We have a function that handles structured streaming where we ar...

Latest Reply
Raman_Unifeye
Contributor III
  • 2 kudos

Yes, for your setup, Databricks Auto Loader will create a separate event queue for each independent stream running with the cloudFiles.useManagedFileEvents = true option. As you are running 1 stream per table, 1 unique directory per stream, and 1 uni...
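For reference, one such stream per table using the option named above might look like this hedged sketch; the format, paths, and table names are placeholders:

```python
# Hedged sketch of one stream per table with managed file events, using the
# cloudFiles.useManagedFileEvents option named in the reply. Format, paths,
# and table names are placeholders.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useManagedFileEvents", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/ops/meta/_schemas/tbl1")
    .load("abfss://data@account.dfs.core.windows.net/tbl1/")  # one directory per stream
    .writeStream
    .option("checkpointLocation", "/Volumes/ops/meta/_checkpoints/tbl1")
    .toTable("main.silver.tbl1")
)
```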

2 More Replies
