Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ck7007
by New Contributor II
  • 51 Views
  • 3 replies
  • 3 kudos

Advanced Technique

Reduced Monthly Databricks Bill from $47K to $12.7K. The Problem: We were scanning 2.3TB for queries needing only 8GB of data. Three Quick Wins: 1. Multi-dimensional Partitioning (30% savings). # Before: df.write.partitionBy("date").parquet(path) # After: parti...
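The preview cuts off mid-snippet; below is a minimal sketch of the before/after partitioning idea it describes. The paths and the added partition columns are assumptions, since the original snippet is truncated:

```python
# Minimal sketch of multi-dimensional partitioning (hypothetical paths/columns).
# Partitioning on the dimensions queries actually filter on lets Spark prune
# whole directories instead of scanning the full dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/raw/events")  # hypothetical source

# Before: single-dimension partitioning; any non-date filter scans everything
df.write.partitionBy("date").mode("overwrite").parquet("s3://bucket/curated/events")

# After: partition on the columns used in WHERE clauses (keep cardinality low)
(df.write
   .partitionBy("date", "region", "event_type")  # assumed columns
   .mode("overwrite")
   .parquet("s3://bucket/curated/events_v2"))
```

Partition columns should be low-cardinality and match the filters queries actually use; over-partitioning produces many small files and can make costs worse.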

Latest Reply
BS_THE_ANALYST
Honored Contributor III
  • 3 kudos

@ck7007 no worries. I asked a question on the other thread: https://community.databricks.com/t5/data-engineering/cost/td-p/130078, I'm not sure if you're classing this thread as the duplicate or the other one so I'll repost. I didn't see you mention ...

2 More Replies
Pratikmsbsvm
by Contributor
  • 70 Views
  • 2 replies
  • 0 kudos

Read Files from Adobe and Push to Delta table ADLS Gen2

The upstream is sending 2 files with different schemas. The storage account has private endpoints; there is no public access. No Public IP (NPIP) = yes. How to design using only Databricks: 1. Databricks API to read data file from Adobe and push it to AD...
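A rough sketch of step 1 under stated assumptions: the Adobe endpoint, token, and storage paths below are placeholders, and with NPIP enabled the cluster still needs a configured egress path (NAT gateway or firewall) to reach the Adobe API:

```python
# Sketch: pull a file from Adobe over HTTPS on the driver, then land it in a
# Delta table on ADLS Gen2. URL, token, and paths are all hypothetical.
import pandas as pd
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

resp = requests.get(
    "https://adobe.example.com/export/customers.csv",  # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},
    timeout=60,
)
resp.raise_for_status()

local_path = "/tmp/customers.csv"
with open(local_path, "wb") as f:
    f.write(resp.content)

# Small files can go through pandas on the driver; large files should be
# staged to cloud storage first.
df = spark.createDataFrame(pd.read_csv(local_path))
(df.write.format("delta")
   .mode("append")
   .save("abfss://bronze@mystorage.dfs.core.windows.net/adobe/customers"))
```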

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Pratikmsbsvm, okay, since you're going to use Databricks compute for data extraction and you wrote that your workspace is deployed with the secure connectivity cluster (NPIP) option enabled, you first need to make sure that you have a stable egre...

1 More Replies
brian999
by Contributor
  • 3356 Views
  • 5 replies
  • 2 kudos

Resolved! Managing libraries in workflows with multiple tasks - need to configure a list of libs for all tasks

I have workflows with multiple tasks, each of which needs 5 different libraries to run. When I have to update those libraries, I have to go in and make the update in each and every task. So for one workflow I have 20 different places where I have to g...

Latest Reply
brian999
Contributor
  • 2 kudos

Actually I think I found most of a solution here in one of the replies: https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters/m-p/37365/highlight/true#M245 It seems like I only have to define libs for the...
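For anyone who still needs to keep many tasks in sync, one alternative is to patch every task's library list programmatically. A sketch using the Databricks Python SDK, where the job ID and packages are placeholders:

```python
# Sketch: set the same library list on every task of a job in one shot,
# using the Databricks Python SDK. Job ID and packages are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library, PythonPyPiLibrary

w = WorkspaceClient()
job_id = 123456789  # hypothetical

common_libs = [
    Library(pypi=PythonPyPiLibrary(package="requests==2.32.3")),
    Library(pypi=PythonPyPiLibrary(package="pandas==2.2.2")),
]

job = w.jobs.get(job_id=job_id)
for task in job.settings.tasks:
    task.libraries = common_libs  # same list everywhere

w.jobs.update(job_id=job_id, new_settings=job.settings)
```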

4 More Replies
IONA
by New Contributor
  • 66 Views
  • 3 replies
  • 2 kudos

Getting data from the Spark query profiler

When you navigate to Compute > Select Cluster > Spark UI > JDBC/ODBC, you can see grids of session stats and SQL stats. Is there any way to get this data in a query so that I can do some analysis? Thanks

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @IONA, as @BigRoux correctly suggested, there is no native way to get stats from the JDBC/ODBC Spark UI. 1. You can try the query history system table, but it has a limited number of metrics: %sql SELECT * FROM system.query.history 2. You can use /api/2....
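A minimal sketch of option 1, assuming the system schema is enabled on the workspace; available columns vary by release, so start from SELECT * and inspect:

```python
# Sketch: pull recent entries from the query history system table for
# offline analysis in a notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
history = spark.sql("""
    SELECT *
    FROM system.query.history
    LIMIT 100
""")
history.show(truncate=False)
```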

2 More Replies
der
by New Contributor III
  • 164 Views
  • 1 reply
  • 0 kudos

DBR 17.1 Spatial SQL Functions and Apache Sedona

I noticed in the DBR 17.1 release notes that ST geospatial functions are now in public preview - great news for us since this means native support in Databricks. https://docs.databricks.com/aws/en/release-notes/runtime/17.1#expanded-spatial-sql-expres...
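A minimal sketch of calling the preview functions; the names follow the standard ST_ convention from the release notes, but the exact functions and signatures available on a given runtime are an assumption:

```python
# Sketch of the DBR 17.1 spatial SQL functions (public preview).
# st_point / st_astext are assumed from the standard ST_ naming; verify
# against the release notes for your runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT st_astext(st_point(11.39, 47.27)) AS wkt
""").show(truncate=False)
```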

Latest Reply
der
New Contributor III
  • 0 kudos

@mjohns @dbkent do you know more about Apache Sedona and DBR SQL Spatial Functions?

absan
by New Contributor II
  • 253 Views
  • 3 replies
  • 2 kudos

Lakeflow Connect SchemaParseException: Illegal character

Hi, I'm trying to set up Lakeflow Connect for SQL Server. The created gateway is failing with "org.apache.avro.SchemaParseException: Illegal character in: LN.FWH-ID". Unfortunately, I don't have control over the source database to change the column names. I...

Latest Reply
hippo
New Contributor II
  • 2 kudos

Okay, so there are a few options I could find: 1. Create a stored procedure that, when creating CDC tables, creates an intermediary clean table and runs CDC off of that. That table can use triggers to keep data in sync (better for lower volumes)....
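A minimal sketch of the intermediary clean-table idea, assuming pyodbc for connectivity; the connection string and all object names besides the FWH-ID column are hypothetical:

```python
# Sketch of option 1: a copy of the source table with Avro-safe column names
# (no '-'), created on the source server so CDC can run against it instead.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=user;PWD=..."  # placeholder
)
cur = conn.cursor()
cur.execute("""
    SELECT [FWH-ID] AS FWH_ID,    -- rename the column Avro rejects
           OtherCol1, OtherCol2   -- hypothetical remaining columns
    INTO dbo.LN_clean
    FROM dbo.LN
""")
conn.commit()
```

Triggers on the source table, as the reply notes, would then keep the clean copy in sync.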

2 More Replies
guilhermecs001
by Visitor
  • 52 Views
  • 1 reply
  • 2 kudos

How to work with 300 billion rows and 5 columns?

Hi guys! I'm having a problem at work where I need to process a customer data dataset with 300 billion rows and 5 columns. The transformations I need to perform are "simple," like joins to assign characteristics to customers. And at the end of the pro...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @guilhermecs001, wow, that's a massive number of rows. Can you somehow preprocess this huge CSV file first? For example, read the CSV, partition by some columns that make sense (maybe the country the customer comes from), and save that data as de...
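A minimal sketch of that preprocessing, where the storage paths and the country partition column are placeholders:

```python
# Sketch: read the raw CSV once, write it out as Delta partitioned by a
# low-cardinality column, then run the joins against the Delta copy
# instead of re-parsing CSV every time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "false")  # supply an explicit schema at this scale
       .csv("abfss://raw@storage.dfs.core.windows.net/customers/"))

(raw.write
    .format("delta")
    .partitionBy("country")             # hypothetical partition column
    .mode("overwrite")
    .save("abfss://bronze@storage.dfs.core.windows.net/customers_delta/"))
```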

ManojkMohan
by Contributor III
  • 163 Views
  • 11 replies
  • 9 kudos

Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy

Problem I am trying to solve: Bronze is the landing zone for immutable, raw data. At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances...
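A minimal sketch of that Bronze write, with placeholder paths; note that Snappy is already Spark's default Parquet codec, so the explicit option mostly documents intent:

```python
# Sketch: land raw CSV in the Bronze layer as Parquet with Snappy compression.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("header", "true").csv("s3://landing/raw_csv/")

(raw.write
    .format("parquet")
    .option("compression", "snappy")  # Spark's default for Parquet
    .mode("append")
    .save("s3://lake/bronze/events/"))
```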

Latest Reply
ManojkMohan
Contributor III
  • 9 kudos

Thanks all for your suggestions; trying the optimal next steps based on these responses. Will post an update here with screenshots soon.

10 More Replies
felix4572
by New Contributor
  • 107 Views
  • 6 replies
  • 2 kudos

transformWithStateInPandas throws "Spark connect directory is not ready" error

Hello, we employ arbitrary stateful aggregations in our data processing streams on Azure Databricks, and would like to migrate from applyInPandasWithState to transformWithStateInPandas. We employ the Python API throughout our solution, and some of our...

Latest Reply
Advika
Databricks Employee
  • 2 kudos

Hello @felix4572! Could you please share the driver log, or even better, the executor log (without any sensitive details)?

5 More Replies
DataDev
by Visitor
  • 61 Views
  • 4 replies
  • 3 kudos

Schedule databricks job based on custom calendar

I want to schedule Databricks jobs based on a custom calendar, e.g., skip the job run on arbitrary days or holidays. #databricks @DataBricks @DATA

Latest Reply
Pilsner
Contributor
  • 3 kudos

Hello @DataDev Nice idea, I haven't thought about this before, but I like the suggestion. If I had to implement a custom schedule, there are two ways that come to mind. Firstly, if the schedule is relatively regular, with just an occasional day missed,...
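One common pattern (possibly one of the two alluded to above) is to schedule the job daily and skip in code; a minimal sketch, with placeholder dates:

```python
# Sketch: the first task of the job checks today's date against a custom
# calendar and exits early on skip days. Dates below are placeholders.
import datetime

SKIP_DATES = {
    datetime.date(2025, 12, 25),
    datetime.date(2026, 1, 1),
}

if datetime.date.today() in SKIP_DATES:
    # dbutils is available in Databricks notebooks without an import;
    # this ends the task successfully.
    dbutils.notebook.exit("Skipped: custom-calendar day")

# ...normal job logic continues here...
```

Note that downstream tasks would still run after a successful exit; skipping a whole multi-task job would additionally need an If/else condition or a task value check.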

3 More Replies
victorNilsson
by New Contributor II
  • 46 Views
  • 1 reply
  • 1 kudos

Read polars from recently created csv file

More and more Python packages are transitioning to Polars instead of, e.g., pandas. There is a problem with this in Databricks when trying to read a CSV file with pl.read_csv("filename.csv") when the file has been created in the same notebook cel...

Data Engineering
csv
file system
OSError
polars
Latest Reply
Pilsner
Contributor
  • 1 kudos

Hello @victorNilsson I have tried to replicate this issue on my end, but unfortunately was unsuccessful as it worked the first time for me. I have, however, still tried to search for a solution. I believe the issue you are getting could be linked to t...
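One hedged workaround, under the assumption that the OSError comes from reading a just-written file on a fuse-mounted path (/dbfs or /Workspace): keep the write/read round-trip on the driver's local disk and copy the file out afterwards:

```python
# Sketch: write and immediately re-read the CSV on driver-local disk, where
# fresh writes are visible right away, then copy to DBFS for other consumers.
# Paths are placeholders.
import shutil
import polars as pl

local_path = "/tmp/filename.csv"
pl.DataFrame({"a": [1, 2, 3]}).write_csv(local_path)

df = pl.read_csv(local_path)  # local read of the freshly written file
print(df.shape)

shutil.copy(local_path, "/dbfs/tmp/filename.csv")  # hypothetical target
```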

Sainath368
by New Contributor III
  • 80 Views
  • 1 reply
  • 0 kudos

Is Photon Acceleration Helpful for All Maintenance Tasks (OPTIMIZE, VACUUM, ANALYZE_COMPUTE_STATS)?

Hi everyone,We’re currently reviewing the performance impact of enabling Photon acceleration on our Databricks jobs, particularly those involving table maintenance tasks. Our job includes three main operations: OPTIMIZE, VACUUM, and ANALYZE_COMPUTE_S...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Sainath368, I wouldn't use Photon for this kind of task. You should use it primarily for ETL transformations, where it shines. VACUUM and OPTIMIZE are more maintenance tasks, and using Photon for them would be a pricey overkill. According to documentatio...
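For reference, the three operations from the question as they'd run on a regular (non-Photon) job cluster; the table name is a placeholder:

```python
# Sketch: run the three maintenance operations from a notebook or job task.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = "main.sales.transactions"  # hypothetical table

spark.sql(f"OPTIMIZE {table}")
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")  # default retention window
spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS")
```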

merca
by Valued Contributor II
  • 12016 Views
  • 13 replies
  • 7 kudos

Value array {{QUERY_RESULT_ROWS}} in Databricks SQL alerts custom template

Please include in the documentation an example of how to incorporate the `QUERY_RESULT_ROWS` variable in a custom template.

Latest Reply
CJK053000
New Contributor III
  • 7 kudos

Databricks confirmed this was an issue on their end and it should be resolved now. It is working for me.

12 More Replies
dbdev
by New Contributor II
  • 559 Views
  • 8 replies
  • 3 kudos

Maven libraries in VNet injected, UC enabled workspace on Standard Access Mode Cluster

Hi! As the title suggests, I want to install Maven libraries on my cluster with access mode 'Standard'. Our workspace is VNet injected and has Unity Catalog enabled. The coordinates have been allowlisted by the account team according to these instructio...

Latest Reply
dbdev
New Contributor II
  • 3 kudos

@nayan_wylde @szymon_dybczak I just tried using a JAR I uploaded to an allowlisted Volume (Oracle's ojdbc8) and I get the same error. It seems like I'm able to install a JAR, but once it's installed my cluster is broken.

7 More Replies
Vamsi_S
by New Contributor
  • 55 Views
  • 1 reply
  • 0 kudos

Ingest data from SQL Server

I've been working on data ingestion from SQL Server to UC using Lakeflow Connect. Lakeflow Connect actually made the work easier when everything is right. I am trying to incorporate this with DAB, and this works fine with schema and table tags fo...

Latest Reply
Khaja_Zaffer
Contributor
  • 0 kudos

Hello @Vamsi_S Good day! Did you try preprocessing table names in CI/CD and generating YAML dynamically (recommended for dynamic, automated ingestion)? Did you contact your Databricks account manager (in case you're working with a company) for a feature request...
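A minimal sketch of the generate-YAML-in-CI/CD idea; the resource structure below is illustrative, not the exact Lakeflow Connect ingestion schema:

```python
# Sketch: build a bundle resource file from a table list discovered in CI,
# then let the DAB deploy step include the generated YAML.
import yaml

tables = ["dbo.orders", "dbo.customers", "dbo.items"]  # discovered in CI

resource = {
    "resources": {
        "pipelines": {
            "sqlserver_ingestion": {
                "name": "sqlserver-ingestion",
                # illustrative shape; adapt to the real ingestion spec
                "objects": [{"table": {"source": t}} for t in tables],
            }
        }
    }
}

with open("resources/ingestion.generated.yml", "w") as f:
    yaml.safe_dump(resource, f, sort_keys=False)
```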

