Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

kjoth
by Contributor II
  • 17773 Views
  • 9 replies
  • 7 kudos

How to make the job fail via code after handling exception

Hi, we are capturing the exception when an error occurs using try/except, but we want the job status to be Failed once we get the exception. What's the best way to do that? We are using PySpark.

Latest Reply
kumar_ravi
New Contributor III
  • 7 kudos

You can do a small hack around it:

    dbutils = get_dbutils(spark)
    tables_with_exceptions = []
    for table_config in table_configs:
        try:
            process(spark, table_config)
        except Exception as e:
            exception_detail = f"Error p...

8 More Replies
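
The excerpt above is cut off; the essential last step of that pattern, sketched here under the assumption that each failure is appended to tables_with_exceptions inside the except block, is to raise once after the loop so the notebook run (and therefore the job) is marked as failed:

    # continuing the reply's idea: remember the failures, then raise at the end
    if tables_with_exceptions:
        raise Exception(f"{len(tables_with_exceptions)} table(s) failed: {tables_with_exceptions}")
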
Mr__D
by New Contributor II
  • 15768 Views
  • 7 replies
  • 1 kudos

Resolved! Writing modular code in Databricks

Hi All, could you please suggest the best way to write PySpark code in Databricks? I don't want to write my code in a Databricks notebook, but rather create Python files (a modular project) in VSCode and call only the primary function in the notebook (the res...

Latest Reply
Gamlet
New Contributor II
  • 1 kudos

Certainly! To write PySpark code in Databricks while maintaining a modular project in VSCode, you can organize your PySpark code into Python files in VSCode, with a primary function encapsulating the main logic. Then, upload these files to Databricks...

6 More Replies
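
A minimal sketch of the layout the reply describes, assuming the files are synced to the workspace (for example via a Databricks Repo); module, path and table names are illustrative:

    # my_project/etl.py -- developed in VSCode, only this entry point is called from the notebook
    from pyspark.sql import DataFrame, SparkSession

    def run(spark: SparkSession, source_path: str, target_table: str) -> DataFrame:
        df = spark.read.parquet(source_path)
        df.write.mode("overwrite").saveAsTable(target_table)
        return df

    # notebook cell:
    # from my_project.etl import run
    # run(spark, "/mnt/raw/events", "silver.events")
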
User16776430979
by New Contributor III
  • 46705 Views
  • 3 replies
  • 5 kudos

Best practices around bronze/silver/gold (medallion model) data lake classification?

What's the best way to organize our data lake and Delta setup? We're trying to use the bronze, silver and gold classification strategy. The main question is: how do we know which classification the data has inside Databricks if there's no actual physica...

Latest Reply
-werners-
Esteemed Contributor III
  • 5 kudos

With Unity Catalog taken into account, it is certainly a good idea to think about your physical data storage. As you cannot have overlap between volumes and tables, this can become cumbersome. For example, we used to store delta tables of a data object in the same dir...

2 More Replies
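
One common way to make the layer explicit, sketched here under the assumption that Unity Catalog is enabled and without claiming to be the thread's accepted answer: carry bronze/silver/gold in catalog or schema names rather than in storage paths (catalog, schema and table names are illustrative):

    # the layer is encoded in the schema name, not in where the files physically live
    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.bronze")
    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.silver")
    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.gold")

    # raw_df is assumed to exist; it lands in the bronze layer as a managed table
    raw_df.write.mode("append").saveAsTable("lakehouse.bronze.orders")
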
Chris_Shehu
by Valued Contributor III
  • 18529 Views
  • 5 replies
  • 5 kudos

Resolved! What is the best way to handle big data sets?

I'm trying to find the best strategy for handling big data sets. In this case I have something that is 450 million records. I'm pulling the data from SQL Server very quickly, but when I try to push the data to the Delta table or an Azure container the...

Latest Reply
Wilynan
New Contributor II
  • 5 kudos

I think you should consult experts in Big Data for advice on this issue

4 More Replies
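
A generic sketch of one common approach for a load of this size, not necessarily the accepted answer in this thread: read the JDBC source with a partitioned query so both the read and the Delta write run in parallel (connection details, bounds and names are placeholders):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")   # placeholder
          .option("dbtable", "dbo.big_table")                                # placeholder
          .option("user", "<user>")
          .option("password", "<password>")
          .option("partitionColumn", "id")        # a numeric, ideally indexed column
          .option("lowerBound", "1")
          .option("upperBound", "450000000")
          .option("numPartitions", "64")          # parallel JDBC reads -> parallel Delta write
          .load())

    df.write.format("delta").mode("overwrite").saveAsTable("bronze.big_table")
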
Nick_Hughes
by New Contributor III
  • 7981 Views
  • 3 replies
  • 1 kudos

Best way to generate fake data using underlying schema

Hi, we are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold-layer fact table from 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships ...

Latest Reply
RonanStokes_DB
Databricks Employee
  • 1 kudos

Hi @Nick_Hughes, this may be late for your scenario, but hopefully others facing similar issues will find it useful. You can specify how data is generated in `dbldatagen` using rules in the data generation spec. If rules are specified for data generat...

2 More Replies
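
The reply refers to dbldatagen (the Databricks Labs data generator). A minimal sketch of a generation spec is below; the column names and value ranges are illustrative, and keeping key ranges aligned across specs is one way to preserve relationships between generated tables:

    import dbldatagen as dg

    spec = (dg.DataGenerator(spark, name="fact_sales", rows=1_000_000, partitions=8)
            # reuse the same key range in the dimension specs so joins still resolve
            .withColumn("customer_id", "long", minValue=1, maxValue=50_000, random=True)
            .withColumn("product_id", "long", minValue=1, maxValue=2_000, random=True)
            .withColumn("quantity", "int", minValue=1, maxValue=10, random=True)
            .withColumn("channel", "string", values=["web", "store", "partner"]))

    fake_df = spec.build()
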
Pbarbosa154
by New Contributor III
  • 1274 Views
  • 2 replies
  • 0 kudos

What is the best way to ingest GCS data into Databricks and apply Anomaly Detection Model?

I recently started exploring the field of Data Engineering and came across some difficulties. I have a bucket in GCS with millions of parquet files and I want to create an Anomaly Detection model with them. I was trying to ingest that data into Datab...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Pedro Barbosa: It seems like you are running out of memory when trying to convert the PySpark dataframe to an H2O frame. One possible approach to solve this issue is to partition the PySpark dataframe before converting it to an H2O frame. You can us...

1 More Replies
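
A hedged sketch of the Spark side of that suggestion, assuming the cluster already has GCS credentials configured; the bucket path and sampling fraction are placeholders, and the hand-off to H2O (as in the reply) would follow on the smaller dataframe:

    # read the parquet files directly from GCS
    df = spark.read.parquet("gs://<bucket>/raw/")

    # down-sample and repartition before converting to a single-node/H2O representation,
    # so the conversion does not pull the full dataset into memory at once
    smaller_df = df.sample(fraction=0.01).repartition(64)
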
lugger1
by New Contributor III
  • 2869 Views
  • 1 replies
  • 1 kudos

Resolved! What is the best way to use credentials for API calls from databricks notebook?

Hello, I have a Databricks account on Azure, and the goal is to compare different image-tagging services from Azure, GCP and AWS via the corresponding API calls, using a Python notebook. I have problems with GCP Vision API calls, specifically with credentials...

Latest Reply
lugger1
New Contributor III
  • 1 kudos

OK, here is a trick: in my case, the file with the GCP credentials is stored in the notebook workspace storage, which is not visible to the os.environ() command. So the solution is to read the content of this file and save it to the cluster storage attached to the no...
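
A minimal sketch of that trick; the workspace path and local path are illustrative, and the Google client libraries are assumed to honour GOOGLE_APPLICATION_CREDENTIALS:

    import os

    # the key file lives in workspace storage, so read its contents first
    with open("/Workspace/Users/<me>/gcp-key.json") as src:   # illustrative path
        key_json = src.read()

    # write it to driver-local storage, where the GCP client libraries can see it
    local_path = "/tmp/gcp-key.json"
    with open(local_path, "w") as dst:
        dst.write(key_json)

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = local_path
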

sbux
by New Contributor
  • 2410 Views
  • 2 replies
  • 0 kudos

What is the best practice for tracing Databricks: observe and writeStream data record flow

Trying to connect the dots on the method below, from a new event on Azure Event Hub, through storage, partition and Avro records (those I can monitor), to my Delta table. How do I trace observe, writeStream and the trigger? ... elif TABLE_TYPE == "live": print("D...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @David Martin, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

1 More Replies
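
A hedged sketch of one way to trace such a streaming write, assuming df is the streaming dataframe built from the Event Hub/Avro source; checkpoint location and target table are placeholders: attach named observe metrics and inspect the query's progress, which also reports the trigger details per micro-batch:

    from pyspark.sql import functions as F

    observed = df.observe("ingest_metrics", F.count(F.lit(1)).alias("rows_seen"))

    query = (observed.writeStream
             .format("delta")
             .option("checkpointLocation", "/tmp/checkpoints/ingest")   # placeholder
             .trigger(processingTime="1 minute")
             .toTable("bronze.events"))                                 # placeholder

    # after a micro-batch has run, the observed metrics and trigger info show up here
    print(query.lastProgress)
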
yousry
by New Contributor II
  • 4538 Views
  • 2 replies
  • 2 kudos

Resolved! What is the best way to find deltalake version on OSS and Databricks at runtime?

To identify which Delta Lake features are available on a certain installation, it is important to have a robust way to identify the Delta Lake version. For OSS, I found that the Scala snippet below will do the job:

    import io.delta
    println(io.delta.VERSION)

Not...

Latest Reply
shan_chandra
Databricks Employee
  • 2 kudos

@Yousry Mohamed - could you please check the DBR runtime release notes for the Delta Lake API compatibility matrix section (DBR version vs. compatible Delta Lake version) for the mapping? Reference: https://docs.databricks.com/release-notes/runtime/r...

1 More Replies
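
On Databricks the Delta Lake version is tied to the runtime, so a hedged way to get at it programmatically is to read the runtime version at runtime and map it through the compatibility matrix the reply points to; the environment variable below is set on Databricks clusters but not on OSS Spark:

    import os

    dbr_version = os.environ.get("DATABRICKS_RUNTIME_VERSION")
    print(dbr_version)   # e.g. "13.3" -> look up the Delta version in the release notes
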
dshao
by New Contributor II
  • 5846 Views
  • 2 replies
  • 0 kudos

Resolved! Best way to get one row back per ID? Select Distinct is not working.

Here is the current output for my select statement. I would like it to return one row for this jobsubmissionid, where it selects only the non-zero value from each of the rows. I tried using SELECT DISTINCT jobsubmissionid, but it still returned 5 rows.

Latest Reply
UmaMahesh1
Honored Contributor III
  • 0 kudos

Is that the complete query you are using? I'm guessing that you are using select distinct * from table_name. If you want a single column's distinct values, you have to apply a filter condition or aggregate the data accordingly. Anyway, a complete ...

1 More Replies
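
A sketch of the aggregation the reply hints at, assuming the non-zero value should survive per column: group by the id and take the max of the other columns (the value column names are illustrative):

    from pyspark.sql import functions as F

    deduped = (df.groupBy("jobsubmissionid")
                 .agg(F.max("col_a").alias("col_a"),
                      F.max("col_b").alias("col_b")))
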
pjp94
by Contributor
  • 2717 Views
  • 1 replies
  • 0 kudos

ERROR - Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

I get the below error when trying to run multi-threading; it fails towards the end of the run. My guess is that it's related to memory/worker config. I've seen some solutions involving modifying the number of workers or CPUs on the cluster; however, that's n...

Latest Reply
pjp94
Contributor
  • 0 kudos

Since I don't have permissions to change cluster configurations, the only solution that ended up working was setting a max thread count to about half of the actual max so I don't overload the containers. However, open to any other optimization ideas!
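
A minimal sketch of that workaround: cap the thread-pool size instead of letting every task spawn at once; the worker function, the work list and the cap itself are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    MAX_WORKERS = 8   # roughly half of the "natural" maximum, per the reply

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = list(pool.map(process_item, work_items))   # placeholders
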

chandan_a_v
by Valued Contributor
  • 9040 Views
  • 2 replies
  • 4 kudos

Best way to run the Databricks notebook in a parallel way

Hi All, I need to run a Databricks notebook in parallel for different arguments. I tried the threading approach, but only the first 2 threads successfully execute the notebook and the rest fail. Please let me know if there is any best way to...

Latest Reply
Vidula
Honored Contributor
  • 4 kudos

Hey there @Chandan Angadi, does @Hubert Dudek's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

1 More Replies
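
A common pattern for this scenario, sketched with placeholder notebook path, timeout and parameters: drive dbutils.notebook.run from a thread pool so each run gets its own arguments:

    from concurrent.futures import ThreadPoolExecutor

    def run_notebook(arg):
        # path, timeout (seconds) and parameter name are placeholders
        return dbutils.notebook.run("/Shared/my_notebook", 3600, {"param": str(arg)})

    args = ["a", "b", "c", "d"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        outputs = list(pool.map(run_notebook, args))
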
OliverLewis
by New Contributor
  • 2243 Views
  • 2 replies
  • 1 kudos

Parallelize spark jobs on the same cluster?

What's the best way to parallelize multiple Spark jobs on the same cluster during a backfill?

Latest Reply
ron_defreitas
Contributor
  • 1 kudos

In the past I used direct multi-threaded orchestration inside driver applications, but that was prior to Databricks supporting multi-task jobs. If you create a job through the Workflows tab, you can set up multiple notebook, Python, or JAR tasks t...

1 More Replies
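
A hedged sketch of the multi-task job the reply mentions, expressed as a Jobs API 2.1 create payload; the workspace host, token, cluster id, notebook path and parameters are placeholders, and tasks without depends_on run concurrently on the same cluster:

    import requests

    payload = {
        "name": "backfill",
        "tasks": [
            {"task_key": "backfill_2022",
             "existing_cluster_id": "<cluster-id>",
             "notebook_task": {"notebook_path": "/Shared/backfill",
                               "base_parameters": {"year": "2022"}}},
            {"task_key": "backfill_2023",
             "existing_cluster_id": "<cluster-id>",
             "notebook_task": {"notebook_path": "/Shared/backfill",
                               "base_parameters": {"year": "2023"}}},
        ],
    }
    requests.post(f"{host}/api/2.1/jobs/create",
                  headers={"Authorization": f"Bearer {token}"},   # placeholders
                  json=payload)
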
Leladams
by New Contributor III
  • 9950 Views
  • 9 replies
  • 2 kudos

What is the best way to read an MS Access .accdb database into Databricks from a mounted drive?

I am currently trying to read in .accdb files from a mounted drive. Based on my research, it looks like I would have to use a package like JayDeBeApi with UCanAccess drivers, or pyodbc with MS Access drivers. Will this work? Thanks for any help.

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Leland Adams, hope you are doing well. Thank you for posting your question and giving us additional information. Do you think you were able to solve the query? We'd love to hear from you.

8 More Replies
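
A hedged sketch of the JayDeBeApi + UCanAccess route the question mentions; the file path, jar location and table name are placeholders, and the UCanAccess jars are assumed to be available on the cluster:

    import jaydebeapi
    import pandas as pd

    conn = jaydebeapi.connect(
        "net.ucanaccess.jdbc.UcanaccessDriver",
        "jdbc:ucanaccess:///dbfs/mnt/shared/my_db.accdb",      # placeholder path
        [],
        "/dbfs/mnt/jars/ucanaccess-uber.jar")                  # placeholder jar

    # pull the table through JDBC into pandas, then hand it to Spark
    pdf = pd.read_sql("SELECT * FROM my_table", conn)          # placeholder table
    df = spark.createDataFrame(pdf)
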
JBOCACHICA
by New Contributor III
  • 499 Views
  • 0 replies
  • 1 kudos

First time at this event; honestly a very good event. Although this forum has no content in Spanish, the Spanish-speaking community is growing and w...

First time at this event; honestly a very good event. Although this forum has no content in Spanish, the Spanish-speaking community is growing, and we hope to contribute to the development of our countries through technology!
