Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

kjoth
by Contributor II
  • 17773 Views
  • 9 replies
  • 7 kudos

How to make the job fail via code after handling exception

Hi, we are capturing the exception when an error occurs using try/except, but we want the job status to be Failed once we get the exception. What's the best way to do that? We are using PySpark.

Latest Reply
kumar_ravi
New Contributor III
  • 7 kudos

You can do a small hack around it:

    dbutils = get_dbutils(spark)
    tables_with_exceptions = []
    for table_config in table_configs:
        try:
            process(spark, table_config)
        except Exception as e:
            exception_detail = f"Error p...

8 More Replies
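
The excerpt above is cut off; the essential last step of that pattern, sketched here under the assumption that each failure is appended to tables_with_exceptions inside the except block, is to raise once after the loop so the notebook run (and therefore the job) is marked as failed:

    # continuing the reply's idea: remember the failures, then raise at the end
    if tables_with_exceptions:
        raise Exception(f"{len(tables_with_exceptions)} table(s) failed: {tables_with_exceptions}")
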
Mr__D
by New Contributor II
  • 15768 Views
  • 7 replies
  • 1 kudos

Resolved! Writing modular code in Databricks

Hi All, could you please suggest the best way to write PySpark code in Databricks? I don't want to write my code in a Databricks notebook, but rather create Python files (a modular project) in VSCode and call only the primary function in the notebook (the res...

Latest Reply
Gamlet
New Contributor II
  • 1 kudos

Certainly! To write PySpark code in Databricks while maintaining a modular project in VSCode, you can organize your PySpark code into Python files in VSCode, with a primary function encapsulating the main logic. Then, upload these files to Databricks...

6 More Replies
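
A minimal sketch of the layout the reply describes, assuming the files are synced to the workspace (for example via a Databricks Repo); module, path and table names are illustrative:

    # my_project/etl.py -- developed in VSCode, only this entry point is called from the notebook
    from pyspark.sql import DataFrame, SparkSession

    def run(spark: SparkSession, source_path: str, target_table: str) -> DataFrame:
        df = spark.read.parquet(source_path)
        df.write.mode("overwrite").saveAsTable(target_table)
        return df

    # notebook cell:
    # from my_project.etl import run
    # run(spark, "/mnt/raw/events", "silver.events")
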
User16776430979
by New Contributor III
  • 46705 Views
  • 3 replies
  • 5 kudos

Best practices around bronze/silver/gold (medallion model) data lake classification?

What's the best way to organize our data lake and Delta setup? We're trying to use the bronze, silver and gold classification strategy. The main question is: how do we know which classification the data has inside Databricks if there's no actual physica...

Latest Reply
-werners-
Esteemed Contributor III
  • 5 kudos

With Unity Catalog taken into account, it is certainly a good idea to think about your physical data storage. As you cannot have overlap between volumes and tables, this can become cumbersome. For example, we used to store delta tables of a data object in the same dir...

2 More Replies
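
One common way to make the layer explicit, sketched here under the assumption that Unity Catalog is enabled and without claiming to be the thread's accepted answer: carry bronze/silver/gold in catalog or schema names rather than in storage paths (catalog, schema and table names are illustrative):

    # the layer is encoded in the schema name, not in where the files physically live
    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.bronze")
    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.silver")
    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.gold")

    # raw_df is assumed to exist; it lands in the bronze layer as a managed table
    raw_df.write.mode("append").saveAsTable("lakehouse.bronze.orders")
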
Chris_Shehu
by Valued Contributor III
  • 18529 Views
  • 5 replies
  • 5 kudos

Resolved! What is the best way to handle big data sets?

I'm trying to find the best strategy for handling big data sets. In this case I have something that is 450 million records. I'm pulling the data from SQL Server very quickly, but when I try to push the data to the Delta table or an Azure container the...

Latest Reply
Wilynan
New Contributor II
  • 5 kudos

I think you should consult experts in Big Data for advice on this issue

4 More Replies
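
A generic sketch of one common approach for a load of this size, not necessarily the accepted answer in this thread: read the JDBC source with a partitioned query so both the read and the Delta write run in parallel (connection details, bounds and names are placeholders):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")   # placeholder
          .option("dbtable", "dbo.big_table")                                # placeholder
          .option("user", "<user>")
          .option("password", "<password>")
          .option("partitionColumn", "id")        # a numeric, ideally indexed column
          .option("lowerBound", "1")
          .option("upperBound", "450000000")
          .option("numPartitions", "64")          # parallel JDBC reads -> parallel Delta write
          .load())

    df.write.format("delta").mode("overwrite").saveAsTable("bronze.big_table")
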
Nick_Hughes
by New Contributor III
  • 7981 Views
  • 3 replies
  • 1 kudos

Best way to generate fake data using underlying schema

Hi, we are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold-layer fact table from 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships ...

Latest Reply
RonanStokes_DB
Databricks Employee
  • 1 kudos

Hi @Nick_Hughes, this may be late for your scenario, but hopefully others facing similar issues will find it useful. You can specify how data is generated in `dbldatagen` using rules in the data generation spec. If rules are specified for data generat...

2 More Replies
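
The reply refers to dbldatagen (the Databricks Labs data generator). A minimal sketch of a generation spec is below; the column names and value ranges are illustrative, and keeping key ranges aligned across specs is one way to preserve relationships between generated tables:

    import dbldatagen as dg

    spec = (dg.DataGenerator(spark, name="fact_sales", rows=1_000_000, partitions=8)
            # reuse the same key range in the dimension specs so joins still resolve
            .withColumn("customer_id", "long", minValue=1, maxValue=50_000, random=True)
            .withColumn("product_id", "long", minValue=1, maxValue=2_000, random=True)
            .withColumn("quantity", "int", minValue=1, maxValue=10, random=True)
            .withColumn("channel", "string", values=["web", "store", "partner"]))

    fake_df = spec.build()
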
Pbarbosa154
by New Contributor III
  • 1274 Views
  • 2 replies
  • 0 kudos

What is the best way to ingest GCS data into Databricks and apply Anomaly Detection Model?

I recently started exploring the field of Data Engineering and came across some difficulties. I have a bucket in GCS with millions of parquet files and I want to create an Anomaly Detection model with them. I was trying to ingest that data into Datab...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Pedro Barbosa: It seems like you are running out of memory when trying to convert the PySpark dataframe to an H2O frame. One possible approach to solve this issue is to partition the PySpark dataframe before converting it to an H2O frame. You can us...

1 More Replies
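
A hedged sketch of the Spark side of that suggestion, assuming the cluster already has GCS credentials configured; the bucket path and sampling fraction are placeholders, and the hand-off to H2O (as in the reply) would follow on the smaller dataframe:

    # read the parquet files directly from GCS
    df = spark.read.parquet("gs://<bucket>/raw/")

    # down-sample and repartition before converting to a single-node/H2O representation,
    # so the conversion does not pull the full dataset into memory at once
    smaller_df = df.sample(fraction=0.01).repartition(64)
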
lugger1
by New Contributor III
  • 2869 Views
  • 1 replies
  • 1 kudos

Resolved! What is the best way to use credentials for API calls from databricks notebook?

Hello, I have a Databricks account on Azure, and the goal is to compare different image-tagging services from Azure, GCP and AWS via the corresponding API calls, using a Python notebook. I have problems with GCP Vision API calls, specifically with credentials...

Latest Reply
lugger1
New Contributor III
  • 1 kudos

OK, here is a trick: in my case, the file with the GCP credentials is stored in the notebook workspace storage, which is not visible to the os.environ() command. So the solution is to read the content of this file and save it to the cluster storage attached to the no...
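
A minimal sketch of that trick; the workspace path and local path are illustrative, and the Google client libraries are assumed to honour GOOGLE_APPLICATION_CREDENTIALS:

    import os

    # the key file lives in workspace storage, so read its contents first
    with open("/Workspace/Users/<me>/gcp-key.json") as src:   # illustrative path
        key_json = src.read()

    # write it to driver-local storage, where the GCP client libraries can see it
    local_path = "/tmp/gcp-key.json"
    with open(local_path, "w") as dst:
        dst.write(key_json)

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = local_path
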

sbux
by New Contributor
  • 2410 Views
  • 2 replies
  • 0 kudos

What is the best practice for tracing Databricks: observe and writeStream data record flow

Trying to connect the dots on the method below, from a new event on Azure Event Hub, through storage, partition and Avro records (those I can monitor), to my Delta table. How do I trace observe, writeStream and the trigger? ... elif TABLE_TYPE == "live": print("D...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @David Martin, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

1 More Replies
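
A hedged sketch of one way to trace such a streaming write, assuming df is the streaming dataframe built from the Event Hub/Avro source; checkpoint location and target table are placeholders: attach named observe metrics and inspect the query's progress, which also reports the trigger details per micro-batch:

    from pyspark.sql import functions as F

    observed = df.observe("ingest_metrics", F.count(F.lit(1)).alias("rows_seen"))

    query = (observed.writeStream
             .format("delta")
             .option("checkpointLocation", "/tmp/checkpoints/ingest")   # placeholder
             .trigger(processingTime="1 minute")
             .toTable("bronze.events"))                                 # placeholder

    # after a micro-batch has run, the observed metrics and trigger info show up here
    print(query.lastProgress)
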
yousry
by New Contributor II
  • 4538 Views
  • 2 replies
  • 2 kudos

Resolved! What is the best way to find deltalake version on OSS and Databricks at runtime?

To identify which Delta Lake features are available on a certain installation, it is important to have a robust way to identify the Delta Lake version. For OSS, I found that the Scala snippet below will do the job:

    import io.delta
    println(io.delta.VERSION)

Not...

Latest Reply
shan_chandra
Databricks Employee
  • 2 kudos

@Yousry Mohamed - could you please check the DBR runtime release notes for the Delta Lake API compatibility matrix section (DBR version vs. compatible Delta Lake version) for the mapping? Reference: https://docs.databricks.com/release-notes/runtime/r...

1 More Replies
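
On Databricks the Delta Lake version is tied to the runtime, so a hedged way to get at it programmatically is to read the runtime version at runtime and map it through the compatibility matrix the reply points to; the environment variable below is set on Databricks clusters but not on OSS Spark:

    import os

    dbr_version = os.environ.get("DATABRICKS_RUNTIME_VERSION")
    print(dbr_version)   # e.g. "13.3" -> look up the Delta version in the release notes
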
dshao
by New Contributor II
  • 5846 Views
  • 2 replies
  • 0 kudos

Resolved! Best way to get one row back per ID? Select Distinct is not working.

Here is the current output for my select statement. I would like it to return one row for this jobsubmissionid, where it selects only the non-zero value from each of the rows. I tried using SELECT DISTINCT jobsubmissionid, but it still returned 5 rows.

Latest Reply
UmaMahesh1
Honored Contributor III
  • 0 kudos

Is that the complete query you are using? I'm guessing that you are using select distinct * from table_name. If you want a single column's distinct values, you have to apply a filter condition or aggregate the data accordingly. Anyway, a complete ...

1 More Replies
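
A sketch of the aggregation the reply hints at, assuming the non-zero value should survive per column: group by the id and take the max of the other columns (the value column names are illustrative):

    from pyspark.sql import functions as F

    deduped = (df.groupBy("jobsubmissionid")
                 .agg(F.max("col_a").alias("col_a"),
                      F.max("col_b").alias("col_b")))
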
pjp94
by Contributor
  • 2717 Views
  • 1 replies
  • 0 kudos

ERROR - Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

I get the below error when trying to run multi-threading; it fails towards the end of the run. My guess is that it's related to memory/worker config. I've seen some solutions involving modifying the number of workers or CPUs on the cluster; however, that's n...

Latest Reply
pjp94
Contributor
  • 0 kudos

Since I don't have permissions to change cluster configurations, the only solution that ended up working was setting a max thread count to about half of the actual max so I don't overload the containers. However, open to any other optimization ideas!
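
A minimal sketch of that workaround: cap the thread-pool size instead of letting every task spawn at once; the worker function, the work list and the cap itself are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    MAX_WORKERS = 8   # roughly half of the "natural" maximum, per the reply

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = list(pool.map(process_item, work_items))   # placeholders
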

chandan_a_v
by Valued Contributor
  • 9040 Views
  • 2 replies
  • 4 kudos

Best way to run the Databricks notebook in a parallel way

Hi All, I need to run a Databricks notebook in parallel for different arguments. I tried the threading approach, but only the first 2 threads successfully execute the notebook and the rest fail. Please let me know if there is any best way to...

Latest Reply
Vidula
Honored Contributor
  • 4 kudos

Hey there @Chandan Angadi, does @Hubert Dudek's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

1 More Replies
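
A common pattern for this scenario, sketched with placeholder notebook path, timeout and parameters: drive dbutils.notebook.run from a thread pool so each run gets its own arguments:

    from concurrent.futures import ThreadPoolExecutor

    def run_notebook(arg):
        # path, timeout (seconds) and parameter name are placeholders
        return dbutils.notebook.run("/Shared/my_notebook", 3600, {"param": str(arg)})

    args = ["a", "b", "c", "d"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        outputs = list(pool.map(run_notebook, args))
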
OliverLewis
by New Contributor
  • 2243 Views
  • 2 replies
  • 1 kudos

Parallelize spark jobs on the same cluster?

What's the best way to parallelize multiple Spark jobs on the same cluster during a backfill?

Latest Reply
ron_defreitas
Contributor
  • 1 kudos

In the past I used direct multi-threaded orchestration inside driver applications, but that was prior to Databricks supporting multi-task jobs. If you create a job through the Workflows tab, you can set up multiple notebook, Python, or JAR tasks t...

1 More Replies
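
A hedged sketch of the multi-task job the reply mentions, expressed as a Jobs API 2.1 create payload; the workspace host, token, cluster id, notebook path and parameters are placeholders, and tasks without depends_on run concurrently on the same cluster:

    import requests

    payload = {
        "name": "backfill",
        "tasks": [
            {"task_key": "backfill_2022",
             "existing_cluster_id": "<cluster-id>",
             "notebook_task": {"notebook_path": "/Shared/backfill",
                               "base_parameters": {"year": "2022"}}},
            {"task_key": "backfill_2023",
             "existing_cluster_id": "<cluster-id>",
             "notebook_task": {"notebook_path": "/Shared/backfill",
                               "base_parameters": {"year": "2023"}}},
        ],
    }
    requests.post(f"{host}/api/2.1/jobs/create",
                  headers={"Authorization": f"Bearer {token}"},   # placeholders
                  json=payload)
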
Leladams
by New Contributor III
  • 9950 Views
  • 9 replies
  • 2 kudos

What is the best way to read an MS Access .accdb database into Databricks from a mounted drive?

I am currently trying to read in .accdb files from a mounted drive. Based on my research, it looks like I would have to use a package like JayDeBeApi with UCanAccess drivers, or pyodbc with MS Access drivers. Will this work? Thanks for any help.

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Leland Adams, hope you are doing well. Thank you for posting your question and giving us additional information. Do you think you were able to solve the query? We'd love to hear from you.

8 More Replies
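
A hedged sketch of the JayDeBeApi + UCanAccess route the question mentions; the file path, jar location and table name are placeholders, and the UCanAccess jars are assumed to be available on the cluster:

    import jaydebeapi
    import pandas as pd

    conn = jaydebeapi.connect(
        "net.ucanaccess.jdbc.UcanaccessDriver",
        "jdbc:ucanaccess:///dbfs/mnt/shared/my_db.accdb",      # placeholder path
        [],
        "/dbfs/mnt/jars/ucanaccess-uber.jar")                  # placeholder jar

    # pull the table through JDBC into pandas, then hand it to Spark
    pdf = pd.read_sql("SELECT * FROM my_table", conn)          # placeholder table
    df = spark.createDataFrame(pdf)
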
JBOCACHICA
by New Contributor III
  • 499 Views
  • 0 replies
  • 1 kudos

First time at this event; honestly a very good event. Although this forum has no content in Spanish, the Spanish-speaking community is growing and w...

First time at this event; honestly a very good event. Although this forum has no content in Spanish, the Spanish-speaking community is growing, and we hope to contribute to the development of our countries through technology!
