Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi Databricks/Spark experts! I have a piece of pandas-based third-party code that I need to execute as part of a bigger Spark pipeline. By nature, pandas-based code is executed on the driver node. I ran into out-of-memory problems and started exploring th...
Hi @wojciech_jakubo, 1. JVM memory will not be utilized for Python-related activities. 2. In the image we can see only the storage memory. We also have execution memory, which would be the same size. Hence I came up with the executor memory to be of ...
Hello, I am somewhat new to Databricks and am trying to build a Q&A application based on a collection of documents. I need to move .pdf and .docx files from my local machine to storage in Databricks and eventually a document store. My questions are: Wh...
Hi all, I took an initial stab at task one with some success using the Databricks CLI. Here are the steps: Open a Command/Anaconda prompt and enter: `pip install databricks-cli`. Go to your Databricks console and under settings find "User Settings" and...
Hi, can anybody answer this question I posted on StackOverflow? https://stackoverflow.com/questions/73314048/databricks-how-to-exit-the-entire-job-in-the-notebooks-orchestration-scenario
@Vidula Khanna We are experiencing the same issue in our Workflows and I was wondering if there has been any update. We need the functionality to call a method similar to `dbutils.notebook.exit` in a notebook that will cancel the exec...
Hi @Govardhana Reddy, Hope everything is going great. Does @Suteja Kanuri's answer help? If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. Cheers!
Hi guys, I have a question about upsert/merge ... What do you do when the source row does NOT exist, but you need to change the status in the target? For example: 01/03: source dataset [id = 1 and status = Active]; target table [*does not exist*] >> at this point the ...
Hello @William Scardua, Just adding to what @Vigneshraja Palaniraj replied. Reference: https://docs.databricks.com/sql/language-manual/delta-merge-into.html Thanks & Regards, Nandini
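The upsert question above can be illustrated with a small plain-Python simulation. This is only a sketch of the merge semantics being asked about, not Delta Lake code: the function name `merge_with_missing_source` and the `"Inactive"` status value are hypothetical, since the original post is truncated before it says what the new status should be. Rows present in the source are inserted or updated, and rows that exist only in the target have their status changed.

```python
def merge_with_missing_source(target, source, missing_status="Inactive"):
    """Simulate an upsert keyed by id.

    target and source are dicts mapping id -> row dict.
    missing_status is a hypothetical value chosen for illustration.
    """
    # Start from a copy of the target so the inputs are not mutated.
    merged = {key: dict(row) for key, row in target.items()}

    # Matched rows are updated, unmatched source rows are inserted.
    for key, row in source.items():
        merged[key] = dict(row)

    # Rows that exist in the target but not in the source get a new status.
    for key in target:
        if key not in source:
            merged[key]["status"] = missing_status

    return merged


# First load: the source has id 1 and the target is still empty.
first_load = merge_with_missing_source({}, {1: {"id": 1, "status": "Active"}})

# Later load: id 1 no longer appears in the source.
later_load = merge_with_missing_source(first_load, {})
```

In Databricks SQL, the same idea is usually expressed inside `MERGE INTO` itself; the linked reference is the place to check which clauses (for example `WHEN NOT MATCHED BY SOURCE`) are available on a given runtime.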
It is the practice exam for Data Engineer Associate. The question is: A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external ...
Not an answer, just asking the Databricks folks to clarify: I would also like to understand this. If there is no event emitted from the external Parquet table (push), and no active pulling or refreshing from the Delta table side (pull), how is the un...
Dear Experts, Can anyone please let me know how option "C" is the answer to Question 31 for PracticeExam-DataEngineerAssociate. https://files.training.databricks.com/assessments/practice-exams/PracticeExam-DataEngineerAssociate.pdf?_ga=2.185796329.11...
Question 17 is even worse. "A data engineer is overwriting data" vs "should simply be overwritten instead". One situation, I assume, is DROP and CREATE and the other is INSERT OVERWRITE, but here both are called the same. A data engineer is overwriting ...
Say I have a job with 10 parallel tasks. I had to cancel one of the tasks to fix something and I am unable to restart just that task. Is this by design? Should I restart the job in this case? Q2) If one of the tasks fails, will it auto recover just tha...
Hi @Jin Kim, Please enable "Task Orchestration in Jobs" in your Admin Console, and then you can add as many tasks to your job as you need. You can also specify the dependencies of each task.
I'm taking the Data Engineering with Databricks course and I have a question. If I run cmd4, I get an error. Course URL: https://customer-academy.databricks.com/learn/course/62/play/4290/providing-options-for-external-sources;lp=10 Chapter: DE 4....
I have a customer with the following question. I'm posting on their behalf to introduce them to the community. For doing modeling in a Python environment, what is our best practice for getting the data from Redshift? A "load" option seems to leave me...
(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question): Setting: Delta Lake, Databricks SQL compute used by Power BI. I am wondering about the following scenario: We have a column `timest...
In the query, I would filter first by date (derived from the timestamp we want to query) and then by the exact timestamp, so the query can take advantage of the partitioning.
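The suggestion in the reply above can be sketched as a small Python helper that builds both predicates from a single timestamp. It assumes the table is partitioned by a date column derived from the timestamp column; the names `event_date` and `event_ts` and the function `partition_pruned_filter` are hypothetical, because the original column name is truncated in the question.

```python
from datetime import datetime


def partition_pruned_filter(ts, date_col="event_date", ts_col="event_ts"):
    """Build a WHERE clause that filters on the date partition first,
    then on the exact timestamp.

    date_col and ts_col are hypothetical column names for illustration.
    """
    # Date derived from the timestamp, matching the partition column.
    day = ts.date().isoformat()
    # Exact timestamp, formatted as a SQL timestamp literal.
    stamp = ts.strftime("%Y-%m-%d %H:%M:%S")
    return f"{date_col} = DATE '{day}' AND {ts_col} = TIMESTAMP '{stamp}'"


where_clause = partition_pruned_filter(datetime(2023, 1, 5, 10, 30, 0))
print(where_clause)
```

The date predicate lets the engine skip partitions for other days before the exact timestamp comparison is evaluated. Whether this is needed depends on how the table is actually partitioned, which the truncated question does not show.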
Hi @afshin riahi, Yes, I can definitely help you with it. Please wait while I or someone from the community gets back to you with a response. Thank you for your patience.
I have a Bronze -> Silver -> Gold architecture for my ETL pipelines and all tables are Delta. I'm trying to understand what updates flow downstream when I make changes to the source table. Most importantly, if I run optimize on the source, does every...