Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi! I'm starting to test configs on Databricks, for example, to avoid corrupting data if two processes try to write at the same time: .config('spark.databricks.delta.multiClusterWrites.enabled', 'false'). Or if I need more partitions than the default: .confi...
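For anyone reading along, here is a minimal sketch of applying such session configs; the shuffle-partitions option is an illustrative guess at what the truncated question was about, and the value 400 is arbitrary.

```python
# A minimal sketch of setting these configs on a SparkSession.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # make concurrent multi-cluster writers to the same Delta table fail fast
    .config("spark.databricks.delta.multiClusterWrites.enabled", "false")
    # raise the shuffle partition count above the default of 200 (illustrative)
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)

# the same settings can also be changed on a live session:
spark.conf.set("spark.sql.shuffle.partitions", "400")
```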
Hey there @Alejandro Martinez, hope everything is going well. Just wanted to see if you were able to find an answer to your question. If yes, would you be happy to let us know and mark it as best so that other members can find the solution more quickly?
I created some ETL using DataFrames in Python. It used to run in ~180 sec, but it is now taking ~1200 sec. I have been changing it, so it could be something I introduced, or something in the environment. Part of the process is appending results into...
I am having a very similar problem. Since yesterday, for no known reason, some commands that used to run daily are now stuck in a "Running command" state. Commands like: dataframe.show(n=1), dataframe.toPandas(), dataframe.description(), dataframe.wr...
Greetings! I've been trying out DLT for a few days, but I'm running into an unexpected issue when trying to use Koalas dropna in my pipeline. My goal is to drop all columns that contain only null/NA values before writing. Current code is this: @dlt...
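Since the original code is cut off, here is a hedged sketch of one way to drop all-null columns inside a DLT table function using plain PySpark aggregation instead of Koalas; the table names are hypothetical.

```python
# A sketch of dropping all-null columns in a DLT pipeline without Koalas;
# "raw_source" and "cleaned" are hypothetical table names.
import dlt
from pyspark.sql import functions as F

@dlt.table(name="cleaned")
def cleaned():
    df = dlt.read("raw_source")  # hypothetical upstream DLT table
    # count non-null values per column in a single pass
    counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first()
    # keep only columns with at least one non-null value
    keep = [c for c in df.columns if counts[c] > 0]
    return df.select(*keep)
```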
Hi there, I'm using these two APIs to execute SQL statements and read the output back when it's finished. However, it seems it always returns only 1000 rows even though I need all the results (millions of rows). Is there a solution for this? execute SQL: htt...
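Assuming this refers to the SQL Statement Execution API, results come back in chunks that can be paged through; for very large results the docs also describe an EXTERNAL_LINKS disposition. A hedged sketch of paging inline chunks, with placeholder host, token, and statement id:

```python
# A hedged sketch of paging through all result chunks of a statement executed
# via /api/2.0/sql/statements; host, token, and statement_id are placeholders,
# and this assumes the default INLINE disposition.
import requests

HOST = "https://<workspace-host>"              # placeholder
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder
statement_id = "<statement-id>"                # placeholder

rows, chunk_index = [], 0
while chunk_index is not None:
    resp = requests.get(
        f"{HOST}/api/2.0/sql/statements/{statement_id}/result/chunks/{chunk_index}",
        headers=HEADERS,
    ).json()
    rows.extend(resp.get("data_array", []))
    # next_chunk_index is absent on the last chunk, ending the loop
    chunk_index = resp.get("next_chunk_index")

print(f"fetched {len(rows)} rows")
```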
Code example:

```python
# a list of file paths
list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]
# copy all files above to this folder
dest_path = "/dbfs/mnt/..."
for file_path in list_files_path:
    # copy function
    copy_file(file_path, dest_path)
```

I am runni...
I am trying to check whether a certain datapoint exists in multiple locations. This is what my table looks like: (table posted as an image). I am checking whether the same datapoint is in two locations. The idea is that this datapoint should exist in BOTH locations, and be counted...
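Since the table itself was posted as an image, here is a hedged sketch of one way to express "exists in BOTH locations" in PySpark; the column names and toy values are assumptions.

```python
# A hedged sketch: count datapoints that appear in both of two locations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# toy stand-in for the table that was posted as an image
df = spark.createDataFrame(
    [("p1", "A"), ("p1", "B"), ("p2", "A")],
    ["datapoint", "location"],
)

both = (
    df.groupBy("datapoint")
      .agg(F.countDistinct("location").alias("n_locations"))
      .filter(F.col("n_locations") == 2)  # keep points seen in both locations
)
both.count()  # -> 1 (only p1 appears in both)
```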
I have a large Delta table that I would like to back up, and I am wondering what the best practice for backing it up is. The goal is that if there is any accidental corruption or data loss, either at the Azure Blob Storage level or within Databricks...
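One commonly documented option for this is an incremental DEEP CLONE into separate storage; a hedged sketch, with hypothetical table names and backup path:

```python
# A hedged sketch of backing up a Delta table with DEEP CLONE; the table names
# and backup location are hypothetical. Re-running the same statement
# refreshes the clone incrementally.
spark.sql("""
    CREATE OR REPLACE TABLE backup_db.my_table_backup
    DEEP CLONE prod_db.my_table
    LOCATION 'abfss://backups@<storage-account>.dfs.core.windows.net/my_table'
""")
```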
Hi @deisou, just wanted to check in to see if you were able to resolve your issue. If yes, would you be happy to mark the answer as best? If not, please tell us so we can help you. Cheers!
I've seen the Databricks documentation on time series here. I'm using forecasts as a feature, and those forecasts have both an as-of timestamp (when the forecast was generated) and a time step label (a timestamp indicating the time of the forecasted observation)...
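For context, a point-in-time ("as-of") lookup like this is typically resolved by taking, for each observation, the most recent forecast generated at or before it. A hedged sketch with hypothetical table and column names:

```python
# A hedged sketch of an as-of join: for each observation time, keep the latest
# forecast whose as_of timestamp is not later than it. All names are toy.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
forecasts = spark.createDataFrame(
    [("s1", "2023-01-01", 10.0), ("s1", "2023-01-02", 12.0)],
    ["series_id", "as_of", "forecast"],
)
observations = spark.createDataFrame(
    [("s1", "2023-01-03", 11.5)], ["series_id", "obs_time", "actual"]
)

# keep only forecasts generated at or before each observation's time,
# then take the latest one per (series, observation)
w = Window.partitionBy("series_id", "obs_time").orderBy(F.col("as_of").desc())
point_in_time = (
    observations.join(forecasts, "series_id")
    .filter(F.col("as_of") <= F.col("obs_time"))
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
point_in_time.show()  # the 2023-01-02 forecast wins (latest as-of)
```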
We have use cases that require multiple versions of the same datasets to be available. For example, we have a knowledge graph made of entities and relations, and we have multiple versions of the knowledge graph that are distinguished by schema names ri...
Hey there @Kyle Gao, hope you are doing well. Thank you for posting your query. Just wanted to check in to see if you were able to resolve your issue, or do you need more help? We'd love to hear from you. Cheers!
The error is as below, and it is intermittent. E.g., the same code throws the issue below on run 3, doesn't throw it on run 4, then throws it again on run 5. An error occurred while calling o1509.getCause. Trace: py4j.security.Py4JSecur...
Hi All, we are facing an unusual issue while loading data into a Delta table using Spark SQL. We have one Delta table which has around 135 columns and is also PARTITIONED BY. Into this we are trying to load about 15 million rows, but it's not loading ...
@Kaniz Fatma @Parker Temple I found the root cause: it's because of serialization. We are using a UDF to derive a column on the dataframe; when we try to load data into the Delta table or write data into a Parquet file, we hit the serialization issue.
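For readers hitting the same thing: Python UDFs force every row to round-trip between the JVM and Python workers, and where possible a built-in column expression avoids that entirely. A hedged sketch with a hypothetical derived column:

```python
# A hedged sketch of swapping a Python UDF for a native column expression to
# avoid JVM<->Python serialization; the column and threshold are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(50,), (150,)], ["amount"])  # toy data

# UDF version: every row is serialized to a Python worker
label_udf = F.udf(lambda amt: "high" if amt > 100 else "low", StringType())
df_udf = df.withColumn("label", label_udf("amount"))

# native version: evaluated entirely inside the JVM, no serialization
df_native = df.withColumn(
    "label", F.when(F.col("amount") > 100, "high").otherwise("low")
)
```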
This happens while creating a temp view using the code block below: latest_data.createOrReplaceGlobalTempView("e_test"). Ideally this command should replace the view if e_test already exists; instead it is throwing "Recursive view `global_temp`.`e_test` detected...
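One commonly suggested workaround is to drop the existing global temp view before recreating it, so the new definition can never reference the old one; a hedged sketch, using a stand-in for the post's latest_data DataFrame:

```python
# A hedged sketch of the drop-then-recreate workaround for the recursive
# global temp view error.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
latest_data = spark.range(5)  # stand-in for the DataFrame in the post

# drop any existing view first; returns False if it didn't exist
spark.catalog.dropGlobalTempView("e_test")
latest_data.createOrReplaceGlobalTempView("e_test")
```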
I have a task to transform a dataframe: collect all the columns of a row and embed them into a JSON string as a column. (Source DF and target DF were shown as images in the original post.)
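A hedged sketch of one standard way to do this, packing every column into a struct and serializing it with to_json; the toy columns are assumptions, since the source and target frames were images.

```python
# A hedged sketch: embed all columns of each row as a JSON string column.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])  # toy data

# struct(*df.columns) packs the whole row; to_json serializes it
with_json = df.withColumn("json", F.to_json(F.struct(*df.columns)))
with_json.show(truncate=False)  # json column holds e.g. {"id":1,"val":"a"}
```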
How can I pass parameters from Data Factory to a Databricks Job that uses a notebook? I know how to pass parameters from Data Factory to Databricks notebooks when ADF calls the notebook directly.
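Whichever path invokes it, the notebook side usually reads the incoming values the same way, through widgets; a hedged sketch with a hypothetical parameter name:

```python
# A hedged sketch of the notebook side: parameters passed by the caller (e.g.
# ADF base parameters or job parameters) surface as widgets. The parameter
# name "run_date" is hypothetical.
dbutils.widgets.text("run_date", "")        # declares the widget with a default
run_date = dbutils.widgets.get("run_date")  # value supplied by the caller, if any
print(f"run_date = {run_date}")
```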