Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

yliu
by New Contributor III
  • 14748 Views
  • 2 replies
  • 1 kudos

Z-ordering optimization with multithreading

Hi, I am wondering if multithreading will help with the performance of z-ordering optimization on multiple delta tables. We are periodically running optimization on thousands of tables, and it easily takes a few days to finish the job. So we are looking...
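
A minimal sketch of the multithreading idea, assuming this runs in a Databricks notebook where spark is available (table names, z-order column, and thread count are all illustrative):

    from concurrent.futures import ThreadPoolExecutor

    tables = ["db.events", "db.orders"]  # illustrative table names

    def optimize(table):
        # each call submits an independent OPTIMIZE job through the shared SparkSession
        spark.sql(f"OPTIMIZE {table} ZORDER BY (id)")  # illustrative z-order column

    # threads overlap job submission and scheduling; the cluster still bounds total throughput
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(optimize, tables))

Whether this helps depends on whether the cluster has idle capacity while a single OPTIMIZE runs; if one job already saturates the workers, parallel submission mostly reorders the queue.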

Eeg
by New Contributor III
  • 23835 Views
  • 4 replies
  • 5 kudos

Pyflakes errors when using %run

I am using the %run command to import shared resources for each of my processes, because it was the easiest way to import my common libraries. However, done that way, pyflakes can't resolve the dependencies very well, and I end up working in code with ma...

Latest Reply
btafur
Databricks Employee
  • 5 kudos

You could use something like flake8 and customize the rules in the .flake8 file, or ignore specific lines with # noqa. https://flake8.pycqa.org/en/latest/user/configuration.html
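
For example, a line-level suppression might look like this (module name hypothetical):

    from shared_helpers import *  # noqa: F401,F403  (silences unused-import/star-import warnings)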

turagittech
by Contributor
  • 7569 Views
  • 0 replies
  • 0 kudos

Pandas 2.x availability

Hi All, I am wondering if Pandas 2.x will be available soon, or whether it is an available option to install. I have a small job I built to manipulate some strings from a database table which technically did the job, but doesn't scale with older versions of pan...
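
Until the runtime ships it, a notebook-scoped install is one way to try a newer pandas; a minimal sketch (the version pin is illustrative, and the install may restart the Python process):

    %pip install "pandas>=2.0"   # notebook magic; run in its own cell

    # in a following cell, confirm the notebook sees the upgraded version
    import pandas as pd
    print(pd.__version__)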

melodiesd
by New Contributor
  • 8667 Views
  • 0 replies
  • 0 kudos

Parse_Syntax_Error Help

Hello all, I'm new to Databricks and can't figure out why I'm getting an error in my SQL code. Error in SQL statement: ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near 'if'.(line 1, pos 0) == SQL == if OBJECT_ID('tempdb.#InitialData') IS N...
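
Spark SQL does not accept T-SQL control flow such as if OBJECT_ID(...); a rough Spark-flavored equivalent of the existence check, sketched with a hypothetical table name:

    # IF EXISTS replaces the T-SQL OBJECT_ID probe
    spark.sql("DROP TABLE IF EXISTS initial_data")
    spark.sql("CREATE TABLE initial_data AS SELECT 1 AS id")  # stand-in for the real query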

pygreg
by New Contributor
  • 3762 Views
  • 1 reply
  • 1 kudos

Resolved! Workflows: pass parameters to a "run job" task

Hi folks! I would like to know if there is a way to pass parameters to a "run job" task. For example: let's have a Job A with: a notebook task A.1 that takes as input a parameter year-month in the format yyyymm; a "run job" task A.2 that calls a Job B. I wou...

Latest Reply
btafur
Databricks Employee
  • 1 kudos

This feature will be available soon as part of Job Parameters. Right now it is not possible to easily pass parameters to a child job.
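
As an interim workaround, one hedged option is to have task A.2 call the child job's entry notebook directly so the parameter can be passed (path and parameter name are hypothetical):

    # runs the child notebook inline, forwarding the year-month value
    result = dbutils.notebook.run("/Repos/etl/job_b_entry", 3600, {"year_month": "202309"})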

peterwishart
by New Contributor III
  • 7226 Views
  • 4 replies
  • 0 kudos

Resolved! Programmatically updating the “run_as_user_name” parameter for jobs

I am trying to write a process that will programmatically update the "run_as_user_name" parameter for all jobs in an Azure Databricks workspace, using PowerShell to interact with the Jobs API. I have been trying to do this with a test job without suc...

Latest Reply
baubleglue
New Contributor II
  • 0 kudos

The solution you've submitted addresses a different topic (permission to run a job; the job still runs as the user in the run_as_user_name field). Here is an example of changing "run_as_user_name". Docs: https://docs.databricks.com/api/azure/workspace/job...
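
A minimal sketch of that idea in Python against the Jobs 2.1 update endpoint (host, job_id, and user name are placeholders; verify the exact payload shape against the docs above):

    import os
    import requests

    host = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder workspace URL
    resp = requests.post(
        f"{host}/api/2.1/jobs/update",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        json={"job_id": 123, "new_settings": {"run_as": {"user_name": "etl-owner@example.com"}}},
    )
    resp.raise_for_status()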

Hubert-Dudek
by Databricks MVP
  • 3193 Views
  • 1 reply
  • 0 kudos

Spark Configuration Parameter for Cluster Downscaling

spark.databricks.aggressiveWindowDownS This parameter determines the frequency, in seconds, at which the cluster decides to downscale. By adjusting this setting, you can fine-tune how rapidly clusters release workers. A higher value will...
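
Because it governs cluster behavior, the setting belongs in the cluster's Spark config rather than in notebook code; a sketch of the relevant fragment of a Clusters API payload (the 600-second value is illustrative):

    # fragment of a clusters create/edit request body
    cluster_spec = {
        "spark_conf": {
            "spark.databricks.aggressiveWindowDownS": "600",  # seconds, as a string
        }
    }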

Latest Reply
Haiyangl104
New Contributor III
  • 0 kudos

I wish there were a configuration to toggle upscaling behavior. I want the clusters to scale up only if the bottleneck is approaching 70% memory usage. Currently the autoscaling is based only on CPU, not memory (RAM).

SaraCorralLou
by New Contributor III
  • 2181 Views
  • 1 reply
  • 0 kudos

Clear driver memory during notebook execution

Is there any way to clear the driver's memory during the execution of my notebook? I have several functions that are executed on the driver and that generate different dataframes on it that are not needed (these dataframes are created just to do som...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

Since Spark uses lazy execution, those dataframes you do not need are never materialized unless you actually use them (why define them otherwise?). So when an action runs, Spark will execute all the code that is necessary. If you run into memory issues, you can do...
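
A minimal sketch of the usual cleanup steps on the driver, with df standing in for one of the intermediate dataframes:

    import gc

    df.unpersist()  # drop any cached/persisted partitions for this DataFrame
    del df          # release the Python reference on the driver
    gc.collect()    # nudge the garbage collector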

alesventus
by Contributor
  • 1870 Views
  • 0 replies
  • 0 kudos

Performance issue: Running 50 notebooks from ADF

I have a process in Data Factory that loads CDC changes from SQL Server and then triggers a notebook with a merge into the bronze and silver zones. A single notebook takes about 1 minute to run, but when all 50 notebooks are fired at once the whole process takes 25 ...

Data Engineering
performance issue
Greg
by New Contributor III
  • 2465 Views
  • 1 reply
  • 4 kudos

How to reduce storage space consumed by delta with many updates

I have one delta table that I continuously append events into, and a second delta table that I continuously merge into (streamed from the first table) that has unique IDs whose properties are updated from the events (an ID represents a unique thing that ge...
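
The usual levers for this are tightening the table's retention properties and vacuuming old file versions; a hedged sketch (table name and retention values are illustrative, and shorter retention reduces the time-travel window):

    spark.sql("""
        ALTER TABLE ids SET TBLPROPERTIES (
            'delta.deletedFileRetentionDuration' = 'interval 7 days',
            'delta.logRetentionDuration' = 'interval 7 days'
        )
    """)
    spark.sql("VACUUM ids RETAIN 168 HOURS")  # 7 days, matching the retention above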

Latest Reply
Jb11
New Contributor II
  • 4 kudos

Did you already solve this problem?

bfridley
by New Contributor II
  • 4530 Views
  • 2 replies
  • 0 kudos

DLT Pipeline Out Of Memory Errors

I have a DLT pipeline that has been running for weeks. Now, trying to rerun the pipeline with the same code and same data fails. I've even tried updating the compute on the cluster to about 3x of what was previously working and it still fails with ou...

(Attached screenshots: bfridley_1-1695328329708.png, bfridley_2-1695328372419.png)
Latest Reply
rajib_bahar_ptg
New Contributor III
  • 0 kudos

I'd focus on understanding the codebase first. It'll help you decide what logic or data asset to keep or not keep when you try to optimize it. If you share the architecture of the application, the problem it solves, and some sample code here, it'll h...

gkrilis
by New Contributor
  • 9386 Views
  • 1 reply
  • 0 kudos

How to stop SparkSession within notebook without error

I want to run an ETL job, and when the job ends I would like to stop the SparkSession to free my cluster's resources; by doing this I could avoid restarting the cluster. But when calling spark.stop(), the job returns with status failed even though it has f...

Data Engineering
cluster
SparkSession
Latest Reply
PremadasV
New Contributor II
  • 0 kudos

Please refer to this article: Job fails, but Apache Spark tasks finish - Databricks
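
If the goal is simply to end the run cleanly rather than tear down the shared SparkSession, a hedged alternative is to exit the notebook instead:

    # ends this notebook run with a success status; leaves Spark running for the cluster
    dbutils.notebook.exit("ETL finished")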

Gilg
by Contributor II
  • 1392 Views
  • 0 replies
  • 0 kudos

Add data manually to DLT

Hi Team, is there a way that we can add data manually to the tables that are generated by DLT? We have done a PoC using DLT for Sep 15 to current data. Now that they are happy, they want the previous data from Synapse put into Databricks. I can e...

