Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

jeremy98
by Honored Contributor
  • 1276 Views
  • 2 replies
  • 0 kudos

Start another workflow after a job run of the same workflow completes

Hello community, I'm using DABs and I want to know if it is possible to configure, in the YAML file, logic that allows me to run a workflow only once the previous job run of the same workflow has finished. Is it possible to do it? Do I need to create a task that che...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @jeremy98, Yes, it is possible to configure a YAML file to run a workflow only if the previous job run of the same workflow has finished. You can achieve this by defining dependencies between tasks within the workflow. You can specify task depe...

1 More Replies
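A sketch of what the reply describes: task-level `depends_on` serializes tasks inside one run, while the job-level `queue` and `max_concurrent_runs` settings make a new run wait for the previous run of the same workflow. All names and paths below are placeholders, and the queueing fields are assumptions based on the Jobs API schema rather than something quoted from the thread:

```yaml
resources:
  jobs:
    my_workflow:
      name: my_workflow
      # Ensure a new run only starts after the previous run of this
      # same job has finished (assumed Jobs API fields):
      max_concurrent_runs: 1
      queue:
        enabled: true
      tasks:
        - task_key: first_task
          notebook_task:
            notebook_path: ../src/first_step.ipynb
        - task_key: second_task
          # Runs only after first_task completes:
          depends_on:
            - task_key: first_task
          notebook_task:
            notebook_path: ../src/second_step.ipynb
```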
ctiwari7
by New Contributor II
  • 1331 Views
  • 2 replies
  • 0 kudos

Databricks workflow job

Hi team, I am trying to execute a workflow job which takes in a parameter as a unique identifier. I am using this job parameter to push down to tasks. I was hoping there is a way for me to use Python's uuid4() function to generate a unique ID every tim...

Latest Reply
Stefan-Koch
Valued Contributor II
  • 0 kudos

Hi ctiwari7, a possible way to do that: create a Python file which generates the UUID and then passes it to jobs.taskValues. This is described here: https://docs.databricks.com/en/jobs/task-values.html As a test, I created a Python file with the follo...

1 More Replies
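The approach in the reply can be sketched like this. `dbutils` only exists on a Databricks cluster, so the task-values calls are shown commented out, and the task key used there is hypothetical:

```python
import uuid

def generate_run_id() -> str:
    """Generate a unique identifier for a single job run."""
    return str(uuid.uuid4())

run_id = generate_run_id()
print(run_id)  # e.g. 'b3f1c2d4-...'

# On Databricks, a first task would publish the value for downstream tasks:
#   dbutils.jobs.taskValues.set(key="run_id", value=run_id)
# and a downstream task would read it back with:
#   dbutils.jobs.taskValues.get(taskKey="generate_id", key="run_id")
```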
ctiwari7
by New Contributor II
  • 1993 Views
  • 2 replies
  • 1 kudos

get job run link based on the job name or the submit body

This is the current code (ignore indentation) that I am using, which takes the list of all running jobs and then filters the list to get the run ID of the matching job name. I want to know if there is any better way to optimise this. Legacy d...

Latest Reply
ctiwari7
New Contributor II
  • 1 kudos

Even the REST API provides the job details based on the job ID, which I would need to get from the job_name that I have. This seems like the only possible solution, since job_id is the true identifier of any workflow job, considering we can have mu...

1 More Replies
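As a sketch of the name-to-id lookup, this filters the response shape returned by the Jobs API `GET /api/2.1/jobs/list` endpoint; the sample data is made up:

```python
def find_job_ids(jobs: list, job_name: str) -> list:
    """Return every job_id whose settings.name matches job_name.

    `jobs` is the "jobs" array from a /api/2.1/jobs/list response.
    A list is returned because job names are not guaranteed unique;
    job_id is the only true identifier.
    """
    return [j["job_id"] for j in jobs
            if j.get("settings", {}).get("name") == job_name]

sample = [
    {"job_id": 101, "settings": {"name": "nightly_load"}},
    {"job_id": 102, "settings": {"name": "adhoc_report"}},
    {"job_id": 103, "settings": {"name": "nightly_load"}},
]
print(find_job_ids(sample, "nightly_load"))  # [101, 103]
```

Note that the list endpoint also accepts a `name` query parameter, which pushes this filtering to the server; worth verifying against the current Jobs API docs for your workspace version.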
Isa1
by New Contributor III
  • 2165 Views
  • 6 replies
  • 3 kudos

Resolved! Moving existing Delta Live Table to Asset Bundle

Hi! I am creating an Asset Bundle, which also includes my streaming Delta Live Table pipelines. I want to move these DLT pipelines to the Asset Bundle without having to run my DLT streaming pipeline on all historical files (this takes a lot of comput...

Latest Reply
Walter_C
Databricks Employee
  • 3 kudos

When you change the path to the notebook or the name of the pipeline in your Delta Live Table (DLT) pipeline, it can indeed cause issues. Specifically, changing the path to the notebook or the name of the pipeline can lead to the recreation of the pi...

5 More Replies
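One way to read the reply: when adopting a bundle, keep the pipeline name and notebook path identical to the existing pipeline so the deploy updates it in place instead of recreating it (and losing streaming state). A hypothetical resource fragment, with placeholder names and paths:

```yaml
resources:
  pipelines:
    my_streaming_pipeline:
      # Keep the same name the existing pipeline already has; changing
      # it (or the notebook path) is what can trigger recreation:
      name: my_streaming_pipeline
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb
      continuous: false
```

Separately, the CLI has a `databricks bundle deployment bind` command for attaching an existing pipeline to a bundle resource so the first deploy adopts it rather than creating a new one; check the current bundle docs for the exact invocation.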
shadowinc
by New Contributor III
  • 1080 Views
  • 1 reply
  • 2 kudos

Delete Partition Folders

Hello team, as Databricks moved away from Hive-style partitioning, we can see some two-letter partition folders created. And I have observed that VACUUM doesn't delete these folders (even though they are empty). Is there any way to delete those usi...

Data Engineering
delta
vacuum
Latest Reply
Alberto_Umana
Databricks Employee
  • 2 kudos

Hello @shadowinc, VACUUM is used to clean up unused and stale data files that are no longer referenced by a Delta table and are older than a specified retention period (default is 7 days). It does not remove empty directories. I think manual cleanup ...

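Since VACUUM leaves empty directories behind, one manual-cleanup sketch is a recursive scan for empty directories. `list_children` here is a stand-in for `dbutils.fs.ls` on Databricks, so the logic can be shown against a plain callable; the directory layout below is made up:

```python
def find_empty_dirs(root: str, list_children) -> list:
    """Return directories under `root` with no files or subdirectories.

    `list_children(path)` must return child paths, with directories
    ending in "/" (the convention dbutils.fs.ls uses for the .path
    field of directories).
    """
    empty = []
    for child in list_children(root):
        if child.endswith("/"):
            if not list_children(child):
                empty.append(child)
            else:
                empty.extend(find_empty_dirs(child, list_children))
    return empty

# Fake listing for illustration:
tree = {
    "/table/": ["/table/ab/", "/table/cd/", "/table/part-0.parquet"],
    "/table/ab/": [],
    "/table/cd/": ["/table/cd/part-1.parquet"],
}
print(find_empty_dirs("/table/", tree.get))  # ['/table/ab/']
```

On Databricks you would adapt the callable to wrap `dbutils.fs.ls` and then remove each empty directory with `dbutils.fs.rm(path, recurse=True)`, after double-checking nothing still references them.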
Hubert-Dudek
by Esteemed Contributor III
  • 19060 Views
  • 6 replies
  • 19 kudos

Resolved! Optimize and Vacuum - which is the best order of operations?

Optimize -> Vacuum, or Vacuum -> Optimize?

Latest Reply
shadowinc
New Contributor III
  • 19 kudos

What about REORG on a Delta table? https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-reorg-table Does it help or make sense to add REORG, then Optimize -> Vacuum, every week? Reorganize a Delta Lake table by rewriting files to purge ...

5 More Replies
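A plausible weekly maintenance sequence matching the order discussed; the table name is a placeholder, and the REORG step is only needed when there is soft-deleted data to physically purge (e.g. after deletion vectors or dropped columns):

```sql
-- 1. Rewrite files to physically purge soft-deleted data (optional)
REORG TABLE my_schema.my_table APPLY (PURGE);

-- 2. Compact small files (and re-cluster, if clustering is configured)
OPTIMIZE my_schema.my_table;

-- 3. Finally, remove the files left unreferenced by the steps above
VACUUM my_schema.my_table RETAIN 168 HOURS;
```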
ayush19
by New Contributor III
  • 907 Views
  • 2 replies
  • 0 kudos

Running a jar on Databricks shared cluster using Airflow

Hello, I have a requirement to run a jar already installed on a Databricks cluster. It needs to be orchestrated using Apache Airflow. I followed the docs for the operator which can be used to do so: https://airflow.apache.org/docs/apache-airflow-provid...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @ayush19, Here are some suggestions, but I would need to check how the parameters are configured. Use an existing cluster: instead of creating a new cluster each time, configure the DatabricksSubmitRunOperator to use an existing cluster. This can...

1 More Replies
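A sketch of the "existing cluster" suggestion: the payload below follows the Jobs `runs/submit` shape that `DatabricksSubmitRunOperator(json=...)` forwards to the API. The cluster ID, class name, and parameters are placeholders, and note that jar tasks have restrictions on shared-access-mode clusters that are worth checking in the docs:

```python
# Hypothetical runs/submit payload for DatabricksSubmitRunOperator(json=payload).
payload = {
    "run_name": "run-installed-jar",
    "tasks": [
        {
            "task_key": "run_jar",
            "existing_cluster_id": "1234-567890-abcde123",  # placeholder
            "spark_jar_task": {
                "main_class_name": "com.example.Main",  # placeholder
                "parameters": ["--date", "2024-01-01"],
            },
        }
    ],
}

# In the DAG (requires apache-airflow-providers-databricks) you would write:
# DatabricksSubmitRunOperator(task_id="run_jar",
#                             databricks_conn_id="databricks_default",
#                             json=payload)
```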
mjedy7
by New Contributor II
  • 992 Views
  • 1 reply
  • 0 kudos

Reading two big tables within each forEachBatch processing method

I am reading changes from the cdf with availableOnce=True, processing data from checkpoint to checkpoint. During each batch, I perform transformations, but I also need to read two large tables and one small table. Does Spark read these tables from sc...

Latest Reply
radothede
Valued Contributor II
  • 0 kudos

Hi @mjedy7, for caching in this scenario you could try to leverage persist() and unpersist() for the big table / Spark DataFrame, see here: https://medium.com/@eloutmadiabderrahim/persist-vs-unpersist-in-spark-485694f72452 Try to reduce the amount of da...

hk-modi
by New Contributor
  • 816 Views
  • 1 reply
  • 0 kudos

Switching to autoloader

I have an S3 bucket that has continuous data being written into it. My script reads these files, parses them, and then appends them into a Delta table. The data goes back to 2022, with millions of files stored using partitions based on year/month/day. O...

Latest Reply
radothede
Valued Contributor II
  • 0 kudos

Hi @hk-modi, if I understand correctly, you have an existing Delta table with tons of data already processed. You want to switch to Auto Loader: read files, parse them, and process data incrementally into that Delta table as a sink. The task is to start p...

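One way to sketch the switch: point Auto Loader at the bucket but tell it to skip the backlog the old script already processed. The `cloudFiles.*` option names are real Auto Loader options, but the format and paths below are placeholders:

```python
# Hypothetical Auto Loader configuration; on Databricks this dict would be
# passed to spark.readStream via .options(**autoloader_options).
autoloader_options = {
    "cloudFiles.format": "json",  # placeholder format
    # Skip files already present when the stream starts, since the
    # existing Delta table already holds the historical 2022+ data:
    "cloudFiles.includeExistingFiles": "false",
    "cloudFiles.schemaLocation": "s3://bucket/_schemas/",  # placeholder path
}

# On a cluster you would then write something like:
# (spark.readStream.format("cloudFiles")
#      .options(**autoloader_options)
#      .load("s3://bucket/data/")
#      .writeStream.option("checkpointLocation", "s3://bucket/_chk/")
#      .toTable("catalog.schema.target"))
```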
lauraxyz
by Contributor
  • 1414 Views
  • 3 replies
  • 1 kudos

Resolved! Rendering Volumes file content programmatically

Hi there! I have some files stored in a Volume, and I have a use case where I need to show the file content in a UI. Say I have a REST API that already knows the Volume path to the file; is there any built-in feature from Databricks that I can use to he...

Latest Reply
cgrant
Databricks Employee
  • 1 kudos

Hi @lauraxyz, the files API should be helpful, particularly the upload endpoint.

2 More Replies
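For serving Volume file content over REST, the Databricks Files API exposes Volume files under `/api/2.0/fs/files`. A small helper to build the download URL; the workspace host and file path below are placeholders:

```python
def files_api_url(host: str, volume_path: str) -> str:
    """Build the Files API URL for a Unity Catalog Volume file.

    `volume_path` must be an absolute path like
    /Volumes/<catalog>/<schema>/<volume>/file.txt
    """
    return f"{host.rstrip('/')}/api/2.0/fs/files{volume_path}"

url = files_api_url("https://my-workspace.cloud.databricks.com",
                    "/Volumes/main/default/docs/report.pdf")
print(url)

# A GET with a bearer token then returns the raw file bytes for the UI:
# requests.get(url, headers={"Authorization": f"Bearer {token}"})
```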
NehaR
by New Contributor III
  • 727 Views
  • 1 reply
  • 3 kudos

Cost estimation before query execution, similar to Google Cloud BigQuery's --dry_run

Hi, in Databricks do we have an option to estimate the cost of a query before execution, similar to BigQuery's --dry_run? Our use case is to estimate cost before execution and get alerted. Regards, Neha

Latest Reply
Alberto_Umana
Databricks Employee
  • 3 kudos

Hello @NehaR, Currently, Databricks does not have a direct equivalent to BigQuery's --dry_run feature for estimating the cost of a query before execution. However, there are some mechanisms and ongoing projects that aim to provide similar functionali...

Shivam_Pawar
by New Contributor III
  • 18998 Views
  • 15 replies
  • 5 kudos

Databricks Lakehouse Fundamentals Badge

I have successfully passed the test after completing the course with 95%. But I haven't received any badge from your side as promised. I have been provided with a certificate, which by itself looks fake. I need to post my credentials on LinkedIn wi...

Latest Reply
heybeckerj
New Contributor II
  • 5 kudos

Any feedback on this please? 

14 More Replies
deecee
by New Contributor II
  • 1272 Views
  • 2 replies
  • 0 kudos

SAS token issue for long running micro-batches

Hi everyone, I'm having an issue with some of our Databricks workloads. We're processing these workloads using the forEachBatch stream processing method. Whenever we perform a full reload on some of our data sources, we get the following error. ...

Data Engineering
azure
Unity Catalog
Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@deecee Can you please confirm there are no external locations or volumes which can lead to this overlap of locations? What do you actually have in "some_catalog.some_schema.some_table" and the "abfss://some-container@somestorageaccount.dfs.core.window...

1 More Replies
kalebkemp
by New Contributor
  • 1084 Views
  • 2 replies
  • 0 kudos

FileReadException error when creating materialized view reading two schemas

Hi all. I'm getting an error `com.databricks.sql.io.FileReadException` when attempting to create a materialized view which reads tables from two different schemas in the same catalog. Is this just a limitation in Databricks, or do I potentially have s...

Latest Reply
agallard
Contributor
  • 0 kudos

Hi @kalebkemp ,The error you're encountering (com.databricks.sql.io.FileReadException) when creating a materialized view that reads from two different schemas in the same catalog might not necessarily be a Databricks limitation. It is more likely rel...

1 More Replies
zmsoft
by Contributor
  • 4920 Views
  • 6 replies
  • 6 kudos

Azure Synapse vs Databricks

Hi there, I would like to know the difference between Azure Databricks and Azure Synapse: which use cases is Databricks appropriate for, and which is Synapse appropriate for? What are the differences in their functions? What are the differences in thei...

Latest Reply
thelogicplus
Contributor II
  • 6 kudos

Share your use case and I will suggest the relevant technology differences and which could be beneficial for you. I love Databricks due to the many awesome features that help everyone from SQL developers to programmers (Python/Scala) solve their use cases on Databricks. But if you ...

5 More Replies
