Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ctiwari7
by New Contributor II
  • 506 Views
  • 2 replies
  • 1 kudos

get job run link based on the job name or the submit body

This is the current code (ignore the indentation) that I am using; it takes the list of all running jobs and then filters that list to get the run ID of the matching job name. I want to know if there is a better way to optimise this. Legacy d...

Latest Reply
ctiwari7
New Contributor II
  • 1 kudos

Even the REST API provides the job details based on the job ID, which I would need to get from the job_name that I have. This seems like the only possible solution, since job_id is the true identifier of any workflow job, considering we can have mu...

1 More Replies
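A hedged sketch of the approach discussed in this thread, using the Databricks SDK for Python instead of filtering every running job client-side (the job name is a placeholder; jobs.list, jobs.list_runs, and run_page_url are the SDK pieces assumed here):

# Sketch: resolve job_id from a job name, then fetch the run page URL of its active run.
# Assumes databricks-sdk is installed and credentials are picked up from the environment.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
job_name = "my_workflow_job"  # placeholder job name

jobs = list(w.jobs.list(name=job_name))  # server-side name filter, no full scan needed
if not jobs:
    raise ValueError(f"No job found with name {job_name}")
job_id = jobs[0].job_id

runs = list(w.jobs.list_runs(job_id=job_id, active_only=True))  # only currently running runs
if runs:
    print(runs[0].run_id, runs[0].run_page_url)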
ctiwari7
by New Contributor II
  • 7 Views
  • 0 replies
  • 0 kudos

Databricks workflow job

Hi team, I am trying to execute a workflow job which takes a parameter as a unique identifier. I am using this job parameter to push down to tasks. I was hoping there is a way for me to use Python's uuid4() function to generate a unique ID every tim...

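One possible pattern for the question above (a sketch under assumptions, since the thread has no reply yet): generate the UUID once in a small first task and share it with downstream tasks via task values. The task and key names here are hypothetical.

# First task of the workflow: generate a unique ID once per run and publish it as a task value.
import uuid

run_uuid = str(uuid.uuid4())
dbutils.jobs.taskValues.set(key="run_uuid", value=run_uuid)  # dbutils is available in Databricks notebooks

# In a downstream task, read it back (taskKey is the name of the task that set the value):
# run_uuid = dbutils.jobs.taskValues.get(taskKey="generate_uuid", key="run_uuid", default=None)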
Thor
by New Contributor III
  • 5 Views
  • 0 replies
  • 0 kudos

Asynchronous progress tracking with foreachBatch

Hello, currently the doc says that async progress tracking is available only for the Kafka sink: https://docs.databricks.com/en/structured-streaming/async-progress-checking.html I would like to know if it would work for any sink that is "exactly once"? I exp...

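For reference, a sketch of how the option is enabled today on the supported Kafka sink per the linked doc (servers, topic, path, and interval are placeholders; assumes df is an existing streaming DataFrame). Whether the same flag behaves correctly with foreachBatch or other exactly-once sinks is exactly the open question above.

# Enabling async progress tracking on a Kafka sink, per the linked documentation.
query = (
    df.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "events")
      .option("checkpointLocation", "/Volumes/main/default/chk/events")
      .option("asyncProgressTrackingEnabled", "true")
      .option("asyncProgressTrackingCheckpointIntervalMs", "5000")
      .start()
)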
Frustrated_DE
by New Contributor III
  • 17 Views
  • 2 replies
  • 1 kudos

Data comparison

Hi, are there any tools within Databricks for large-volume data comparisons? I appreciate there are methods for DataFrame comparisons for unit testing (assertDataFrameEqual), but it is my understanding these are for testing transformations on smallish...

Latest Reply
Frustrated_DE
New Contributor III
  • 1 kudos

Thanks Szymon, I will give these a try!

1 More Replies
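For very large tables, one common Spark-native approach (a sketch; the table names are placeholders) is a symmetric difference with exceptAll rather than an assertion helper:

# Anything returned by either exceptAll is a row that does not match between the two tables.
source = spark.table("main.bronze.source_snapshot")   # placeholder names
target = spark.table("main.silver.target_snapshot")

only_in_source = source.exceptAll(target)
only_in_target = target.exceptAll(source)

print(only_in_source.count(), only_in_target.count())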
Isa1
by Visitor
  • 55 Views
  • 6 replies
  • 3 kudos

Resolved! Moving existing Delta Live Table to Asset Bundle

Hi! I am creating an Asset Bundle, which also includes my streaming Delta Live Table pipelines. I want to move these DLT pipelines to the Asset Bundle without having to run my DLT streaming pipeline on all historical files (this takes a lot of comput...

Latest Reply
Walter_C
Databricks Employee
  • 3 kudos

When you change the path to the notebook or the name of your Delta Live Tables (DLT) pipeline, it can indeed cause issues. Specifically, either change can lead to the recreation of the pi...

5 More Replies
shadowinc
by New Contributor III
  • 24 Views
  • 1 reply
  • 2 kudos

Delete Partition Folders

Hello team, as Databricks moved away from Hive-style partitioning, we can see some 2-letter partition folders created, and I have observed that VACUUM doesn't delete these folders (even though they are empty). Is there any way to delete those usi...

Labels: Data Engineering, delta, vacuum
Latest Reply
Alberto_Umana
Databricks Employee
  • 2 kudos

Hello @shadowinc, VACUUM is used to clean up unused and stale data files that are no longer referenced by a Delta table and are older than a specified retention period (default is 7 days). It does not remove empty directories. I think manual cleanup ...

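If manual cleanup is the route taken, a minimal sketch under assumptions (the table path is a placeholder; verify the directories are truly unreferenced before deleting anything under a live table):

# List first-level entries under the table location and remove directories that are empty.
table_path = "abfss://container@account.dfs.core.windows.net/tables/my_table"  # placeholder

for entry in dbutils.fs.ls(table_path):
    if entry.isDir() and not dbutils.fs.ls(entry.path):  # directory with no contents
        dbutils.fs.rm(entry.path, recurse=True)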
Hubert-Dudek
by Esteemed Contributor III
  • 10749 Views
  • 6 replies
  • 17 kudos

Resolved! Optimize and Vacuum - which is the best order of operations?

Optimize -> Vacuum, or Vacuum -> Optimize?

Latest Reply
shadowinc
New Contributor III
  • 17 kudos

What about REORG on a Delta table? https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-reorg-table Does it help or make sense to add REORG, then Optimize - Vacuum, every week? Reorganize a Delta Lake table by rewriting files to purge ...

5 More Replies
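A hedged sketch of the weekly maintenance sequence discussed in this thread (the table name is a placeholder; REORG ... APPLY (PURGE) is the syntax from the linked page):

# Compact small files, purge soft-deleted data, then vacuum files no longer referenced.
table = "main.silver.events"  # placeholder table name

spark.sql(f"OPTIMIZE {table}")
spark.sql(f"REORG TABLE {table} APPLY (PURGE)")
spark.sql(f"VACUUM {table}")  # default 7-day retention applies unless overridden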
Kamal2
by New Contributor II
  • 15733 Views
  • 3 replies
  • 4 kudos

Resolved! PDF Parsing in Notebook

I have PDF files stored in Azure ADLS. I want to parse the PDF files into PySpark DataFrames. How can I do that?

Latest Reply
Mykola_Melnyk
  • 4 kudos

Please look at the PDF DataSource for Apache Spark. This project provides a custom data source for Apache Spark that allows you to read PDF files into a Spark DataFrame. Here is a notebook with an example of usage: df = spark.read.format("pdf") \ ...

2 More Replies
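A hedged reconstruction of the truncated snippet above, based on that project's documented usage (the options, output column names, and ADLS path are assumptions, not verified here):

# Requires the third-party PDF data source library to be installed on the cluster.
df = (
    spark.read.format("pdf")
    .option("imageType", "BINARY")   # assumed option from the project's examples
    .option("resolution", "200")     # assumed option from the project's examples
    .load("abfss://container@account.dfs.core.windows.net/pdfs/")  # placeholder path
)
df.select("path", "page_number", "text").show()  # assumed output columns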
Erik
by Valued Contributor III
  • 70 Views
  • 1 reply
  • 2 kudos

Managing streaming checkpoints with unity catalog

This is partly a question, partly a feature request: how do you guys handle streaming checkpoints in combination with Unity Catalog managed tables? It seems like the only way is to create a volume and manually specify paths in it as streaming checkpo...

Latest Reply
michelle653burk
  • 2 kudos

@Erik wrote: This is partly a question, partly a feature request: how do you guys handle streaming checkpoints in combination with Unity Catalog managed tables? It seems like the only way is to create a volume and manually specify paths in it as strea...

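A minimal sketch of the volume-based approach described in this thread (catalog, schema, volume, and table names are placeholders; assumes df is an existing streaming DataFrame):

# Write a stream into a Unity Catalog managed table while keeping the checkpoint in a UC volume.
checkpoint = "/Volumes/main/default/checkpoints/orders_stream"  # placeholder volume path

query = (
    df.writeStream
      .option("checkpointLocation", checkpoint)
      .trigger(availableNow=True)
      .toTable("main.default.orders")  # managed table; placeholder name
)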
ayush19
by New Contributor III
  • 38 Views
  • 2 replies
  • 0 kudos

Running a jar on Databricks shared cluster using Airflow

Hello, I have a requirement to run a JAR already installed on a Databricks cluster, and it needs to be orchestrated using Apache Airflow. I followed the docs for the operator which can be used to do so: https://airflow.apache.org/docs/apache-airflow-provid...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @ayush19, here are some suggestions, but I would need to check how the parameters are configured. Use an existing cluster: instead of creating a new cluster each time, configure the DatabricksSubmitRunOperator to use an existing cluster. This can...

1 More Replies
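A hedged sketch of the existing-cluster suggestion above, using DatabricksSubmitRunOperator from apache-airflow-providers-databricks (the connection ID, cluster ID, and class name are placeholders; whether a JAR task is allowed on a shared-access-mode cluster depends on the cluster configuration):

# Airflow task that submits a spark_jar_task run against an already-running cluster.
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_jar = DatabricksSubmitRunOperator(
    task_id="run_installed_jar",
    databricks_conn_id="databricks_default",        # placeholder connection
    existing_cluster_id="1234-567890-abcde123",     # placeholder cluster ID
    spark_jar_task={
        "main_class_name": "com.example.Main",      # placeholder main class
        "parameters": ["--env", "dev"],
    },
    # libraries=[{"jar": "dbfs:/path/to/app.jar"}], # only if the jar is not already on the cluster
)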
htu
by New Contributor III
  • 4768 Views
  • 8 replies
  • 20 kudos

Installing Databricks Connect breaks pyspark local cluster mode

Hi, it seems that when databricks-connect is installed, pyspark is modified at the same time so that it no longer works with a local master node. Local mode has been especially useful in testing, when running unit tests for Spark-related code without any remot...

Latest Reply
lukany
Visitor
  • 20 kudos

Hi, we are facing this issue as well, i.e. the RuntimeError as reported in this comment. We use the workaround with poetry groups as suggested in this comment. The workaround introduces unnecessary and non-intuitive complexity to dependency management and ...

7 More Replies
NhanNguyen
by Contributor II
  • 42 Views
  • 1 reply
  • 0 kudos

ConcurrentAppendException after enabling Liquid Clustering and row-level concurrency on a Delta table

Every time I run parallel jobs, they fail with this error: ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again. I did a lot of research and also created a liquid clustering table an...

Latest Reply
NhanNguyen
Contributor II
  • 0 kudos

Note: I tried both DBR 13.3.x and 14.3.x, but it still failed with the same error.

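A common mitigation for this kind of conflict (a sketch under assumptions, not a confirmed fix for this thread; the exception import comes from the delta-spark Python package and the table name is a placeholder) is to retry the conflicting append with backoff:

# Retry a write that can hit ConcurrentAppendException when several jobs append at once.
import time
from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, table_name: str, max_attempts: int = 5) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            df.write.format("delta").mode("append").saveAsTable(table_name)
            return
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying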
dixonantony
by New Contributor II
  • 56 Views
  • 3 replies
  • 0 kudos

Not able to create a table from PySpark SQL using Databricks Unity Catalog open APIs

I was trying to access Databricks and do DDL/DML operations using the Databricks Unity Catalog open APIs. Create schema and select on tables are working, but create table is not working due to the issues below; could you please help? I was using PySpark SQL ...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @dixonantony, can you try running this command? spark.sql("create table datatest.dischema.demoTab1(id int, name VARCHAR(10), age int)") Ensure that you have the necessary permissions to create tables in Unity Catalog. You need the CREATE TABLE ...

2 More Replies
isai-ds
by New Contributor
  • 108 Views
  • 1 reply
  • 0 kudos

Salesforce LakeFlow Connect - deletion of Salesforce records

Hello, I am new to Databricks and to data engineering. I am running a POC to sync data between a Salesforce sandbox and Databricks using LakeFlow Connect. I have already made the connection and successfully synced data between Salesforce and Databr...

Latest Reply
cgrant
Databricks Employee
  • 0 kudos

Right now, the Salesforce connector only supports SCD Type 1. Please be on the lookout for SCD Type 2 functionality in the near future.


Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group