Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by Riccardo96 (New Contributor II)
  • 2478 Views
  • 3 replies
  • 0 kudos

Dataframe Count before and after write command do not match

Hi, I have noticed strange behaviour in a notebook I am developing. When I use the notebook to read a single file it works correctly, but when I set it to read multiple files at once, using the recursive lookup option, I have noticed...

Latest Reply
Riccardo96
New Contributor II
  • 0 kudos

I just found out I was populating a column with random values; these values are filtered in a join, so at each write and count the numbers change.

2 More Replies
by jeft (New Contributor II)
  • 797 Views
  • 2 replies
  • 0 kudos

Error ingesting data from MongoDB into Databricks

spark = SparkSession.builder \
    .appName("MongoDBToDatabricks") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.4.0") \
    .config("spark.mongodb.read.connection.uri", mongodb_uri) \
    .config("spark.mongodb.write.connection.u...

Latest Reply
Nam_Nguyen
Databricks Employee
  • 0 kudos

Hello @jeft , will you be able to share some screenshots of the driver logs?

1 More Replies
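The truncated snippet above can be reconstructed as a sketch. A common cause of this class of error is that `spark.jars.packages` set inside a Databricks notebook has no effect, because the cluster's SparkSession already exists by the time the notebook runs; installing the connector as a cluster library (Maven coordinate) is the usual fix. The URI, database, and collection names below are placeholders:

```python
# Config for the MongoDB Spark connector (version from the post above).
# The connection URI is a placeholder.
mongo_conf = {
    "spark.jars.packages": "org.mongodb.spark:mongo-spark-connector_2.12:10.4.0",
    "spark.mongodb.read.connection.uri": "mongodb+srv://user:pass@cluster.example.net/",
}

# On a Databricks cluster the read would look roughly like this
# (commented out so the sketch stays self-contained):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()  # already exists on Databricks
# df = (spark.read.format("mongodb")
#       .option("spark.mongodb.read.connection.uri",
#               mongo_conf["spark.mongodb.read.connection.uri"])
#       .option("database", "mydb")        # placeholder
#       .option("collection", "mycoll")    # placeholder
#       .load())

print(sorted(mongo_conf))
```

If the connector is installed as a cluster library, the `spark.jars.packages` entry is not needed at all; only the connection URI options remain.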
by Dom1 (New Contributor III)
  • 5676 Views
  • 5 replies
  • 3 kudos

Show log4j messages in run output

Hi, I have an issue when running JAR jobs. I expect to see logs in the output window of a run. Unfortunately, I can only see messages that are generated with "System.out.println" or "System.err.println". Everything that is logged via slf4j is only ...

Latest Reply
dbal
New Contributor III
  • 3 kudos

Any update on this? I am also facing this issue.

4 More Replies
by Volker (Contributor)
  • 1675 Views
  • 4 replies
  • 0 kudos

Failed job with "A fatal error has been detected by the Java Runtime Environment"

Hi community, I have a question regarding an error that I sometimes get when running a job: # A fatal error has been detected by the Java Runtime Environment: # SIGSEGV (0xb) at pc=0x00007fc941e74996, pid=940, tid=0x00007fc892dff640 # JRE versio...

Latest Reply
Volker
Contributor
  • 0 kudos

In the last run there was additional information in the error message: # A fatal error has been detected by the Java Runtime Environment: # SIGSEGV (0xb) at pc=0x00007f168e094210, pid=1002, tid=0x00007f15dd1ff640 # JRE version: OpenJDK Run...

3 More Replies
by LasseL (New Contributor III)
  • 6144 Views
  • 6 replies
  • 3 kudos

Resolved! The best practice to remove old data from DLT pipeline created tables

Hi, I didn't find any "reasonable" way to clean old data from DLT pipeline tables. In DLT we have used materialized views and streaming tables (SCD1, append only). What is the best way to delete old data from the tables (storage size increases linearly...

Latest Reply
TinasheChinyati
New Contributor III
  • 3 kudos

@LasseL 1. Enable Change Data Capture (CDC): Enable CDC before deleting data to ensure Delta tables track inserts, updates, and deletes. This allows downstream pipelines to handle deletions correctly. ALTER TABLE your_table SET TBLPROPERTIES ('delta.e...

5 More Replies
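The truncated reply above can be sketched end to end. The table name and the 365-day cut-off are placeholders, and the property being spelled out is presumably `delta.enableChangeDataFeed`; the statements here only build the SQL strings, with the actual `spark.sql` calls left commented:

```python
# Hypothetical table name; retention window of 365 days is an example.
table = "catalog.schema.your_table"

# 1. Enable Change Data Feed so downstream consumers can see the deletes.
enable_cdf = (
    f"ALTER TABLE {table} "
    "SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')"
)

# 2. Delete rows older than the retention window.
delete_old = (
    f"DELETE FROM {table} "
    "WHERE event_date < date_sub(current_date(), 365)"
)

# On a Databricks cluster:
# spark.sql(enable_cdf)
# spark.sql(delete_old)
print(enable_cdf)
print(delete_old)
```

Note that for DLT streaming tables, downstream streams reading the table may need `skipChangeCommits` (or a full refresh) to tolerate the deletes; deleted files are only physically removed by a later VACUUM.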
by Thor (New Contributor III)
  • 1174 Views
  • 1 reply
  • 1 kudos

Resolved! Asynchronous progress tracking with foreachbatch

Hello, currently the docs say that async progress tracking is available only for the Kafka sink: https://docs.databricks.com/en/structured-streaming/async-progress-checking.html I would like to know if it would work for any sink that is "exactly once"? I exp...

Latest Reply
cgrant
Databricks Employee
  • 1 kudos

Asynchronous progress tracking is a feature designed for ultra low latency use cases. You can read more in the open source SPIP doc here, but the expected gain in time is in the hundreds of milliseconds, which seems insignificant when doing merge ope...

by Krishna2110 (New Contributor II)
  • 630 Views
  • 1 reply
  • 0 kudos

Catalog Sample Data is not visible with all purpose cluster

Hi all, I need some help. Even though I have cluster access and can run a notebook attached to the cluster, when I go into the Catalog to view the sample data I get an error. Here is the error (@ipriyanksingh, FYR). Can anyone please help us...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @Krishna2110, based on the error, are you using a token? Ensure that the access token is valid and has not expired. Is your workspace Unity Catalog enabled, and what are your cluster settings for browsing the data?

by menonshiji (New Contributor)
  • 2184 Views
  • 1 reply
  • 0 kudos

#HelpPost for Azure Blob to Databricks connection.

Hi, there is a set of .csv/.txt files in a storage container, i.e. Azure Blob Storage / Azure Data Lake Storage Gen2. I would like to ingest the files into Databricks. Datasets and linked services were created on both ends. Also an all-purpose cluster was created in Bric...

Latest Reply
cgrant
Databricks Employee
  • 0 kudos

These errors occur when you are not authenticated / properly authorized to access the storage account. Ensure that you've set proper storage credential configurations, and that those credentials have proper access. Documentation here.

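As a sketch of the "storage credential configurations" the reply refers to, here are the Spark configs for accessing ADLS Gen2 with a service principal (OAuth). All names are placeholders, and on Unity Catalog workspaces, storage credentials and external locations are generally preferred over session configs:

```python
# Service-principal (OAuth) configs for ADLS Gen2; every name is a placeholder.
storage_account = "mystorageaccount"
suffix = f"{storage_account}.dfs.core.windows.net"

configs = {
    f"fs.azure.account.auth.type.{suffix}": "OAuth",
    f"fs.azure.account.oauth.provider.type.{suffix}":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    f"fs.azure.account.oauth2.client.id.{suffix}": "<application-id>",
    # On Databricks: dbutils.secrets.get("scope", "key") instead of a literal
    f"fs.azure.account.oauth2.client.secret.{suffix}": "<client-secret>",
    f"fs.azure.account.oauth2.client.endpoint.{suffix}":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# On a cluster:
# for k, v in configs.items():
#     spark.conf.set(k, v)
# df = spark.read.csv(f"abfss://container@{suffix}/path/to/files")
print(len(configs))
```

The service principal also needs an RBAC role on the storage account (typically Storage Blob Data Reader/Contributor); without it, the same "not authorized" errors appear even with valid credentials.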
by jeremy98 (Honored Contributor)
  • 1337 Views
  • 2 replies
  • 0 kudos

Start another workflow after a job run of the same workflow completes

Hello community, I'm using DABs and I want to know if it is possible to configure in the YAML file a logic that lets me run a workflow only after the previous job run of the same workflow has finished. Is it possible to do it? Do I need to create a task that che...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @jeremy98, Yes, it is possible to configure a YAML file to run a workflow only if the previous job run of the same workflow has finished. You can achieve this by defining dependencies between tasks within the workflow. You can specify task depe...

1 More Replies
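The suggestion above can be sketched as a DABs job definition. All names and paths are placeholders; `depends_on` orders tasks within a single run, while `max_concurrent_runs: 1` combined with queueing is what keeps a new run waiting until the previous run of the same job has finished:

```yaml
# Sketch of a Databricks Asset Bundles job; resource, task, and notebook
# names are placeholders.
resources:
  jobs:
    my_workflow:
      name: my_workflow
      max_concurrent_runs: 1   # never run two instances at once
      queue:
        enabled: true          # queue new runs instead of skipping them
      tasks:
        - task_key: first_task
          notebook_task:
            notebook_path: ./notebooks/first.py
        - task_key: second_task
          depends_on:
            - task_key: first_task   # runs only after first_task succeeds
          notebook_task:
            notebook_path: ./notebooks/second.py
```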
by ctiwari7 (New Contributor II)
  • 1367 Views
  • 2 replies
  • 0 kudos

Databricks workflow job

Hi team, I am trying to execute a workflow job which takes a parameter as a unique identifier. I am using this job parameter to push down to tasks. I was hoping there is a way to use the Python uuid4() function to generate a unique id every tim...

Latest Reply
Stefan-Koch
Valued Contributor II
  • 0 kudos

Hi ctiwari7, a possible way to do that: create a Python file which generates the UUID and then pass it to jobs.taskValues. This is described here: https://docs.databricks.com/en/jobs/task-values.html As a test, I created a Python file with the follo...

1 More Replies
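The approach in the reply above can be sketched in a few lines: a first task generates the UUID and publishes it via task values for downstream tasks. The task and key names are placeholders, and the `dbutils` calls only exist inside a Databricks job, so they are commented here:

```python
# Generate a unique id once per job run and share it across tasks.
import uuid

run_id = str(uuid.uuid4())  # e.g. "3f2b8c1a-..." unique per run

# Inside a Databricks job task (task key "generate_id" is a placeholder):
# dbutils.jobs.taskValues.set(key="run_id", value=run_id)
#
# ...and in a downstream task:
# run_id = dbutils.jobs.taskValues.get(taskKey="generate_id", key="run_id")

print(run_id)
```

This keeps the id generation in one place, so every task in the run sees the same value rather than each task generating its own.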
by ctiwari7 (New Contributor II)
  • 2061 Views
  • 2 replies
  • 1 kudos

get job run link based on the job name or the submit body

This is the current code (ignore indentation) that I am using; it takes the list of all running jobs and then filters the list to get the run id of the matching job name. I want to know if there is a better way to optimise this. Legacy d...

Latest Reply
ctiwari7
New Contributor II
  • 1 kudos

Even the REST API provides the job details based on the job id, which I would need to get from the job_name that I have. This seems like the only possible solution, since job_id is the true identifier of any workflow job, considering we can have mu...

1 More Replies
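One way to avoid listing all runs, assuming the Jobs 2.1 API: `GET /api/2.1/jobs/list` accepts a `name` filter, so the job_id can be resolved directly from the job name. Host, token, and job name below are placeholders, and the actual HTTP call is commented so the sketch stays self-contained:

```python
# Resolve a job_id from a job name via the Jobs 2.1 list endpoint.
from urllib.parse import urlencode

def jobs_list_url(host: str, job_name: str) -> str:
    """Build the URL for GET /api/2.1/jobs/list filtered by job name."""
    return f"{host}/api/2.1/jobs/list?{urlencode({'name': job_name})}"

url = jobs_list_url("https://adb-1234567890123456.7.azuredatabricks.net",
                    "nightly_etl")

# import requests
# resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
# job_id = resp.json()["jobs"][0]["job_id"]

print(url)
```

Once the job_id is known, each run object returned by `/api/2.1/jobs/runs/list` carries a ready-made `run_page_url` field, so the run link does not need to be assembled by hand.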
by Isa1 (New Contributor III)
  • 2271 Views
  • 6 replies
  • 3 kudos

Resolved! Moving existing Delta Live Table to Asset Bundle

Hi! I am creating an Asset Bundle which also includes my streaming Delta Live Tables pipelines. I want to move these DLT pipelines into the Asset Bundle without having to run my DLT streaming pipeline on all historical files (this takes a lot of comput...

Latest Reply
Walter_C
Databricks Employee
  • 3 kudos

When you change the path to the notebook or the name of the pipeline in your Delta Live Table (DLT) pipeline, it can indeed cause issues. Specifically, changing the path to the notebook or the name of the pipeline can lead to the recreation of the pi...

5 More Replies
by shadowinc (New Contributor III)
  • 1115 Views
  • 1 reply
  • 2 kudos

Delete Partition Folders

Hello team, as Databricks moved away from hive-style partitioning, we can see some 2-letter partition folders created. And I have observed that VACUUM doesn't delete these folders (even though they are empty). Is there any way to delete those usi...

Labels: Data Engineering, delta, vacuum
Latest Reply
Alberto_Umana
Databricks Employee
  • 2 kudos

Hello @shadowinc, VACUUM is used to clean up unused and stale data files that are no longer referenced by a Delta table and are older than a specified retention period (default is 7 days). It does not remove empty directories. I think manual cleanup ...

by Hubert-Dudek (Databricks MVP)
  • 19357 Views
  • 6 replies
  • 19 kudos

Resolved! Optimize and Vacuum - which is the best order of operations?

Optimize -> Vacuum, or Vacuum -> Optimize?

Latest Reply
shadowinc
New Contributor III
  • 19 kudos

What about REORG TABLE? https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-reorg-table Does it help or make sense to add REORG, then Optimize -> Vacuum, every week? Reorganize a Delta Lake table by rewriting files to purge ...

5 More Replies
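The order discussed in this thread can be sketched as a weekly maintenance sequence; the table name is a placeholder. REORG and OPTIMIZE rewrite data files first, and VACUUM then removes the files they superseded. The statements are only built as strings here, with the `spark.sql` calls commented:

```python
# Weekly Delta maintenance, in the order discussed above (placeholder table).
table = "catalog.schema.events"

maintenance = [
    f"REORG TABLE {table} APPLY (PURGE)",  # optional: rewrite files to purge soft-deleted data
    f"OPTIMIZE {table}",                   # compact small files (optionally ZORDER BY ...)
    f"VACUUM {table}",                     # drop files no longer referenced by the table
]

# On a Databricks cluster:
# for stmt in maintenance:
#     spark.sql(stmt)
for stmt in maintenance:
    print(stmt)
```

Running VACUUM last matters because the files that OPTIMIZE (or REORG) replaces only become unreferenced, and therefore deletable, after the rewrite, subject to the retention period.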
by ayush19 (New Contributor III)
  • 942 Views
  • 2 replies
  • 0 kudos

Running a jar on Databricks shared cluster using Airflow

Hello, I have a requirement to run a jar already installed on a Databricks cluster. It needs to be orchestrated using Apache Airflow. I followed the docs for the operator which can be used to do so: https://airflow.apache.org/docs/apache-airflow-provid...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @ayush19, here are some suggestions, but I would need to check how the parameters are configured. Use an existing cluster: instead of creating a new cluster each time, configure the DatabricksSubmitRunOperator to use an existing cluster. This can...

1 More Replies
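The "use an existing cluster" suggestion can be sketched as a submit-run payload. The cluster id, jar path, and class name are placeholders, and the operator itself (which needs `apache-airflow-providers-databricks` installed) is commented so the payload sketch stays self-contained:

```python
# Runs-submit payload pointing a jar task at an already-running cluster.
submit_run_json = {
    "run_name": "jar-via-airflow",
    "tasks": [
        {
            "task_key": "run_jar",
            "existing_cluster_id": "0123-456789-abcdefgh",  # placeholder
            "spark_jar_task": {"main_class_name": "com.example.Main"},
            "libraries": [{"jar": "dbfs:/FileStore/jars/my_job.jar"}],
        }
    ],
}

# In an Airflow DAG:
# from airflow.providers.databricks.operators.databricks import (
#     DatabricksSubmitRunOperator,
# )
# run_jar = DatabricksSubmitRunOperator(
#     task_id="run_jar",
#     databricks_conn_id="databricks_default",
#     json=submit_run_json,
# )
print(submit_run_json["tasks"][0]["task_key"])
```

Reusing an existing all-purpose cluster avoids the per-run cluster startup time, at the cost of sharing the cluster's resources with other workloads.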
