Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Databricks Employee
  • 3427 Views
  • 1 reply
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The off-heap memory is managed outside the executor JVM. Spark has native support for using off-heap memory; it is managed by Spark and not controlled by the executor JVM, so GC cycles on the executor do not clean up off-heap memory. Databr...
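As a minimal illustrative sketch (not from the thread): both configs below are standard Spark settings, and the 2g size is an arbitrary assumption.

from pyspark.sql import SparkSession

# Build a session with Spark-managed off-heap memory enabled; the size
# is an illustrative assumption, not a recommendation.
spark = (
    SparkSession.builder
    .appName("offheap-demo")
    .config("spark.memory.offHeap.enabled", "true")  # memory managed by Spark, outside the JVM heap
    .config("spark.memory.offHeap.size", "2g")       # must be set when off-heap is enabled
    .getOrCreate()
)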

brickster_2018
by Databricks Employee
  • 1068 Views
  • 1 reply
  • 1 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 1 kudos

At a high level, a VACUUM operation on a Delta table has two steps: 1) identifying the stale files based on the VACUUM command triggered, and 2) deleting the files identified in step 1. Step 1 is performed by triggering a Spark job and hence utilizes the resource o...
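For illustration, a hedged example of the two steps; the table name and retention window are placeholders.

# DRY RUN performs only step 1: a Spark job scans for files older than
# the retention window and lists what would be deleted.
spark.sql("VACUUM my_db.my_delta_table RETAIN 168 HOURS DRY RUN")

# Without DRY RUN, step 2 follows and the identified files are deleted.
spark.sql("VACUUM my_db.my_delta_table RETAIN 168 HOURS")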

User16826994223
by Honored Contributor III
  • 1110 Views
  • 1 reply
  • 0 kudos

Even an unfinished experiment in MLflow is getting saved as finished

When I start an experiment with mlflow.start_run(), even if my script is interrupted or fails before executing mlflow.end_run(), the run gets tagged as finished instead of unfinished. Can anyone help explain why this is happening?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

In a notebook, MLflow tags the run as the commands execute; once a command fails or exits, it logs and finishes the run right there, even if the notebook fails. However, if you want to continue logging metrics or artifacts to that run, you just need to use...
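The truncated part presumably refers to reopening the run by its ID; a hedged sketch, where the run_id value is a placeholder:

import mlflow

# Reopen an existing run by its run_id instead of starting a new one;
# "<existing-run-id>" is a placeholder for the interrupted run's ID.
with mlflow.start_run(run_id="<existing-run-id>"):
    mlflow.log_metric("accuracy", 0.9)  # continues logging to the same run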

brickster_2018
by Databricks Employee
  • 1092 Views
  • 1 reply
  • 0 kudos

Resolved! Why is my streaming job not resuming even though I specified a checkpoint directory?

I have provided the checkpointLocation as below; however, I see the config is ignored for my streaming query:

option("checkpointLocation", "path/to/checkpoint/dir")

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

This is a common question from many users. If the streaming checkpoint directory is not specified correctly, this behavior is expected. Below is an example of specifying the checkpoint correctly:

df.writeStream
  .format("parquet")
  .option("checkpo...
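A complete PySpark version of that pattern, with placeholder paths (df is assumed to be a streaming DataFrame):

# The checkpoint location must be set on the writer, and each query
# needs its own checkpoint directory; both paths are placeholders.
query = (
    df.writeStream
      .format("parquet")
      .option("checkpointLocation", "/path/to/checkpoint/dir")
      .option("path", "/path/to/output/dir")
      .start()
)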

brickster_2018
by Databricks Employee
  • 1126 Views
  • 1 reply
  • 0 kudos

Resolved! Is there any way to control the autoOptimize interval?

I can see my streaming jobs running OPTIMIZE frequently. Is there any property I can use to control the autoOptimize interval?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

autoOptimize is not performed on a time basis; it is an event-based trigger. Once the Delta table/partition has 50 files (the default value of spark.databricks.delta.autoCompact.minNumFiles), auto-compaction is triggered. To reduce the frequency, inc...
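For example, a hedged one-liner raising the threshold (200 is an arbitrary illustrative value):

# Raise the file-count threshold so auto-compaction triggers less often;
# 50 is the default.
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "200")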

User16826994223
by Honored Contributor III
  • 2730 Views
  • 1 reply
  • 0 kudos

How to change the time zone in a notebook?

How to change the time zone in a notebook?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

import java.util.TimeZone

spark.conf.set("spark.sql.session.timeZone", "Asia/Calcutta")
TimeZone.setDefault(TimeZone.getTimeZone("Asia/Calcutta"))

Scala:

import java.time
val s: String = time.LocalDateTime.now().toString
println(s)

SQL:

%sql
select current_t...

brickster_2018
by Databricks Employee
  • 1595 Views
  • 1 reply
  • 0 kudos

Resolved! How can I update the DBR versions of all my jobs in one go?

I make it a point to use the latest DBR versions for my workloads, and we mostly leverage the new features. But I have 300 jobs in my Databricks workspace, and updating the DBR version for each job manually is difficult. Any quick hack?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The code snippet below can be helpful if you are using the Databricks CLI:

for jobid in `databricks jobs list | awk '{print $1}'`; do
  databricks jobs get --job-id $jobid | jq .settings > /tmp/jobs/$jobid.json
done
sed -i 's/"spark_version": ".*"/"spark_ver...
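As an alternative sketch (not from the thread), the same bulk update can be done from Python against the Jobs API 2.0; the host, token, and target version are placeholder assumptions, and it assumes single-task jobs whose settings carry a top-level new_cluster.

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder PAT
NEW_VERSION = "10.4.x-scala2.12"                         # placeholder target DBR version
headers = {"Authorization": f"Bearer {TOKEN}"}

# List every job, then reset each job's settings with the new spark_version.
jobs = requests.get(f"{HOST}/api/2.0/jobs/list", headers=headers).json().get("jobs", [])
for job in jobs:
    settings = job["settings"]
    cluster = settings.get("new_cluster")
    if cluster and cluster.get("spark_version") != NEW_VERSION:
        cluster["spark_version"] = NEW_VERSION
        requests.post(
            f"{HOST}/api/2.0/jobs/reset",
            headers=headers,
            json={"job_id": job["job_id"], "new_settings": settings},
        )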

User16790091296
by Contributor II
  • 7090 Views
  • 1 reply
  • 0 kudos

How to add a new datetime column to a Spark DataFrame from an existing timestamp column

I have a DataFrame in Spark that has a timestamp column. I want to add a new column to this DataFrame, created from the existing timestamp column, with the datetime in the format "YYYY-MM-DD HH:MM:SS".

Latest Reply
Srikanth_Gupta_
Valued Contributor
  • 0 kudos

val df = Seq(("2021-11-05 02:46:47.154410"), ("2019-10-05 2:46:47.154410")).toDF("old_column")
display(df)

import org.apache.spark.sql.functions._
val df2 = df.withColumn("new_column", from_unixtime(unix_timestamp(col("old_column"), "yyyy-MM-dd HH:mm:ss....
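A PySpark sketch of the same idea, using a cleaned-up sample timestamp; the output format string is an assumption based on the question's "YYYY-MM-DD HH:MM:SS" requirement.

from pyspark.sql import functions as F

df = spark.createDataFrame([("2021-11-05 02:46:47.154410",)], ["old_column"])

# Cast the string to a timestamp, then render it in the requested layout.
df2 = df.withColumn(
    "new_column",
    F.date_format(F.col("old_column").cast("timestamp"), "yyyy-MM-dd HH:mm:ss"),
)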

brickster_2018
by Databricks Employee
  • 1332 Views
  • 1 reply
  • 0 kudos

Resolved! What is the trade-off of using an unsupported DBR version on my cluster?

I do not want to upgrade my cluster every month. I am looking for stability over new features.

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The strong recommendation is not to use an unsupported DBR version on your cluster. For production workloads where you don't want to take newer versions, check the Databricks LTS DBR versions. If you use an unsupported version, you don't receiv...

brickster_2018
by Databricks Employee
  • 1691 Views
  • 1 reply
  • 0 kudos

Resolved! Getting file permission issues even though I have the right IAM role attached

I am reading data from S3 on a Databricks cluster, and the read operation occasionally fails with 403 permission errors. Restarting the cluster fixes the issue.

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The main reason for this behavior is that AWS keys are used in addition to the IAM role. Using global init scripts to set the AWS keys can cause this behavior: the IAM role has the required permission to access the S3 data, but AWS keys are set in the Sp...
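A hedged diagnostic sketch (not from the thread) for spotting stray keys; the property names are the common fs.s3a ones and may differ in your setup.

import os

# Look for AWS keys in the Hadoop configuration and the environment that
# would take precedence over the instance profile / IAM role.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ["fs.s3a.access.key", "fs.s3a.secret.key"]:
    if hconf.get(key):
        print(f"{key} is set and will override the IAM role")
for var in ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]:
    if os.environ.get(var):
        print(f"{var} is set in the environment")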

brickster_2018
by Databricks Employee
  • 1645 Views
  • 1 reply
  • 0 kudos

Resolved! Why do I see data loss with Structured Streaming jobs?

I have a Spark Structured Streaming job reading data from Kafka and loading it to a Delta table. I have some transformations and aggregations on the streaming data before writing to the Delta table.

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The typical reason for data loss in a Structured Streaming application is an incorrect value set for watermarking. Watermarking is done to ensure the application does not build up state for a long period. However, it should be ensured ...
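A hedged sketch of the watermark trade-off; the broker, topic, and the 10-minute delay are placeholder assumptions.

from pyspark.sql import functions as F

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
)

# Too small a delay silently drops late records (perceived data loss);
# too large a delay lets state grow for a long period.
counts = (
    events.withWatermark("timestamp", "10 minutes")
          .groupBy(F.window(F.col("timestamp"), "5 minutes"))
          .count()
)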

brickster_2018
by Databricks Employee
  • 997 Views
  • 1 reply
  • 0 kudos

Resolved! Does Table ACL support column-level security like Ranger?

I have used Ranger in Apache Hadoop and it works fine for my use case. Now I am migrating my workloads from Apache Hadoop to Databricks and want to know whether Table ACLs can enforce the same column-level security.

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

Currently, Table ACL does not support column-level security. There are several tools, such as Privacera, that have better integration with Databricks.

User16752240150
by New Contributor II
  • 5567 Views
  • 1 reply
  • 0 kudos

When to use cache vs checkpoint?

I've seen .cache() and .checkpoint() used similarly in some workflows I've come across. What's the difference, and when should I use one over the other?

Latest Reply
Srikanth_Gupta_
Valued Contributor
  • 0 kudos

Caching is more useful than checkpointing when you have a lot of available memory to store your RDDs or DataFrames, even if they are massive. Caching will maintain the result of your transformations so that those transformations will not have to be recomp...
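A small sketch contrasting the two; the checkpoint path is an illustrative placeholder.

# cache() keeps the computed result in memory and retains lineage, so lost
# partitions can be recomputed; checkpoint() writes to reliable storage
# and truncates lineage.
df = spark.range(0, 10_000_000).filter("id % 2 = 0")

cached = df.cache()
cached.count()  # an action materializes the cache

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # required before checkpoint()
checkpointed = df.checkpoint()  # eager by default: computes and saves immediately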

