Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

rimaissa
by New Contributor III
  • 1636 Views
  • 3 replies
  • 0 kudos

DLT apply_changes not accepting upsert

I have a DLT pipeline that goes bronze -> silver -> gold -> platinum. I need to include a table, joined to the gold layer for platinum, that allows upserts in the DLT pipeline. This table is managed externally via the Databricks API. Anytime a chang...

Latest Reply
Mike_Szklarczyk
Contributor
  • 0 kudos

You obtain this error: "Detected a data update in the source table at version 1. This is currently not supported..." because DLT is based on Structured Streaming, and for Structured Streaming any changes (deletes, updates) in the source table are n...
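For readers hitting the same error, a minimal sketch of one common workaround is to read the externally managed source with skipChangeCommits so that update/delete commits are ignored by the streaming read; the table and function names below are placeholders, not the original poster's pipeline.

import dlt

# Hypothetical illustration: stream from an externally updated Delta table while
# skipping commits that rewrite existing data (updates/deletes), which otherwise
# raise "Detected a data update in the source table".
@dlt.table(name="platinum_join_input")
def platinum_join_input():
    return (
        spark.readStream
        .option("skipChangeCommits", "true")  # ignore update/delete commits in the source
        .table("catalog.schema.externally_managed_table")  # placeholder table name
    )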

2 More Replies
alejandrofm
by Valued Contributor
  • 8509 Views
  • 8 replies
  • 10 kudos

Resolved! Pandas.spark.checkpoint() doesn't break lineage

Hi, I'm doing something simple in a Databricks notebook: spark.sparkContext.setCheckpointDir("/tmp/")   import pyspark.pandas as ps   sql = ("""select field1, field2 From table Where date >= '2021-01-01'""")   df = ps.sql(sql)   df.spark.checkpoint() That...

Latest Reply
annafina
New Contributor II
  • 10 kudos

checkpoint() returns a checkpointed DataFrame, so you need to assign it to a new variable: checkpointedDF = df.spark.checkpoint()
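A short sketch of the fix described above, using the pandas-on-Spark accessor from the original post; the table and column names are placeholders.

import pyspark.pandas as ps

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

psdf = ps.sql("SELECT field1, field2 FROM some_table WHERE date >= '2021-01-01'")

# checkpoint() does not cut the lineage of psdf in place; it returns a new,
# checkpointed DataFrame that must be assigned and used from then on.
checkpointed = psdf.spark.checkpoint()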

7 More Replies
OldManCoder
by New Contributor II
  • 929 Views
  • 1 reply
  • 1 kudos

Resolved! Oracle DB connection works in single-user compute but not shared compute

I can connect to an on-prem Oracle DB using my single-user compute, but when I switch over to a shared compute, I get an invalid username/password error. I can connect to my on-prem SingleStore DB using either compute, so I'm not sure why Oracle would be diffe...

Latest Reply
Walter_C
Databricks Employee
  • 1 kudos

Based on internal research, I found that shared access mode does not currently support the Oracle JDBC connector; it is only supported in Single/Assigned access mode. There is a feature request to include the Oracle connector as part of Lakehouse Federation. Once it is in...
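For context, a minimal sketch of the kind of JDBC read that works on single-user (assigned) compute; the host, service name, schema, secret scope, and driver class are all placeholders, and whether it runs on shared compute is subject to the limitation described above.

# Hypothetical Oracle JDBC read; all connection details are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host.example.com:1521/ORCLPDB1")
    .option("dbtable", "MY_SCHEMA.MY_TABLE")
    .option("user", dbutils.secrets.get("my-scope", "oracle-user"))
    .option("password", dbutils.secrets.get("my-scope", "oracle-password"))
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
)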

nakaxa
by New Contributor
  • 38664 Views
  • 4 replies
  • 0 kudos

Fastest way to write a Spark DataFrame to a Delta table

I read a huge array with several columns into memory, then I convert it into a Spark DataFrame. When I want to write it to a Delta table using the following command, it takes forever (I have a driver with large memory and 32 workers): df_exp.write.m...

Latest Reply
Reiska
New Contributor II
  • 0 kudos

The answers here are not correct. TL;DR: _after_ the Spark DF is materialized, saveAsTable takes ages: 35 seconds for 1 million rows. saveAsTable() is SLOW - terribly so. Why? Would be nice to get an answer. The workaround is to avoid Spark for Delta - no...
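The reply is cut off, but one way to write a Delta table without going through Spark, which appears to be what the workaround refers to, is the deltalake (delta-rs) Python package. This is only a guess at the intended approach; the path and data are placeholders.

# Hypothetical sketch: write a Delta table directly from pandas with delta-rs
# (pip install deltalake), bypassing Spark's saveAsTable entirely.
import pandas as pd
from deltalake import write_deltalake

pdf = pd.DataFrame({"id": range(1_000_000), "value": ["x"] * 1_000_000})
write_deltalake("/tmp/exp_delta_table", pdf, mode="overwrite")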

3 More Replies
JissMathew
by Valued Contributor
  • 3967 Views
  • 12 replies
  • 8 kudos

Resolved! Reading a csv file

While trying to read a CSV file into a DataFrame using the CSV file format, it fails with formatting and column errors while loading the data in Databricks. The code I used: df = spark.read.format("csv") \    .option("header", "true") ...

Latest Reply
Mike_Szklarczyk
Contributor
  • 8 kudos

You can try adding the multiline option: df = ( spark.read.format("csv") .option("header", "true") .option("quote", '"') .option("delimiter", ",") .option("nullValue", "") .option("emptyValue", "NULL") .option("multiline", True) .schema(schem...
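The snippet above is truncated by the preview; an untruncated version of the same pattern might look like the following, where the schema fields and file path are placeholders.

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; replace with the real column definitions.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
])

df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("quote", '"')
    .option("delimiter", ",")
    .option("nullValue", "")
    .option("emptyValue", "NULL")
    .option("multiline", True)  # lets quoted fields span line breaks
    .schema(schema)
    .load("/path/to/file.csv")  # placeholder path
)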

11 More Replies
184754
by New Contributor II
  • 1142 Views
  • 2 replies
  • 2 kudos

Table Trigger - Too many logfiles

Hi, we have implemented a job that runs on a trigger of a table update. The job worked perfectly, until the source table had accumulated too many log files and the job stopped running, showing only the error message below: Storage location /abcd/_d...

Latest Reply
radothede
Valued Contributor II
  • 2 kudos

Hi @184754 Interesting topic. As the docs say: "Log files are deleted automatically and asynchronously after checkpoint operations and are not governed by VACUUM. While the default retention period of log files is 30 days, running VACUUM on a table r...
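Following on from the docs quote above, one hedged option is to shorten the Delta log retention on the triggering table so old JSON commit files are cleaned up at the next checkpoint; the table name and interval below are placeholders.

# Hypothetical example: shorten Delta log retention on the source table.
spark.sql("""
    ALTER TABLE catalog.schema.source_table
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')
""")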

1 More Replies
ameet9257
by Contributor
  • 2014 Views
  • 2 replies
  • 1 kudos

Cloning workflows from one env to a different env using the Jobs API

Hi Team, one of my team members recently shared a requirement: he wants to migrate 10 workflows from the sandbox to the dev environment to run his model in dev. I wanted to move all these workflows in an automated way, and one of the solutions...
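As a rough illustration of the automated route discussed in the thread, the sketch below reads a job's settings from one workspace with the Jobs 2.1 API and re-creates it in another. The hosts, tokens, and job_id are placeholders, and error handling is omitted.

import requests

SRC = {"host": "https://sandbox-workspace.azuredatabricks.net", "token": "<sandbox-pat>"}
DST = {"host": "https://dev-workspace.azuredatabricks.net", "token": "<dev-pat>"}

def clone_job(job_id: int) -> dict:
    # Fetch the job definition from the source workspace.
    src_job = requests.get(
        f"{SRC['host']}/api/2.1/jobs/get",
        headers={"Authorization": f"Bearer {SRC['token']}"},
        params={"job_id": job_id},
    ).json()

    # Re-create the job in the target workspace from the same settings.
    return requests.post(
        f"{DST['host']}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {DST['token']}"},
        json=src_job["settings"],
    ).json()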

Latest Reply
ameet9257
Contributor
  • 1 kudos

@Stefan-Koch Thanks. This looks interesting and I will try this. 

1 More Replies
NehaR
by New Contributor III
  • 824 Views
  • 2 replies
  • 2 kudos

Is there any option in Databricks to estimate the cost of a query before execution?

Hi Team, I want to check if there is any option in Databricks that can help estimate the cost of a query before execution? I mean, calculate DBUs before actual query execution, based on the logical plan? Regards
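The thread does not surface a built-in DBU estimator, but as a partial, hedged illustration of inspecting a plan before running it, Spark can print the optimizer's cost statistics for a query; the table name below is a placeholder.

# EXPLAIN COST prints the logical plan with size/row-count statistics
# (when available) without executing the query.
plan = spark.sql(
    "EXPLAIN COST SELECT field1, COUNT(*) FROM catalog.schema.big_table GROUP BY field1"
)
plan.show(truncate=False)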

Latest Reply
NehaR
New Contributor III
  • 2 kudos

Is there any way to track the progress or ETA? Do we have access to the Ideas Portal? Where can we search for this reference number, DB-I-5730?

1 More Replies
jeremy98
by Honored Contributor
  • 3723 Views
  • 2 replies
  • 2 kudos

Ways to quickly write millions of rows into a new Delta table

Hello everyone, I am facing an issue with writing 100–500 million rows (partitioned by a column) into a newly created Delta table. I have set up a cluster with 256 GB of memory and 64 cores. However, the following code takes a considerable amount of t...

Latest Reply
radothede
Valued Contributor II
  • 2 kudos

Hi @jeremy98 This is what I would suggest to test: 1) remove the repartition step or reduce the number of partitions (start with the number of cores and then try to increase it x2, x3): repartition(num_partitions*4, partition_col). I know repartitioning helps to di...
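A minimal sketch of the first suggestion above, assuming a 64-core cluster and a DataFrame df with a partition_col column (both placeholders, as is the target table name).

num_cores = 64  # start with the number of cores, then try x2, x3 as suggested

(
    df.repartition(num_cores, "partition_col")
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("partition_col")
    .saveAsTable("catalog.schema.new_table")  # placeholder target table
)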

1 More Replies
joeyslaptop
by New Contributor II
  • 899 Views
  • 1 reply
  • 0 kudos

Resolved! How do I use a Databricks SQL query to convert a field value '%' back into a wildcard?

Hi. If I've posted to the wrong area, please let me know. I am using SQL to join two tables. One table has the wildcard '%' stored as text/string/varchar. I need to join the value of TableA.column1 to TableB.column1 based on the wildcard in the str...

Latest Reply
JAHNAVI
Databricks Employee
  • 0 kudos

Hi, could you please try the query below and let me know if it meets your requirements? SELECT * FROM TableA A LEFT JOIN TableB B ON A.Column1 LIKE REPLACE(B.Column1, '%', '%%') REPLACE helps us in treating the '%' stored in TableB.Column1 as a wildcar...
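A small, self-contained toy version of the same idea, with made-up data, so the LIKE-based join can be tried directly; here the stored value is used as the pattern as-is rather than run through REPLACE.

# TableA holds concrete values; TableB stores '%' wildcards as text.
spark.createDataFrame([("widget-123",), ("gadget-456",)], ["column1"]) \
    .createOrReplaceTempView("TableA")
spark.createDataFrame([("widget-%",), ("gizmo-%",)], ["column1"]) \
    .createOrReplaceTempView("TableB")

matched = spark.sql("""
    SELECT A.column1 AS a_value, B.column1 AS b_pattern
    FROM TableA A
    LEFT JOIN TableB B
      ON A.column1 LIKE B.column1
""")
matched.show()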

swetha
by New Contributor III
  • 4444 Views
  • 4 replies
  • 1 kudos

Error: "no streaming listener attached to the spark app" is the error we are observing after accessing the streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: Spark Structured Streaming application. After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit, with 'Content-Type: application/json', http://host:port/...
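For reference, this is roughly the call the post describes; the host, port, and app id are placeholders. Note that the /streaming endpoints of the Spark REST API report on DStream (spark.streaming) listeners, which is a common reason Structured Streaming applications get the "no streaming listener attached" response.

import requests

# Placeholder host/port/app id; on Databricks the driver UI is usually proxied,
# so the exact URL depends on how the REST API is exposed in your workspace.
url = "http://driver-host:4040/api/v1/applications/app-20240101123456-0001/streaming/statistics"
resp = requests.get(url, headers={"Content-Type": "application/json"})
print(resp.status_code)
print(resp.text)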

Latest Reply
INJUSTIC
New Contributor II
  • 1 kudos

Have you found the solution? Thanks

3 More Replies
swetha
by New Contributor III
  • 4485 Views
  • 3 replies
  • 1 kudos

I am unable to attach a streaming listener to a Spark Streaming job. Error: "no streaming listener attached to the spark application" is the error we are observing after accessing the streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit, with 'Content-Type: application/json', http://host:port/api/v1/applications/app-id/streaming/st...

Latest Reply
INJUSTIC
New Contributor II
  • 1 kudos

Have you found the solution? Thanks

2 More Replies
dbuschi
by New Contributor II
  • 1271 Views
  • 2 replies
  • 0 kudos

Resolved! Delta Live Tables: How does it identify new files?

Hi, I'm importing large numbers of parquet files (ca. 5,200 files per day; they each land in a separate folder) into Azure ADLS storage. I have a DLT streaming table reading from the root folder. I noticed a massive spike in storage account costs due to f...

Latest Reply
dbuschi
New Contributor II
  • 0 kudos

To resolve the issue of excessive directory scanning, I have changed the folder structure to separate historical files from current files and reduce the number of folders and files that the Databricks process monitors.
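Alongside restructuring the folders, a hedged sketch of another lever is Auto Loader's file-notification mode, which discovers new files via storage events instead of re-listing every landing folder on each update; the path and table name are placeholders, and notification mode has its own cloud setup prerequisites.

import dlt

@dlt.table(name="bronze_parquet")
def bronze_parquet():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.useNotifications", "true")  # avoid full directory listings
        .load("abfss://landing@mystorageaccount.dfs.core.windows.net/root/")  # placeholder path
    )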

1 More Replies
KuruDev
by New Contributor II
  • 2312 Views
  • 3 replies
  • 0 kudos

Databricks Asset Bundle - Not fully deploying in Azure Pipeline

 Hello Community, I'm encountering a challenging issue with my Azure Pipeline and I'm hoping someone here might have some insights. I'm attempting to deploy a Databricks bundle that includes both notebooks and workflow YAML files. When deploying the ...

Latest Reply
adfo
New Contributor II
  • 0 kudos

Hello, same issue here: files and wheel are deployed and present in the Databricks workspace, but the jobs are not created.

2 More Replies
TheoDeSo
by New Contributor III
  • 17341 Views
  • 8 replies
  • 5 kudos

Resolved! Error when writing output from Azure Databricks to a blob storage account

Hello, after implementing the use of a Secret Scope to store secrets in an Azure Key Vault, I faced a problem. When writing an output to the blob I get the following error: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access con...
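Without knowing the exact root cause in this thread, a minimal sketch of the usual wiring, reading the storage key from the secret scope before writing to wasbs://, looks like the following; the account, container, scope, and key names are placeholders.

storage_account = "mystorageaccount"   # placeholder
container = "output"                   # placeholder

# Pull the storage account key from the Key Vault-backed secret scope.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="my-keyvault-scope", key="storage-account-key"),
)

df.write.mode("overwrite").parquet(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/path/to/output"
)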

Latest Reply
nguyenthuymo
New Contributor III
  • 5 kudos

Hi all, is it correct that Azure Databricks only supports writing data to Azure Data Lake Gen2 and does not support Azure Storage Blob (StorageV2 - general purpose)? In my case, I can read the data from Azure Storage Blob (StorageV2 - general purp...

7 More Replies
