Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

hiryucodes
by Databricks Employee
  • 2414 Views
  • 6 replies
  • 4 kudos

ModuleNotFound when running DLT pipeline

My new DLT pipeline gives me a ModuleNotFound error when I try to request data from an API. For some more context, I develop in my local IDE and then deploy to Databricks using asset bundles. The pipeline runs fine if I try to write a static datafram...

Latest Reply
AFH
New Contributor II
  • 4 kudos

Same problem here!

5 More Replies
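For anyone hitting the same ModuleNotFound: the usual culprit is that helper modules deployed with the bundle are not on the interpreter's import path. A minimal sketch, assuming the pipeline source is a Python file and a hypothetical helper module utils/api_client.py is deployed alongside it:

```python
# Sketch only: make a bundled helper module importable inside the pipeline.
# "utils/api_client.py" is a hypothetical module deployed by the bundle.
import os
import sys

# The deployed pipeline's working directory is not necessarily the bundle
# root, so add the source folder to sys.path before importing local code.
bundle_src = os.path.join(os.path.dirname(os.path.abspath(__file__)), "utils")
if bundle_src not in sys.path:
    sys.path.append(bundle_src)

import api_client  # hypothetical wrapper around the external API
```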
Firehose74
by New Contributor III
  • 2405 Views
  • 1 reply
  • 0 kudos

Duplicates detected in transformed data - Help with troubleshooting

Hello, can anyone help with an error I am getting when running ADF? An ingestion pipeline fails, and when I click on the link I am taken to a Databricks error message: "7 duplicates detected in transformed data". However, when I run the transformation ce...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi @Firehose74, this may need a deeper investigation and require workspace access to troubleshoot and review the logs. Can you please raise a ticket with us?

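Pending that deeper investigation, the reported count can usually be reproduced by hand. A minimal sketch, with a hypothetical table name and key column:

```python
# Sketch: count duplicate business keys in the transformed output.
# Table name and key column "id" are hypothetical stand-ins.
from pyspark.sql import functions as F

df = spark.table("catalog.schema.transformed_table")

dupes = df.groupBy("id").count().filter(F.col("count") > 1)
print(f"{dupes.count()} duplicate keys detected")
dupes.show(truncate=False)
```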
Sadam97
by New Contributor III
  • 733 Views
  • 1 reply
  • 0 kudos

Cancelling a running job kills the parent process and does not wait for streams to stop

Hi, we have created Databricks jobs and each has multiple tasks. Each task is a 24/7 running stream with checkpointing enabled. We want state to be preserved when we cancel and rerun the job, but it seems that when we cancel the job run it kills the parent proces...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi @Sadam97, this seems to be expected behaviour if you are running the jobs on a job cluster: in job clusters, the Databricks job scheduler treats all streaming queries within a task as belonging to the same job execution context. If any query fai...

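Given that cancellation is abrupt, a common mitigation is a controlled shutdown step that stops the queries before the job ends. A minimal sketch (this is explicitly not what job cancellation itself does):

```python
# Sketch: stop each active streaming query gracefully so checkpointed
# state is committed before the process exits.
for query in spark.streams.active:
    print(f"Stopping stream: {query.name}")
    query.stop()              # request a graceful stop
    query.awaitTermination()  # block until the query has fully stopped
```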
mkEngineer
by New Contributor III
  • 833 Views
  • 6 replies
  • 2 kudos

How to preserve job run history when deploying with DABs

Hi, I'm having an issue when deploying jobs with DABs. Each time I deploy changes, the existing job gets overwritten: the job name stays the same, but a new job ID is created. This causes the history of past runs to be lost. Ideally, I'd like to update...

Latest Reply
Coffee77
Contributor III
  • 2 kudos

Even with different keys but the same names, the original jobs should indeed remain unless you destroy them.

5 More Replies
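The thread's takeaway is that bundles track a job by its resource key, not its display name; a new key means a destroy-and-recreate, which is when the job ID and run history change. A hypothetical resources/jobs/my_job.yml illustrating the distinction:

```yaml
# Sketch: the resource key ("my_job") is the bundle's identity for the job.
# Keep it stable across deployments and the same job ID is updated in place;
# the display name below can change freely.
resources:
  jobs:
    my_job:             # stable key -> stable job ID and run history
      name: "My Job"    # display name only
```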
echozhuoocl
by New Contributor II
  • 430 Views
  • 2 replies
  • 0 kudos

Delta Sharing presigned URL was removed, what should I do?

Caused by: java.lang.IllegalStateException: table s3a://dmsa/tmp/the_credential_of_deltasharing/on_prem_deltasharing.share#on-prem-delta-sharing.dmsa_in_nrt.shp_rating_snapshot was removed
  at org.apache.spark.delta.sharing.CachedTableManager.getPre...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @echozhuoocl, did you VACUUM your table? If you're not sure, run: DESCRIBE HISTORY catalog.schema.table

1 More Replies
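To run the suggested check and spot a VACUUM in the table history, a quick sketch (the table name is a hypothetical stand-in):

```python
# Sketch: look for VACUUM operations, which can invalidate the presigned
# URLs cached by Delta Sharing recipients.
history = spark.sql("DESCRIBE HISTORY catalog.schema.shp_rating_snapshot")
(history.filter("operation LIKE 'VACUUM%'")
        .select("timestamp", "operation", "operationParameters")
        .show(truncate=False))
```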
Puru20
by New Contributor III
  • 751 Views
  • 3 replies
  • 6 kudos

Resolved! Pass the job even if a specific task fails

Hi, I have multiple data pipelines and each has a data quality check as its final task, which runs on dbt. Altogether 1500 test cases run every day, and the results are captured on a dashboard. Is there a way to pass the job even if this particular tal...

Latest Reply
Puru20
New Contributor III
  • 6 kudos

Hi @szymon_dybczak, the solution works perfectly when I set the leaf job to pass irrespective of the dbt test task status. Thanks very much!

2 More Replies
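Assuming the accepted answer used the task-level "Run if" dependency setting, a hypothetical job fragment showing a leaf task that runs and passes regardless of the dbt test outcome:

```yaml
# Sketch: the final task runs even if dbt_tests failed, so the run can
# still complete. Task keys are hypothetical.
tasks:
  - task_key: dbt_tests
    # ... dbt task definition ...
  - task_key: publish_results
    depends_on:
      - task_key: dbt_tests
    run_if: ALL_DONE   # run whether dbt_tests succeeded or failed
```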
ismaelhenzel
by Contributor III
  • 545 Views
  • 1 reply
  • 0 kudos

Schema Evolution/Type Widening in Materialized Views

My team is migrating pipelines from Spark to Delta Live Tables (DLT), but we've found that some important features, like schema evolution for tables with enforced schemas, seem to be missing. In DLT, we can define schemas, set primary and foreign key...

Latest Reply
nayan_wylde
Esteemed Contributor
  • 0 kudos

DLT supports schema evolution, but changing column data types (like from DECIMAL(10,5) to DECIMAL(11,5)) is not automatically handled. Here's how you can manage it:
Option 1: Full Refresh with Schema Update
If you're okay with refreshing the materializ...

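A minimal sketch of that first option, with hypothetical names: widen the declared type in the pipeline code, then trigger a full refresh so the materialized view is rebuilt under the new schema.

```python
import dlt

# Sketch: the declared schema now carries the widened DECIMAL(11,5) type
# (previously DECIMAL(10,5)); a full refresh rebuilds the table with it.
@dlt.table(schema="id BIGINT, amount DECIMAL(11,5)")
def my_table():
    return spark.read.table("catalog.schema.source")
```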
zc
by New Contributor III
  • 5429 Views
  • 9 replies
  • 7 kudos

Resolved! Use Array in WHERE IN clause

This is what I'm trying to do using SQL:

create table check1 as
select * from dataA
where IDs in ('12483258','12483871','12483883');

The list of IDs is much longer and may change, so I want to use a variable for it. This is what I have tried: decla...

Latest Reply
BS_THE_ANALYST
Esteemed Contributor III
  • 7 kudos

Nice solutions! @ManojkMohan @WiliamRosa I love the use of the temp view for the intermediate result. The array_contains is also a really nice touch. @ManojkMohan when you write "SET VARIABLE ids = ARRAY('12483258','12483871','12483883');" ... can th...

8 More Replies
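For reference, the pattern the replies converge on, sketched end to end (dataA, check1, and the IDs column come from the question):

```python
# Sketch: hold the ID list in a session variable and filter with
# array_contains instead of a literal IN list.
spark.sql("DECLARE OR REPLACE VARIABLE ids ARRAY<STRING>")
spark.sql("SET VARIABLE ids = ARRAY('12483258', '12483871', '12483883')")
spark.sql("""
    CREATE OR REPLACE TABLE check1 AS
    SELECT * FROM dataA
    WHERE array_contains(ids, IDs)
""")
```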
drag7ter
by Contributor
  • 11318 Views
  • 7 replies
  • 4 kudos

Resolved! foreachBatch doesn't work in structured streaming

I'm trying to print out the number of rows in the batch, but it doesn't seem to work properly. I have a 1-node compute-optimized cluster and run this code in a notebook:

# Logging the row count using a streaming-friendly approach
def log_row_count(batch_df, ba...

Latest Reply
saffovski
New Contributor II
  • 4 kudos

Hi, I am facing the exact same error. The method that I'm calling in the foreachBatch is just a very simple print statement that tests whether the method is called or not, and nothing is printed out. Here's a code snippet:

def debug_batch(batch_df...

6 More Replies
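One common explanation in these threads: the function does run, but print output from the streaming thread lands in the driver log rather than the notebook cell, so it looks like nothing happened. A minimal sketch with hypothetical source and checkpoint paths:

```python
# Sketch: the print below appears in the cluster's driver log output,
# not inline in the notebook.
def log_row_count(batch_df, batch_id):
    print(f"batch {batch_id}: {batch_df.count()} rows")

(spark.readStream.table("catalog.schema.events")
      .writeStream
      .foreachBatch(log_row_count)
      .option("checkpointLocation", "/tmp/checkpoints/log_row_count")
      .start())
```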
Rainier_dw
by New Contributor III
  • 1554 Views
  • 6 replies
  • 6 kudos

Resolved! Rollbacks/deletes on streaming table

Hi all — I’m running a Medallion streaming pipeline on Databricks using DLT (bronze → staging silver view → silver table). I ran into an issue and would appreciate any advice or best practices.What I’m doingIngesting streaming data into a streaming b...

Latest Reply
dalcuovidiu
New Contributor III
  • 6 kudos

I'm not entirely sure if I’m missing something here, but as far as I know there’s a golden rule in DWH applications: you never hard delete records, you use soft deletes instead. So I’m a bit puzzled why a hard delete is being used in this case.

5 More Replies
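A minimal sketch of the soft-delete pattern that reply advocates, with hypothetical table, keys, and flag columns:

```python
# Sketch: flag rows as deleted instead of physically removing them, so
# history is preserved and downstream consumers can filter on the flag.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

silver = DeltaTable.forName(spark, "catalog.schema.silver_table")
silver.update(
    condition=F.col("id").isin(101, 102),   # rows to retire
    set={
        "is_deleted": F.lit(True),
        "deleted_at": F.current_timestamp(),
    },
)
```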
ChingizK
by New Contributor III
  • 2857 Views
  • 5 replies
  • 2 kudos

Exclude a job from bundle deployment in PROD

My question is regarding Databricks Asset Bundles. I have defined a databricks.yml file the following way:

bundle:
  name: my_bundle_name
include:
  - resources/jobs/*.yml
targets:
  dev:
    mode: development
    default: true
    workspace: ...

Latest Reply
Coffee77
Contributor III
  • 2 kudos

Me too, no clean solution yet. As a workaround, I first implemented an "extra" control in specific jobs that should never run in PROD, blocking execution based on an environment variable present on all clusters (I don't really like it much, but it was effective). As...

4 More Replies
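Beyond the environment-variable guard, one hedged alternative: resources declared inside a specific target are only deployed for that target, so a dev-only job can live under the dev target. A hypothetical databricks.yml fragment:

```yaml
# Sketch: my_dev_only_job exists only in the dev target's resources, so a
# prod deployment never creates it.
targets:
  dev:
    mode: development
    default: true
    resources:
      jobs:
        my_dev_only_job:
          name: "[dev] my_dev_only_job"
          # ... tasks ...
  prod:
    mode: production
    # my_dev_only_job is intentionally not declared here
```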
Datalight
by Contributor
  • 1294 Views
  • 10 replies
  • 3 kudos

Resolved! High-Level Design for Transferring Data from One Databricks Account to Another

Hi, could someone please help me with just the points which should be part of the high-level design and low-level design when transferring data from one Databricks account to another using Unity Catalog? A full data transfer the first time, and then ...

Latest Reply
Coffee77
Contributor III
  • 3 kudos

Based on my previous reply, you can use DEEP CLONE to clone data incrementally between workspaces by including it in a scheduled job, but indeed this will not work in real time.

9 More Replies
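To make that concrete: rerunning DEEP CLONE against an existing target copies only the changes since the previous clone, so scheduling the statement yields incremental (not real-time) sync. A sketch with hypothetical names, assuming the source is reachable from the target workspace (e.g. via Delta Sharing):

```python
# Sketch: an idempotent, incremental copy; rerun on a schedule.
spark.sql("""
    CREATE OR REPLACE TABLE target_catalog.schema.my_table
    DEEP CLONE source_catalog.schema.my_table
""")
```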
dalcuovidiu
by New Contributor III
  • 2197 Views
  • 11 replies
  • 10 kudos

DLT - SCD 2 - detect deletes

Hello, I have a question related to APPLY AS DELETE WHEN... If the source table does not have a column that specifies whether a record was deleted, I am currently using a workaround by ingesting synthetic data with a soft_deletion flag. In the future, ...

Latest Reply
dalcuovidiu
New Contributor III
  • 10 kudos

OK. In my case I qualify for: incremental without a delete flag (the classic case). Generate synthetic tombstones via an anti-join between the current set of keys and the target's active keys. I don't want to use MERGE; that's why my question was for C...

10 More Replies
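A minimal sketch of that anti-join tombstone approach; names are hypothetical except __END_AT, the column DLT uses to mark closed SCD2 rows:

```python
# Sketch: keys still active in the target but absent from the latest
# extract become synthetic delete rows for APPLY AS DELETE WHEN.
from pyspark.sql import functions as F

source_keys = spark.table("catalog.schema.latest_extract").select("id")
active_keys = (
    spark.table("catalog.schema.silver_scd2")
         .filter(F.col("__END_AT").isNull())
         .select("id")
)

tombstones = (
    active_keys.join(source_keys, on="id", how="left_anti")
               .withColumn("is_deleted", F.lit(True))
)
```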
BS_THE_ANALYST
by Esteemed Contributor III
  • 1764 Views
  • 2 replies
  • 3 kudos

Resolved! Opinions/Thoughts: SQL Best Practices in Production .. DBT vs DLT ?

Hey everyone, I'd like to hear the community's experiences with DLT (Lakeflow Declarative Pipelines) vs DBT. 1. Why would one choose one instead of the other? 2. How does picking one of these level up your SQL strategy? I am somebody who's well-ver...

Latest Reply
BS_THE_ANALYST
Esteemed Contributor III
  • 3 kudos

That's a cracking write-up @szymon_dybczak, thanks for that. That's certainly given me some food for thought. I think the safest option here, at least for me, is digging into both of them. I feel better informed moving forward with this. I'd love to ...

1 More Replies
SanneJansen564
by Contributor
  • 1160 Views
  • 11 replies
  • 5 kudos

Ensuring Row Order When Importing CSV with COPY INTO

Hi everyone, I have a CSV file stored in S3, and it's critical for my process that the rows are loaded in the exact order they appear in the file. Does the COPY INTO command preserve the original row order during the load? I need to make sure the bronz...

Latest Reply
WiliamRosa
Contributor III
  • 5 kudos

Thanks @SanneJansen564

10 More Replies
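The recurring guidance on this question: Spark makes no row-order guarantee, so an explicit ordering key beats relying on load order. A sketch assuming the CSV already carries a hypothetical line_no column; table and path are stand-ins:

```python
# Sketch: load with COPY INTO, then sort on the explicit key when reading,
# instead of assuming rows arrive in file order.
spark.sql("""
    COPY INTO bronze.schema.my_table
    FROM 's3://my-bucket/landing/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")

ordered = spark.table("bronze.schema.my_table").orderBy("line_no")
```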
