Data Engineering

Forum Posts

by RabahO • New Contributor II

12-26-2023 6:45:12 AM

542 Views
1 reply
0 kudos

Handling data close to SCD2 with Delta tables

Hello, stack used: PySpark and Delta tables. I'm working with some data that looks a bit like SCD2 data. Basically, the data has columns that represent an id, a rank column, and other information. Here's an example: login, email, business_timestamp => the...

Latest Reply

Wojciech_BUK
Contributor III

12-26-2023 2:19:43 PM

0 kudos

Your problem is exactly like SCD2. You just add one more column with a valid-to date (optionally, you can add an is-actual flag to tag current records). You can use the DLT APPLY CHANGES syntax or, alternatively, a MERGE statement. On top of that table you can bu...
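The valid-to idea from the reply can be illustrated without Spark. What follows is a minimal plain-Python sketch, not the poster's code: it assumes rows keyed by `login` and ordered by `business_timestamp` (the column names from the question), while the `valid_to` and `is_actual` names are hypothetical. On Delta tables the same result would typically come from a window function or a MERGE.

```python
from itertools import groupby
from operator import itemgetter


def add_validity(rows, key="login", ts="business_timestamp"):
    """Return copies of rows with SCD2-style valid_to and is_actual columns.

    valid_to is the timestamp of the next version of the same key,
    or None for the current (latest) version.
    """
    ordered = sorted(rows, key=itemgetter(key, ts))
    result = []
    for _, versions in groupby(ordered, key=itemgetter(key)):
        versions = list(versions)
        # Pair every version with its successor; the last one has none.
        for current, nxt in zip(versions, versions[1:] + [None]):
            closed = dict(current)
            closed["valid_to"] = nxt[ts] if nxt else None
            closed["is_actual"] = nxt is None
            result.append(closed)
    return result


# Made-up sample rows for illustration only.
history = add_validity([
    {"login": "a", "email": "a@new.example", "business_timestamp": "2023-02-01"},
    {"login": "a", "email": "a@old.example", "business_timestamp": "2023-01-01"},
    {"login": "b", "email": "b@one.example", "business_timestamp": "2023-01-15"},
])
```

Each older version is closed by the timestamp of the version that replaced it, and only the latest version per `login` stays flagged as current.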

by Databricks_POC • New Contributor II

12-20-2021 1:14:14 AM

13611 Views
6 replies
6 kudos

Resolved! I want to compare two data frames. In output I wish to see unmatched Rows and the columns identified leading to the differences.

Latest Reply

bhargavi1
New Contributor II

04-28-2022 1:53:19 AM

6 kudos

@vinita shinde, have you cracked this code?

5 More Replies

by lorenz • New Contributor III

06-28-2023 7:21:26 AM

5033 Views
3 replies
1 kudos

Resolved! Databricks approaches to CDC

I'm interested in learning more about Change Data Capture (CDC) approaches with Databricks. Can anyone provide insights on the best practices and recommendations for utilizing CDC effectively in Databricks? Are there any specific connectors or tools ...

Latest Reply

jcozar
Contributor

12-26-2023 5:50:08 AM

1 kudos

Hi, first of all, thank you all in advance! I am very interested in this topic! My question goes beyond what is described here. Like @Pektas, I am using Debezium to send data from Postgres to a Kafka topic (in fact, Azure EventHub). My question...

2 More Replies

by Aidin • New Contributor II

12-22-2023 10:57:23 AM

3316 Views
4 replies
0 kudos

BINARY data type

Hello everyone. I'm trying to understand how the BINARY data type works in Spark SQL. According to the examples in the documentation, using cast or the literal 'X' should return a HEX representation of the binary data, but when I try the same code, I see base6...

Latest Reply

Wojciech_BUK
Contributor III

12-23-2023 9:07:20 AM

0 kudos

If you are confused, please look at this thread; it explains that Databricks uses base64 as the binary default. This is not documented but can be traced at the source-code level. https://stackoverflow.com/questions/75753311/not-getting-binary-value-in-datab...
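The difference between the two renderings is easy to see outside Spark. The snippet below is a small plain-Python sketch with arbitrary sample bytes (not taken from the thread): the same binary value is shown once as hexadecimal and once as base64, and both strings decode back to identical bytes.

```python
import base64

raw = b"Spark"  # arbitrary sample bytes, chosen only for illustration

as_hex = raw.hex().upper()                          # hexadecimal rendering
as_base64 = base64.b64encode(raw).decode("ascii")   # base64 rendering

# Both strings are just different views of the same underlying bytes.
assert bytes.fromhex(as_hex) == raw
assert base64.b64decode(as_base64) == raw

print(as_hex, as_base64)
```

So a base64 string in the result pane does not mean the stored value differs from what the documentation shows; only the display encoding does.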

3 More Replies

by sahesh1320 • New Contributor

12-22-2023 9:16:35 AM

301 Views
1 reply
0 kudos

Shutdown Cluster in script if there is any failure

I am working on an incremental load from SQL Server to Delta Lake tables stored in ADLS Gen2. During the script I need to write logic to shut down the DB cluster on failure (there needs to be logging added to ensure that shutdown happens promptly to pr...

Latest Reply

Wojciech_BUK
Contributor III

12-22-2023 11:00:28 AM

0 kudos

If you run your notebook via a workflow, an error happens, and there are no retries on the job, then the job cluster will be terminated immediately after the failure. You can add a Python try/except block, and if an error occurs, you catch the error and log it somewhere bef...
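The catch, log, and fail pattern described in the reply might look like the following minimal sketch. The logger name and the `run_with_failure_logging` helper are hypothetical; where the log ends up (a table, a file, a monitoring service) is left open here.

```python
import logging

logger = logging.getLogger("incremental_load")  # hypothetical logger name


def run_with_failure_logging(step, *args, **kwargs):
    """Run one load step; on error, log it and re-raise so the job run fails."""
    try:
        return step(*args, **kwargs)
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("Step %s failed", getattr(step, "__name__", step))
        # Re-raising keeps the failure visible to the job instead of hiding it.
        raise
```

Re-raising matters: if the exception were swallowed, the run would be reported as successful and the failure-driven termination described above would not apply.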

by dbx-user7354 • New Contributor III

12-22-2023 12:17:07 AM

466 Views
1 reply
0 kudos

Remove description from job

How do I remove a description from a job completely? When I try to just remove the text in the edit window, the same text shows up afterwards, even though it says "Successfully updated job". Also I had to write this twice, because on the first try I ...

Latest Reply

Wojciech_BUK
Contributor III

12-22-2023 6:36:20 AM

0 kudos

Hi, this is not possible from the UI. You have to replace the content with, e.g., a white space. I think this is a bug. But you can do it using the Jobs API! Below is an example in PowerShell; just replace: job_id, token, workspaceURL. $body = @' { "job_id": 123456789, "new_setti...
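For readers who prefer Python over PowerShell, a request of this kind could be built as in the sketch below. It is not a translation of the truncated payload above: it assumes the Jobs API 2.1 `jobs/update` endpoint and its `fields_to_remove` parameter, and that `description` is accepted there, which should be checked against the API reference. The workspace URL, token, and job ID are placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def build_remove_description_request(workspace_url, token, job_id):
    """Build (but do not send) a jobs/update request that drops the description."""
    # Assumption: "description" is a valid entry for fields_to_remove.
    payload = {"job_id": job_id, "fields_to_remove": ["description"]}
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.1/jobs/update",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values only; replace with a real workspace URL, token, and job ID.
request = build_remove_description_request(
    "https://example.cloud.databricks.com", "<personal-access-token>", 123456789
)
# urllib.request.urlopen(request) would actually send it.
```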

by ksenija • New Contributor III

12-21-2023 6:45:20 AM

1492 Views
5 replies
5 kudos

How to change cluster size using a script

I want to change instance type or number of max workers via a python script. Does anyone know how to do it/is it possible? I have a lot of background jobs when I want to scale down my workers, so autoscaling is not an option. I was getting an error t...

Latest Reply

Wojciech_BUK
Contributor III

12-21-2023 3:12:22 PM

5 kudos

Hi ksenija, this is just my guess, but maybe your cluster uses a cluster policy that only allows specific cluster sizes? E.g., below is a cluster policy that limits clusters to some sizes only.
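A policy of that kind could look like the following sketch. The definition is hypothetical: the node types and the worker limit are made-up examples, written in the attribute-path style that cluster policy definitions use (`allowlist` and `range` rules). The small `violations` checker is a simplification for illustration and is not how Databricks evaluates policies.

```python
import json

# Hypothetical policy definition; node types and limits are examples only.
policy_definition = {
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
}


def violations(cluster_spec, policy):
    """Return the attribute paths in cluster_spec that break the policy."""
    broken = []
    for path, rule in policy.items():
        value = cluster_spec.get(path)
        if value is None:
            continue  # attribute not requested, nothing to check
        if rule["type"] == "allowlist" and value not in rule["values"]:
            broken.append(path)
        elif rule["type"] == "range":
            too_high = value > rule.get("maxValue", value)
            too_low = value < rule.get("minValue", value)
            if too_high or too_low:
                broken.append(path)
    return broken


# A resize request (flattened to attribute paths) that the policy would reject.
requested = {"node_type_id": "Standard_DS5_v2", "autoscale.max_workers": 16}
print(json.dumps(policy_definition, indent=2))
print(violations(requested, policy_definition))
```

If a script tries to set an instance type or worker count outside such limits, the change is rejected regardless of how the request is made.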

4 More Replies

by SamGreene • Contributor

12-21-2023 2:33:13 PM

1788 Views
6 replies
3 kudos

Change DLT table type from streaming to 'normal'

I have a DLT streaming live table, and after watching a QA session, I saw that it is advised to only use streaming tables for your raw landing. I attempted to modify my pipeline to have my silver table be a regular LIVE TABLE, but an error was throw...

Latest Reply

quakenbush
Contributor

12-22-2023 12:55:09 AM

3 kudos

Just curious, could you point me to said QA session if it's a video or something? I'm not aware of such a limitation. You can use DLT's live streaming tables anywhere in the Medallion architecture, just make sure not to break stream composability by ...

5 More Replies

by quakenbush • Contributor

12-21-2023 5:05:39 AM

770 Views
2 replies
0 kudos

Delta Lake, CFD & SCD2

Hi. What's the best way to deal with SCD2-styled tables in the silver and/or gold layer while streaming? From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes). In an SCD2 scenario, we need to ...

Latest Reply

quakenbush
Contributor

12-22-2023 12:10:05 AM

0 kudos

I did some further reading and came to the same conclusion. APPLY CHANGES might do the trick. However, I don't like the limitations. From Bronze to Silver I might need .foreachBatch to implement the JSON-logic and the attribute names (__start_at / __end_...

1 More Replies

by lena1 • New Contributor

12-21-2023 12:17:03 AM

425 Views
1 reply
0 kudos

Resource exhaustion when using default apply_changes python functionality

Hello! We are currently setting up streaming CDC pipelines for more than 500 tables. Due to the high number of tables, we split them across multiple DLT pipelines per layer: bronze, silver, gold. In silver, we only upsert ...

Latest Reply

Wojciech_BUK
Contributor III

12-21-2023 3:45:37 PM

0 kudos

Hi Lena1, there is no magic behind the scenes. If you write a readStream from the bronze table and a writeStream with foreachBatch(function), and in the function you write a MERGE statement, this will have similar performance. Maybe there is a lot of shuffling ha...
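What the per-batch function does can be shown in plain Python. This is an illustrative sketch only, with hypothetical `id` and `seq` column names: a dictionary stands in for the target table, and the loop plays the role of a MERGE that updates a row only when the incoming version is newer. In Spark, the equivalent function would be passed to `foreachBatch` and run a Delta MERGE instead.

```python
def merge_batch(target, batch, key="id", sequence="seq"):
    """Upsert one micro-batch into target (a dict keyed by `key`), in place.

    Like a MERGE with an update condition on the sequence column, a row is
    written only if its key is new or its sequence is higher than the stored one.
    """
    for row in batch:
        existing = target.get(row[key])
        if existing is None or row[sequence] > existing[sequence]:
            target[row[key]] = row
    return target


# Two made-up micro-batches; the second contains a late, out-of-order row.
silver = {}
merge_batch(silver, [{"id": 1, "seq": 1, "name": "old"}, {"id": 2, "seq": 1, "name": "x"}])
merge_batch(silver, [{"id": 1, "seq": 3, "name": "new"}, {"id": 1, "seq": 2, "name": "late"}])
```

The late row with the lower sequence number is ignored, which is the behavior a sequencing column gives an upsert.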

by Long_Tran • New Contributor

12-21-2023 2:47:40 AM

497 Views
1 reply
0 kudos

Can job 'run_as' be assigned to users/principals who actually run it?

Can job 'run_as' be assigned to the users/principals who actually run it, instead of always a fixed creator/user/principal? When a job is run, I would like to see in the job setting "run_as" the name of the actual user/principal who runs it. Currently, "run...

Latest Reply

Wojciech_BUK
Contributor III

12-21-2023 3:24:26 PM

0 kudos

This is not available in Workflows/Jobs. A job should never be run as the person who is executing it, especially in production. The reason is that the output might differ based on who is running the job (e.g., different row-level access). If...

by esauesp_co • New Contributor III

01-23-2023 3:31:54 PM

2650 Views
5 replies
1 kudos

Resolved! My jobs and cluster were deleted in a suspicious way

I want to know what happened to my cluster and whether I can recover it. I logged in to my Databricks account and didn't find my jobs or my cluster. I couldn't find any log of the deleted cluster because the log is in the cluster interface. I entered t...

Latest Reply

Sid_databricks
New Contributor II

12-21-2023 3:22:29 AM

1 kudos

Dear folks, when the tables have been deleted, why am I unable to create a table with the same name? It continuously gives me the error "DeltaAnalysisException: Cannot create table ('`spark_catalog`.`default`.`Customer_Data`'). The associated location ('...

4 More Replies

by MattPython • New Contributor

02-01-2023 5:20:15 AM

10395 Views
4 replies
0 kudos

How do you read files from the DBFS with OS and Pandas Python libraries?

I created translations for decoded values and want to save the dictionary object to the DBFS for mapping. However, I am unable to access the DBFS without using dbutils or the PySpark library. Is there a way to access the DBFS with OS and Pandas Python libra...

Latest Reply

User16789202230
New Contributor II

12-21-2023 2:38:02 AM

0 kudos

db_path = 'file:///Workspace/Users/l<xxxxx>@databricks.com/TITANIC_DEMO/tested.csv'
df = spark.read.csv(db_path, header="True", inferSchema="True")
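For the `os` and pandas side of the question, one commonly used route is the local `/dbfs` mount that many Databricks clusters expose; whether it is available depends on the compute type, so treat that as an assumption. The sketch below uses only the standard library: the `dbfs_to_local` and `save_mapping` helpers are hypothetical names, and a temporary directory stands in for `/dbfs` so the example runs anywhere.

```python
import json
import os
import tempfile


def dbfs_to_local(path, mount="/dbfs"):
    """Translate a 'dbfs:/...' URI into the matching local path under the mount."""
    if path.startswith("dbfs:/"):
        return os.path.join(mount, path[len("dbfs:/"):].lstrip("/"))
    return path


def save_mapping(mapping, dbfs_path, mount="/dbfs"):
    """Write a dictionary as JSON using only standard file APIs."""
    local_path = dbfs_to_local(dbfs_path, mount)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle)
    return local_path


# A temporary directory stands in for /dbfs; the mapping is made-up sample data.
with tempfile.TemporaryDirectory() as fake_mount:
    written = save_mapping({"A1": "active"}, "dbfs:/tmp/mappings/codes.json", mount=fake_mount)
    with open(written, encoding="utf-8") as handle:
        restored = json.load(handle)
```

On a cluster where the mount exists, the same translated local path could be handed to `os.listdir` or `pandas.read_csv`, since both work on ordinary file paths.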

3 More Replies

by SimonXu • New Contributor II

12-01-2022 7:31:36 PM

5649 Views
6 replies
15 kudos

Resolved! Failed to launch pipeline cluster

Hi, there. I encountered an issue when I was trying to create my delta live table pipeline. The error is "DataPlaneException: Failed to launch pipeline cluster 1202-031220-urn0toj0: Could not launch cluster due to cloud provider failures. azure_error...

Latest Reply

arpit
Contributor III

01-26-2023 11:20:38 AM

15 kudos

@Simon Xu I suspect that DLT is trying to grab some machine types that you simply have zero quota for in your Azure account. By default, the machine types below get requested behind the scenes for DLT: AWS: c5.2xlarge; Azure: Standard_F8s; GCP: e2-standard-8...
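One way to avoid depending on a default machine type is to name a node type explicitly in the pipeline settings. The fragment below is a hedged sketch, not the remainder of the truncated reply: it assumes the `clusters` section of the DLT pipeline settings with a `default` label, and the node type is a placeholder to be swapped for one the subscription actually has quota for.

```python
import json

# Placeholder node type; pick one your cloud account has quota for.
NODE_TYPE = "Standard_DS3_v2"

# Assumed shape of the "clusters" section in DLT pipeline settings.
pipeline_settings_fragment = {
    "clusters": [
        {
            "label": "default",
            "node_type_id": NODE_TYPE,
            "driver_node_type_id": NODE_TYPE,
        }
    ]
}

print(json.dumps(pipeline_settings_fragment, indent=2))
```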

5 More Replies

by rendorHaevyn • New Contributor III

06-20-2023 4:53:14 PM

3233 Views
3 replies
0 kudos

Databricks SQL Warehouse did not auto stop after specified 90 minute interval - why not?

In this specific case, we're running a 2XSmall SQL Warehouse on Databricks SQL. Looking at the SQL Warehouse monitoring log for this cluster, we noticed: the final query was executed by a user at 10:26 on 2023-06-20; there was no activity for some time, yet the cluster remai...

Latest Reply

Emil_Kaminski
Contributor

12-20-2023 1:28:33 PM

0 kudos

@Michael42 sounds like some sort of horror story. Let us know how it goes. It has happened to me as well, but I was lucky enough to have this situation on a very small compute cluster for just a couple of days.

2 More Replies
