Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

yvishal519
by Contributor
  • 8041 Views
  • 4 replies
  • 0 kudos

Resolved! Implementing Full Load Strategy with Delta Live Tables and Unity Catalog

Hello Databricks Community, I am seeking guidance on handling full load scenarios with Delta Live Tables (DLT) and Unity Catalog. Here's the situation I'm dealing with: we have a data folder in Azure Data Lake Storage (ADLS) where we use Auto Loader to...

Latest Reply
yvishal519
Contributor
  • 0 kudos

To efficiently manage full data loads, we can leverage a regex pattern to dynamically identify the latest data folders within our bronze layer. These folders typically contain the most recent data updates for our tables. By using a Python script, we ...
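A rough sketch of that approach (the ADLS path, folder layout, and ISO-dated folder names are assumptions, not details from the thread; dbutils is available in Databricks notebooks):

import re

base_path = "abfss://container@account.dfs.core.windows.net/bronze/my_table/"  # hypothetical path
date_pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # folders named like 2024-06-01

folder_names = [f.name.rstrip("/") for f in dbutils.fs.ls(base_path)]
dated = sorted(name for name in folder_names if date_pattern.match(name))
latest_path = base_path + dated[-1] + "/"  # lexicographic sort works for ISO dates
print(latest_path)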

3 More Replies
Srujanm01
by New Contributor III
  • 2073 Views
  • 5 replies
  • 0 kudos

Resolved! Job Scheduling Timeline issue

I'm trying to schedule a job. I created the job and ran it manually, and it succeeds, but when I schedule it to run every 14 days the scheduler moves the run to a different date, and I'm not able to understand why this is happenin...

Latest Reply
MadhuB
Valued Contributor
  • 0 kudos

@Srujanm01 Can you try the cron expression "0 0 0 1/14 * ? *"? It does the following:
  • Runs at 00:00:00 (midnight)
  • Every 14 days
  • On any day of the week
  • In any month
Refer to the screens below. Let me know if you need anything else, or mark it as a Solution.
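A side note on Quartz semantics (not stated in the thread): 1/14 in the day-of-month field restarts from day 1 every month, so the job fires on the 1st, 15th, and 29th of each month rather than at a strict 14-day interval, which can look like the scheduler "moving" the date. A minimal sketch of setting such a schedule with the Databricks Python SDK, assuming a hypothetical job ID:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up auth from the environment or ~/.databrickscfg
w.jobs.update(
    job_id=123,  # hypothetical job ID
    new_settings=jobs.JobSettings(
        schedule=jobs.CronSchedule(
            quartz_cron_expression="0 0 0 1/14 * ? *",
            timezone_id="UTC",
        )
    ),
)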

4 More Replies
ankitmit
by New Contributor III
  • 704 Views
  • 1 reply
  • 0 kudos

How to recreate DLT tables from backup if we lose all the files in a disaster

Hi All, we're currently building our new data platform and want to have a backup strategy in place in case of disaster (losing all the data files). We plan to take backups of our Bronze and Silver layer DLT table data files on AWS and would like to know ...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @ankitmit, You may want to see this article: https://www.databricks.com/blog/2023/03/17/production-ready-and-resilient-disaster-recovery-dlt-pipelines.html

VictorFerron
by New Contributor III
  • 3618 Views
  • 4 replies
  • 3 kudos

Resolved! Issue with Maven Dependency Resolution in Databricks (json-smart)

Hi, I'm experiencing issues installing a library in my Databricks cluster due to a Maven dependency resolution error. Specifically, when trying to install or use libraries that depend on net.minidev:json-smart, I get the following error: Library inst...

Latest Reply
MLob
New Contributor II
  • 3 kudos

Thanks VictorFerron, excluding the json-smart dependency worked for installing the affected library with Maven (azure-eventhubs-spark_2.12:2.3.22 in my case). When you say force a stable version instead, do you mean explicitly install the json-smart library on clu...
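For reference, a sketch of installing the library with the transitive dependency excluded via the Databricks Python SDK (the cluster ID is a placeholder; the same exclusion can also be entered in the cluster library UI):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library, MavenLibrary

w = WorkspaceClient()
w.libraries.install(
    cluster_id="0123-456789-abcdefgh",  # placeholder cluster ID
    libraries=[
        Library(
            maven=MavenLibrary(
                coordinates="com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22",
                exclusions=["net.minidev:json-smart"],  # skip the broken transitive dependency
            )
        )
    ],
)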

3 More Replies
Scanning
by New Contributor II
  • 713 Views
  • 2 replies
  • 0 kudos

Git Server Proxy won't stay up

We are running a job with multiple tasks that require the Git Server Proxy to remain operational for the entire duration of the job. Since each task may need access to the proxy, and job runtimes vary from brief to several hours, what is the best appr...

Latest Reply
Scanning
New Contributor II
  • 0 kudos

The Git Server Proxy is needed for various tasks within the job, and multiple jobs may rely on it. Manually terminating it at the end of job1 could lead to job2 failing.

1 More Replies
Junda
by New Contributor III
  • 3559 Views
  • 3 replies
  • 0 kudos

How to interpret Spark UI

The images below are the DAG and text execution summary from the Spark UI, and I'm having a hard time interpreting these logs. I have two questions below. 1. In the Text Execution Summary, the Duration total for WholeStageCodegen (2) says 38.3 m (0 ms, 2.7 m, 16.9 m...

Latest Reply
Junda
New Contributor III
  • 0 kudos

Hi @Alberto_Umana and @jay250terry, thank you for your replies. I know that Spark executes tasks in parallel and that the sum of the individual task execution times does not correspond to the overall job duration. What I don't get from the text execution sum...

2 More Replies
mbravonxp
by New Contributor II
  • 6540 Views
  • 3 replies
  • 3 kudos

Resolved! Unity Catalog for medallion architecture

Hello community. I need help to define the most suitable approach for Unity Catalog. I have the following storage architecture in Azure Data Lake Storage. I have data from different clients. I work with 3 different environments for each client: dev, pr...

Latest Reply
mbravonxp
New Contributor II
  • 3 kudos

Hi both, thanks very much for the useful replies. I will definitely go with your suggestions. Best.

2 More Replies
SivaPK
by New Contributor II
  • 3993 Views
  • 1 reply
  • 0 kudos

UTF-8 conversion of a pandas DataFrame column not reflected in the CSV file written to the workspace?

Hi, I would like to convert a specific column to UTF-8 for all country languages. After converting it, writing it to the workspace, downloading it to my local system, and opening the Excel file, characters from other countries are still not displayed pr...

Latest Reply
filipniziol
Esteemed Contributor
  • 0 kudos

Hi @SivaPK , It may happen that the CSV is fine, but Excel often does not automatically recognize CSV files as UTF-8 (it might assume ANSI or Windows-1252). This is why characters can appear incorrect. Instead, open the CSV in a text editor like Note...
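A common follow-up fix (not from this thread): write the CSV with a UTF-8 byte-order mark, which Excel uses to detect the encoding. A minimal pandas sketch with hypothetical data and path:

import pandas as pd

df = pd.DataFrame({"city": ["München", "İstanbul", "東京"]})  # sample non-ASCII data
# "utf-8-sig" prepends a BOM so Excel opens the file as UTF-8
df.to_csv("/tmp/cities.csv", index=False, encoding="utf-8-sig")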

PTalathi
by New Contributor
  • 2131 Views
  • 2 replies
  • 0 kudos

Databricks Alerts Query result rows not being sent as a part of the email body

I am using a custom template in "Databricks Alerts" with HTML code in it in order to include the query results in the body of the email. But unfortunately the email body contains only the header specified in the HTML code and not the ro...

Latest Reply
ahetesham
New Contributor II
  • 0 kudos

I tried a lot of combinations, but it seems it doesn't support many HTML tags: no style, no color, no formatting. If anyone is aware of a solution, please share it. Also, the alert sends only 100 rows to the email.
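One thing worth checking (based on the Databricks SQL alerts documentation, not this thread): the custom template only renders result rows when the body includes the {{QUERY_RESULT_TABLE}} placeholder, and the embedded table is truncated by the platform, which would be consistent with the 100-row behavior described above.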

1 More Replies
HoussemBL
by New Contributor III
  • 2442 Views
  • 2 replies
  • 0 kudos

DLT- apply_changes() SCD2 is not applying defined schema only for first run

Hello community, I am using the dlt.apply_changes function to implement SCD2. I am specifying the schema of the streaming table that should result from apply_changes(). This schema contains a generated column. Somehow, my DLT pipeline always returns in the first...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @HoussemBL, Here are a few points to consider: Initialization of Generated Columns: Generated columns, such as is_current, rely on the values of other columns (__END_AT in this case) to be correctly populated. During the first run, if the sequ...
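For context, a minimal sketch of the pattern being discussed (table name, source name, and ordering column are hypothetical; the generated-column definition is an assumption about the setup, not code from the thread):

import dlt
from pyspark.sql.functions import col

# Target streaming table with an explicit schema, including a column
# generated from __END_AT (a NULL __END_AT marks the current SCD2 row).
dlt.create_streaming_table(
    name="customers_scd2",
    schema="""
        id INT,
        name STRING,
        __START_AT TIMESTAMP,
        __END_AT TIMESTAMP,
        is_current BOOLEAN GENERATED ALWAYS AS (__END_AT IS NULL)
    """,
)

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc",       # hypothetical CDC source view
    keys=["id"],
    sequence_by=col("event_ts"),  # hypothetical ordering column
    stored_as_scd_type=2,
)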

1 More Replies
Mukul3012
by New Contributor II
  • 1744 Views
  • 2 replies
  • 0 kudos

DLT pipeline table already exists error

Hi All, I have been facing an issue with a few of my DLT pipelines. Source code:
CREATE OR REFRESH STREAMING TABLE ****
TBLPROPERTIES (
  "enableChangeDataFeed" = "true",
  "delta.autoOptimize.optimizeWrite" = "true"
)
AS SELECT *, _metadata.file_path as fi...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

This can happen if the table was not properly dropped or if there is a naming conflict. Before creating or refreshing the table, check if it already exists in the catalog:
SHOW TABLES IN <database_name>;
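A small sketch of that check from a notebook (catalog, schema, and table names are placeholders; only drop the table if no other pipeline still manages it):

# List tables in the target schema and drop the conflicting one if present
tables = spark.sql("SHOW TABLES IN main.bronze").collect()
if any(t.tableName == "my_streaming_table" for t in tables):
    spark.sql("DROP TABLE IF EXISTS main.bronze.my_streaming_table")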

1 More Replies
subhas_hati
by New Contributor
  • 544 Views
  • 1 reply
  • 0 kudos

JOIN Two Big Tables, each being some terabytes.

What is the strategy for joining two big tables, each being some terabytes?

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @subhas_hati, enable AQE to dynamically optimize the join strategy at runtime based on the actual data distribution; this can help in choosing the best join strategy automatically. If you are using Delta tables, you can leverage the MERGE statement...
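For quick reference, the AQE settings involved (on recent Databricks runtimes these are already enabled by default):

# Enable Adaptive Query Execution, partition coalescing, and skew-join handling
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")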

vineet_chaure
by New Contributor
  • 1130 Views
  • 1 reply
  • 0 kudos

Handling Large Integers and None Values in pandas UDFs on Databricks

Hi Everyone, I hope this message finds you well. I am encountering an issue with pandas UDFs on a Databricks shared cluster and would like to seek assistance from the community. Below is a summary of the problem. Description: I am working with pandas UDF...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @vineet_chaure, By default, Spark converts LongType to float64 when transferring data to pandas. You can use the Arrow-optimized pandas UDFs introduced in Apache Spark 3.5. Please try the code below:
import pandas as pd
import pyarrow as pa
from pys...
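Since the code above is cut off, here is a minimal sketch of an Arrow-optimized Python UDF (Spark 3.5+; the column and function names are hypothetical). Passing useArrow=True avoids the lossy LongType-to-float64 round trip for nullable integer columns:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

@udf(returnType=LongType(), useArrow=True)  # Arrow-optimized Python UDF
def add_one(x):
    # None stays None instead of becoming NaN/float64
    return None if x is None else x + 1

df = spark.range(3).select(add_one("id").alias("id_plus_one"))
df.show()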

TejeshS
by Contributor
  • 2803 Views
  • 2 replies
  • 2 kudos

How to apply Column masking and RLS on Views in Databricks

Hello Databricks Community, we are working on a use case where we need to apply column masking and row-level filtering on top of existing views or while creating new views dynamically. Currently, we know that Delta tables support column masking and row...

Latest Reply
MadhuB
Valued Contributor
  • 2 kudos

@TejeshS You can alternatively use the mask function instead of hardcoding the value 'Masked':
CREATE OR REPLACE VIEW masked_employees AS
SELECT Name, Department,
  CASE WHEN current_user() IN ('ab***@gmail.com', 'xy***@gmail.com')
       THEN Salary ELSE mask(...
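A complete, hypothetical version of that pattern (table name and allow-list are placeholders; the built-in mask() takes a string, hence the casts so both CASE branches share a type):

masked_view_sql = """
CREATE OR REPLACE VIEW masked_employees AS
SELECT
  Name,
  Department,
  CASE
    WHEN current_user() IN ('admin@example.com')  -- hypothetical allow-list
    THEN CAST(Salary AS STRING)
    ELSE mask(CAST(Salary AS STRING))             -- mask() obfuscates the characters
  END AS Salary
FROM employees
"""
spark.sql(masked_view_sql)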

1 More Replies
AndyM
by New Contributor II
  • 10142 Views
  • 2 replies
  • 2 kudos

DAB wheel installation job fails, user error Library from /Workspace not allowed

Hi Community! I am getting started with DABs and just recently ran into the following error after deployment, when trying to run my bundle that has a wheel installation job. Error: failed to reach TERMINATED or SKIPPED, got INTERNAL_ERROR: Task main_task fail...

Latest Reply
BillBishop
New Contributor III
  • 2 kudos

Did you try this in your databricks.yml?
experimental:
  python_wheel_wrapper: true

1 More Replies
