Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ankitmit
by New Contributor III
  • 701 Views
  • 1 reply
  • 0 kudos

How to recreate DLT tables from backup if we lose all the files in case of disaster.

Hi All, We're currently building our new data platform and want to have a backup strategy in place in case of disaster (losing all the data files). We plan to back up our Bronze and Silver layer DLT table data files on AWS and would like to know ...

Latest Reply
Alberto_Umana
Databricks Employee

Hi @ankitmit, You may want to see this article: https://www.databricks.com/blog/2023/03/17/production-ready-and-resilient-disaster-recovery-dlt-pipelines.html
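As a general pattern behind that article, here is a minimal backup sketch using Delta DEEP CLONE; the table names are placeholders, not details from this thread, and restoring DLT-managed tables has extra caveats covered in the blog post.

# Hypothetical sketch: periodically DEEP CLONE Bronze/Silver tables to a
# backup schema. Table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for table in ["bronze.events", "silver.events_clean"]:
    backup_table = "backup." + table.replace(".", "_")
    spark.sql(f"CREATE OR REPLACE TABLE {backup_table} DEEP CLONE {table}")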

VictorFerron
by New Contributor III
  • 3614 Views
  • 4 replies
  • 3 kudos

Resolved! Issue with Maven Dependency Resolution in Databricks (json-smart)

Hi, I'm experiencing issues installing a library on my Databricks cluster due to a Maven dependency resolution error. Specifically, when trying to install or use libraries that depend on net.minidev:json-smart, I get the following error: Library inst...

Latest Reply
MLob
New Contributor II

Thanks @VictorFerron, excluding the json-smart dependency worked for installing the affected library with Maven (azure-eventhubs-spark_2.12:2.3.22 in my case). When you say force a stable version instead, do you mean explicitly install the json-smart library on clu...
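For reference, the same exclusion can also be expressed through the Databricks Libraries API; this is a hedged sketch, with the workspace URL, token, and cluster ID as placeholders.

# Sketch: install a Maven library with a transitive-dependency exclusion
# via the Libraries API. Host, token, and cluster_id are placeholders.
import requests

payload = {
    "cluster_id": "<cluster-id>",
    "libraries": [{
        "maven": {
            "coordinates": "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22",
            "exclusions": ["net.minidev:json-smart"],  # skip the dependency that fails to resolve
        }
    }],
}
resp = requests.post(
    "https://<your-workspace>/api/2.0/libraries/install",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()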

3 More Replies
Scanning
by New Contributor II
  • 710 Views
  • 2 replies
  • 0 kudos

Git Server Proxy won't stay up

We are running a job with multiple tasks that require the Git Server Proxy to remain operational for the entire duration of the job. Since each task may need access to the proxy, and job runtimes vary from brief to several hours, what is the best appr...

Latest Reply
Scanning
New Contributor II

The Git Server Proxy is needed for various tasks within the job, and multiple jobs may rely on it. Manually terminating it at the end of job1 could lead to job2 failing.

1 More Reply
Junda
by New Contributor III
  • 3545 Views
  • 3 replies
  • 0 kudos

How to interpret Spark UI

The images below are the DAG and text execution summary from the Spark UI, and I'm having a hard time interpreting these logs. I have two questions below. 1. In the Text Execution Summary, the Duration total for WholeStageCodegen (2) says 38.3 m (0 ms, 2.7 m, 16.9 m...

Latest Reply
Junda
New Contributor III

Hi @Alberto_Umana and @jay250terry, thank you for your replies. I know that Spark executes tasks in parallel and that the sum of each task's execution time does not correspond to the overall total job duration. What I don't get from the text execution sum...

2 More Replies
mbravonxp
by New Contributor II
  • 6535 Views
  • 3 replies
  • 3 kudos

Resolved! Unity Catalog for medallion architecture

Hello community. I need help defining the most suitable approach for Unity Catalog. I have the following storage architecture in Azure Data Lake Storage. I have data from different clients. I work with 3 different environments for each client: dev, pr...

Latest Reply
mbravonxp
New Contributor II

Hi both, thanks very much for the useful replies. I will definitely go with your suggestions. Best.

2 More Replies
SivaPK
by New Contributor II
  • 3986 Views
  • 1 reply
  • 0 kudos

UTF-8 conversion of pandas DataFrame not reflected in CSV file written to workspace?

Hi, I would like to convert a specific column to UTF-8 for all country languages. After converting it, writing it to the workspace, downloading it to my local system, and opening the Excel file, characters from other countries are still not displayed pr...

Latest Reply
filipniziol
Esteemed Contributor

Hi @SivaPK, It may happen that the CSV is fine, but Excel often does not automatically recognize CSV files as UTF-8 (it might assume ANSI or Windows-1252). This is why characters can appear incorrect. Instead, open the CSV in a text editor like Note...
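A common companion fix is to write the CSV with a UTF-8 byte-order mark so Excel detects the encoding on open; a minimal pandas sketch, with the data and output path as placeholders:

# Sketch: "utf-8-sig" prepends a BOM, which Excel uses to detect UTF-8.
import pandas as pd

df = pd.DataFrame({"country": ["日本", "Česko", "España"]})  # placeholder data
df.to_csv("/tmp/countries.csv", index=False, encoding="utf-8-sig")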

PTalathi
by New Contributor
  • 2127 Views
  • 2 replies
  • 0 kudos

Databricks Alerts query result rows not being sent as part of the email body

I am using a custom template in "Databricks Alerts" with HTML code in it to include the query results in the body of the email. But unfortunately the email body contains only the header specified in the HTML code and not the ro...

Latest Reply
ahetesham
New Contributor II

I tried a lot of combinations, but it seems it doesn't support many HTML tags: no style, no color, no formatting. If anyone is aware of a solution, please share it. Also, the alert sends only 100 rows to the email.

1 More Reply
HoussemBL
by New Contributor III
  • 2440 Views
  • 2 replies
  • 0 kudos

DLT apply_changes() SCD2 not applying the defined schema on the first run

Hello community, I am using the dlt.apply_changes function to implement SCD2. I am specifying the schema of my streaming_table that should result from apply_changes(). This schema contains a generated column. Somehow, my DLT pipeline always returns in the first...

Latest Reply
Alberto_Umana
Databricks Employee

Hello @HoussemBL, Here are a few points to consider: Initialization of Generated Columns: Generated columns, such as is_current, rely on the values of other columns (__END_AT in this case) to be correctly populated. During the first run, if the sequ...
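For context, this is the rough shape of such an SCD2 flow; the names are illustrative (not the poster's pipeline), and the generated column is declared so it derives from the __START_AT/__END_AT columns that DLT populates for SCD type 2 targets.

# Illustrative sketch (runs inside a DLT pipeline): SCD2 target with a
# generated is_current column derived from __END_AT.
import dlt

dlt.create_streaming_table(
    name="customers_scd2",
    schema="""
        customer_id STRING,
        name STRING,
        __START_AT TIMESTAMP,
        __END_AT TIMESTAMP,
        is_current BOOLEAN GENERATED ALWAYS AS (__END_AT IS NULL)
    """,
)

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc",   # placeholder CDC source view/table
    keys=["customer_id"],
    sequence_by="event_ts",
    stored_as_scd_type=2,
)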

1 More Reply
Mukul3012
by New Contributor II
  • 1737 Views
  • 2 replies
  • 0 kudos

DLT pipeline table already exists error

Hi All, I have been facing an issue with a few of my DLT pipelines. Source code:
CREATE OR REFRESH STREAMING TABLE ****
TBLPROPERTIES (
  "enableChangeDataFeed" = "true",
  "delta.autoOptimize.optimizeWrite" = "true"
)
AS SELECT *, _metadata.file_path as fi...

Latest Reply
Alberto_Umana
Databricks Employee

This can happen if the table was not properly dropped or if there is a naming conflict. Before creating or refreshing the table, check if it already exists in the catalog:
SHOW TABLES IN <database_name>;
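The same check can also be done programmatically before the pipeline runs; a small sketch with a placeholder table name:

# Sketch: guard against a naming conflict before (re)creating the table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = "my_db.my_streaming_table"  # placeholder
if spark.catalog.tableExists(table):
    print(f"{table} already exists; drop it or choose a different name")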

1 More Reply
subhas_hati
by New Contributor
  • 543 Views
  • 1 reply
  • 0 kudos

JOIN Two Big Tables, each being some terabytes.

What is the strategy for joining two big tables, each being some terabytes?

Latest Reply
Alberto_Umana
Databricks Employee

Hi @subhas_hati, Enable AQE to dynamically optimize the join strategy at runtime based on the actual data distribution. This can help in choosing the best join strategy automatically. If you are using Delta tables, you can leverage the MERGE statement...
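The AQE switches the reply refers to look like the following; these are the standard Spark 3.x settings, and the table names are placeholders.

# Sketch: enable AQE so Spark can pick the join strategy at runtime
# (e.g. switch to a broadcast join if one side shrinks after filtering).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")  # mitigate skewed join keys
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

big_a = spark.table("db.big_table_a")  # placeholder tables
big_b = spark.table("db.big_table_b")
result = big_a.join(big_b, on="join_key")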

vineet_chaure
by New Contributor
  • 1120 Views
  • 1 reply
  • 0 kudos

Handling Large Integers and None Values in pandas UDFs on Databricks

Hi Everyone, I hope this message finds you well. I am encountering an issue with pandas UDFs on a Databricks shared cluster and would like to seek assistance from the community. Below is a summary of the problem: Description: I am working with pandas UDF...

Latest Reply
Alberto_Umana
Databricks Employee

Hello @vineet_chaure, By default, Spark converts LongType to float64 when transferring data to pandas. You can use the Arrow-optimized pandas UDFs introduced in Apache Spark 3.5. Please try the code below:
import pandas as pd
import pyarrow as pa
from pys...
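Since the reply's code is cut off above, here is a minimal sketch of the underlying idea: use pandas' nullable Int64 dtype inside the UDF so None values survive and large integers are not coerced to float64. The function and column names are illustrative, not from the thread.

# Sketch: "Int64" (capital I) is pandas' nullable integer dtype, so None
# becomes <NA> instead of forcing the whole series to float64.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long")
def add_one(s: pd.Series) -> pd.Series:
    return s.astype("Int64") + 1

df = spark.range(3).withColumn("id_plus_one", add_one("id"))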

TejeshS
by Contributor
  • 2800 Views
  • 2 replies
  • 2 kudos

How to apply Column masking and RLS on Views in Databricks

Hello Databricks Community, We are working on a use case where we need to apply column masking and row-level filtering on top of existing views or while creating new views dynamically. Currently, we know that Delta tables support column masking and row...

Latest Reply
MadhuB
Valued Contributor

@TejeshS You can alternatively use the mask function instead of hardcoding the value 'Masked':
CREATE OR REPLACE VIEW masked_employees AS
SELECT Name, Department,
  CASE WHEN current_user() IN ('ab***@gmail.com', 'xy***@gmail.com')
    THEN Salary
    ELSE mask(...
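A hedged completion of that truncated SQL follows; the employees table and the user list are illustrative, and both CASE branches are cast to STRING so their types match.

# Sketch: reveal salary only to listed users, mask it for everyone else.
# `spark` is the active SparkSession (predefined in Databricks notebooks).
spark.sql("""
    CREATE OR REPLACE VIEW masked_employees AS
    SELECT Name, Department,
        CASE WHEN current_user() IN ('ab***@gmail.com', 'xy***@gmail.com')
             THEN CAST(Salary AS STRING)
             ELSE mask(CAST(Salary AS STRING))
        END AS Salary
    FROM employees
""")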

1 More Reply
AndyM
by New Contributor II
  • 10138 Views
  • 2 replies
  • 2 kudos

DAB wheel installation job fails, user error Library from /Workspace not allowed

Hi Community! I am getting started with DABs and just recently ran into the following error after deployment, trying to run my bundle that has a wheel installation job. Error: failed to reach TERMINATED or SKIPPED, got INTERNAL_ERROR: Task main_task fail...

Latest Reply
BillBishop
New Contributor III

Did you try this in your databricks.yml?
experimental:
  python_wheel_wrapper: true

1 More Reply
Direo
by Contributor II
  • 6253 Views
  • 2 replies
  • 7 kudos

Performance Issue with UC Read from Federated SQL Table vs JDBC Read from SQL Server

Hi everyone, I'm currently facing a significant performance issue when comparing the execution times of a query sent through JDBC versus a similar query executed through Databricks SQL (using Unity Catalog to access a federated SQL table). JDBC Query: j...

Labels: Data Engineering, Federated queries, JDBC, performance issue, Unity Catalog
Latest Reply
pdiamond
Contributor

I've found the JDBC query to be faster than the federated query because, in our testing, the federated query does not push the full query down to the source database. Instead, it runs "select * from table", pulling all of the data into Databricks...
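One way to confirm the difference is to push the full query down explicitly over JDBC using the "query" option; the connection details below are placeholders.

# Sketch: the "query" option sends the whole statement to SQL Server,
# so only the aggregated result crosses the wire.
# `spark` is the active SparkSession (predefined in Databricks notebooks).
pushed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("query", "SELECT col1, COUNT(*) AS n FROM dbo.big_table GROUP BY col1")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)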

1 More Reply
juchom
by New Contributor II
  • 1030 Views
  • 4 replies
  • 0 kudos

Error when creating a compute resource with runtime 16.1

Hello, is there any restriction on Community Edition runtimes? When I try to create a compute resource with runtime 16.1, it takes a long time and then always ends with a failure like the following screenshot. Thanks for your help,

[screenshot attached: juchom_0-1739270362738.png]
Latest Reply
juchom
New Contributor II

"Thanks for the details, are you using any type of container (Docker) image? Or any special setting on the cluster?" Nothing special, I just click the create compute button, then select 16.1 from the dropdown, and click the create compute button again. If you ...

3 More Replies
