Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Zeruno
by New Contributor II
  • 1330 Views
  • 1 reply
  • 1 kudos

How to use DLT Expectations for uniqueness checks on a dataset?

I am using dlt through Python to build a DLT pipeline. One of the things I would like to do is check that each incoming row does not exist in the target table; I want to be sure that each row is unique. I am confused because it seems like this is not p...

Latest Reply
Mauro
Databricks Partner
  • 1 kudos

I have the same doubt about the implementation of the uniqueness rule.

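A point worth noting for this thread: DLT expectations validate one row at a time, so uniqueness across rows cannot be expressed as an expectation directly. A common pattern is to deduplicate in the query itself and reserve the expectation for row-level checks. The sketch below is a pipeline configuration fragment that only runs inside a DLT pipeline (where `spark` is predefined); the table name `source_table` and key column `id` are hypothetical.

```python
import dlt

# Sketch only: expectations are row-level, so deduplicate in the query
# and use the expectation for a row-level check (non-null key).
# "source_table" and "id" are hypothetical names.
@dlt.table(name="silver_unique")
@dlt.expect_or_drop("id_not_null", "id IS NOT NULL")
def silver_unique():
    return (
        spark.readStream.table("source_table")
             .dropDuplicates(["id"])   # dedups within each micro-batch
    )
```

For deduplicating against rows already in the target table, `APPLY CHANGES INTO` (or a `MERGE`) is the usual tool rather than an expectation.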
Tiwarisk
by New Contributor III
  • 2485 Views
  • 6 replies
  • 0 kudos

Dynamic IP address in databricks

Every time I run a script in Databricks that fetches data from a SQL server (in a different Azure resource group), I get this error: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot open server 'proddatabase' requested by the login. Clie...

Latest Reply
ameet9257
Contributor
  • 0 kudos

@Tiwarisk, if your Databricks workspace is inside the secure VNet, then whitelist the private VNet address range.

5 More Replies
genevive_mdonça
by Databricks Employee
  • 5064 Views
  • 4 replies
  • 4 kudos

Spark Optimization

Optimizing Shuffle Partition Size in Spark for Large Joins: I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following: - The average shuffle write partition size for th...

Latest Reply
Lakshay
Databricks Employee
  • 4 kudos

Have you tried using spark.sql.files.maxPartitionBytes=209715200 (200 MB)?

3 More Replies
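For readers landing on this thread: the usual back-of-the-envelope sizing is total shuffle bytes divided by a target partition size. The 300 GB figure comes from the question; the ~200 MB-per-partition target is a common rule of thumb, not an official number.

```python
# Back-of-the-envelope sizing for spark.sql.shuffle.partitions.
def suggested_shuffle_partitions(total_shuffle_bytes: int,
                                 target_partition_bytes: int = 200 * 1024**2) -> int:
    # Round up so no partition exceeds the target size.
    return -(-total_shuffle_bytes // target_partition_bytes)

total = 300 * 1024**3  # ~300 GB of shuffle write, as in the question
print(suggested_shuffle_partitions(total))  # 1536 partitions of ~200 MB
```

With adaptive query execution (`spark.sql.adaptive.enabled`) Spark can coalesce small shuffle partitions automatically, so manual tuning like this mostly matters when AQE is off or the skew is extreme.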
guangyi
by Contributor III
  • 2952 Views
  • 4 replies
  • 0 kudos

Resolved! What is the correct way to measure the performance of a Databrick notebook?

Here is my code for converting one column field of a data frame to a time data type:
col_value = df.select(df.columns[0]).first()[0]
start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
    if isinstance(co...

Latest Reply
Lakshay
Databricks Employee
  • 0 kudos

How many columns do you have?

3 More Replies
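A general note for measuring notebook performance, as asked above: take the best of several runs to reduce noise, and remember that Spark transformations are lazy, so you must time an action (e.g. `.count()` or `.collect()`), not the transformation itself, or you measure almost nothing. A minimal stdlib sketch:

```python
import time

# Minimal timing harness: run the operation several times and keep the
# best of N to reduce noise from JIT warm-up, caching, and scheduling.
def best_of(n: int, fn, *args, **kwargs) -> float:
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

# Example with a plain Python workload; in a notebook you would pass a
# function that triggers a Spark action.
elapsed = best_of(5, sum, range(100_000))
print(f"{elapsed:.6f}s")
```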
Vetrivel
by Databricks Partner
  • 5189 Views
  • 7 replies
  • 2 kudos

Connection Challenges with Azure Databricks and SQL Server On VM in Serverless compute

We have established an Azure Databricks workspace within our central subscription, which hosts all common platform resources. Additionally, we have a SQL Server running on a virtual machine in a separate sandbox subscription, containing data that nee...

Latest Reply
Vetrivel
Databricks Partner
  • 2 kudos

@Mo I have tried it and got the below error: Private access to resource type 'Microsoft.Compute/virtualMachines' is not supported with group id 'sqlserver'. I suppose it is supported only when the destinations are Blob, ADLS, or Azure SQL.

6 More Replies
Erik_L
by Contributor II
  • 7626 Views
  • 4 replies
  • 4 kudos

Resolved! Support for Parquet brotli compression or a work around

Spark 3.3.1 supports the brotli compression codec, but when I use it to read parquet files from S3, I get: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI. Example code: df = (spark.read.format("parquet") .option("compression", "brotli")...

Latest Reply
Erik_L
Contributor II
  • 4 kudos

Given the new information I appended, I looked into the Delta caching and found I can disable it: .option("spark.databricks.io.cache.enabled", False). This works as a workaround while I read these files in to save them locally in DBFS, but does it have perfo...

3 More Replies
Mystagon
by New Contributor III
  • 5507 Views
  • 4 replies
  • 3 kudos

Performance Issues with Unity Catalog

Hey, I need some help / suggestions troubleshooting this. I have two Databricks workspaces, Common and Lakehouse. The major differences between them are: - Lakehouse is using Unity Catalog - Lakehouse is using External Locations, whereas cre...

Latest Reply
arjun_kr
Databricks Employee
  • 3 kudos

"Listing directories in Common is at least 4-8 times faster than in the Lakehouse environment." Are you able to replicate the issue using a simple dbutils list operation (dbutils.fs.ls), or by performing a sample file copy (say, a 100 MB file) using dbutils.f...

3 More Replies
MikeGo
by Contributor II
  • 6025 Views
  • 2 replies
  • 1 kudos

Resolved! How to disable all cache

Hi, I'm trying to test some SQL performance. I run the below first: spark.conf.set('spark.databricks.io.cache.enabled', False). However, the 2nd run of the same query is still way faster than the first run. Is there a way to make the query start from a clean...

Latest Reply
MikeGo
Contributor II
  • 1 kudos

Thanks @VZLA. How do I run spark.sparkContext.getPersistentRDDs.values.foreach(_.unpersist()) from a Databricks notebook?

1 More Replies
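For this thread, the Python-side equivalent of the Scala snippet above is `spark.catalog.clearCache()`. The fragment below is a configuration sketch that assumes a Databricks notebook where `spark` is predefined; it is not runnable outside that environment.

```python
# Configuration sketch for getting closer to a cold-cache benchmark run.
spark.conf.set("spark.databricks.io.cache.enabled", "false")  # disk (IO) cache off
spark.catalog.clearCache()  # drop all cached DataFrames/tables for this session
```

Note that even with both caches cleared, the OS page cache and cloud-storage-side caching can still make a second run faster; a truly cold run usually requires a fresh cluster.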
farbodr
by New Contributor II
  • 6857 Views
  • 5 replies
  • 1 kudos

Shapley Progressbar

The Shapley progress bar, or the tqdm progress bar in general, doesn't show in notebooks. Do I need to set something special to get this or any other similar widgets to work?

Latest Reply
richk7
New Contributor II
  • 1 kudos

I think you're looking for tqdm.notebook:

from time import sleep
from tqdm.notebook import tqdm

for _ in tqdm(range(20)):
    sleep(5)

4 More Replies
JacobLi_LN
by New Contributor II
  • 4923 Views
  • 1 reply
  • 1 kudos

Resolved! Where can I find those delta table log files?

I created a Delta table with the SQL command CREATE TABLE and inserted several records into it with INSERT statements. It can now be seen in the catalog. But I want to understand how Delta works, and I would like to see where those log files are stored. Even...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

To locate the log files for your Delta table, note that Delta Lake stores its transaction log files in a specific directory within the table's storage location. These log files maintain the ACID properties and enable feature...

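To make the reply above concrete: the transaction log lives under `<table-location>/_delta_log/` as zero-padded JSON files (`00000000000000000000.json`, ...), where each line is one action. On Databricks, `DESCRIBE DETAIL <table>` returns the storage location and `dbutils.fs.ls(f"{location}/_delta_log")` lists the files. The sample line below is hand-written to show the shape of an `add` action, not copied from a real table.

```python
import json

# One line of a Delta commit file, parsed with the stdlib.
# This sample is illustrative; real entries carry more fields
# (partitionValues, modificationTime, stats, ...).
sample_line = (
    '{"add": {"path": "part-00000.snappy.parquet",'
    ' "size": 1024, "dataChange": true}}'
)

action = json.loads(sample_line)
print(list(action))           # ['add']
print(action["add"]["path"])  # part-00000.snappy.parquet
```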
Terraformuser
by New Contributor
  • 1755 Views
  • 1 reply
  • 0 kudos

Azure Databricks - Terraform errors while using workspace level provider

Hello all, I have a question about deploying Azure Databricks with Terraform. Does Databricks have any API call limits? I can deploy an external location and a storage credential, and it's tested and confirmed working. But when I try to deploy 2 additional e...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @Terraformuser, could you try enabling debug while applying your Terraform? That would give you more context on the failure: TF_LOG=DEBUG DATABRICKS_DEBUG_TRUNCATE_BYTES=250000 terraform apply -no-color 2>&1 | tee tf-debug.log

TamD
by Contributor
  • 2650 Views
  • 1 reply
  • 1 kudos

TIME data type

Our business does a LOT of reporting and analysis by time-of-day and clock times, independent of day or date. Databricks does not seem to support the TIME data type, as far as I can see. If I attempt to import data recorded as a time (e.g., 02:59:59.000)...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @TamD, basically it's just as you've written. There is no TIME data type, so you have the 2 options you already mentioned: - use the TIMESTAMP data type and ignore its date part - store it as a string and convert it each time you need it

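The "timestamp with a dummy date" workaround from the reply can be sketched in plain Python: parse a clock time such as `02:59:59.000` and pin it to an arbitrary epoch date so it round-trips through a TIMESTAMP column. The epoch-date choice is an assumption; any fixed date works as long as it is used consistently.

```python
from datetime import datetime

# Parse a clock-time string and attach a fixed dummy date (1970-01-01
# here) so the value fits a TIMESTAMP column; ignore the date part on
# the way back out.
def clock_to_timestamp(value: str) -> datetime:
    t = datetime.strptime(value, "%H:%M:%S.%f").time()
    return datetime.combine(datetime(1970, 1, 1).date(), t)

ts = clock_to_timestamp("02:59:59.000")
print(ts.time())  # 02:59:59
```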
Phani1
by Databricks MVP
  • 2881 Views
  • 2 replies
  • 0 kudos

Code Review tools

Could you kindly recommend any Code Review tools that would be suitable for our Databricks tech stack?

Data Engineering
code review
Latest Reply
Phani1
Databricks MVP
  • 0 kudos

You can explore SonarQube.

1 More Replies
TinasheChinyati
by New Contributor III
  • 4055 Views
  • 3 replies
  • 1 kudos

Resolved! Retention window from DLT created Delta tables

Hi guys, I am working with data ingested from Azure EventHub using Delta Live Tables in Databricks. Our data architecture follows the medallion approach. Our current requirement is to retain only the most recent 14 days of data in the silver layer. To...

Data Engineering
data engineer
Delta Live Tables
Latest Reply
TinasheChinyati
New Contributor III
  • 1 kudos

Hi @MuthuLakshmi, thank you for sharing the configurations. Here is a bit more clarity on our current workflow. DELETE and VACUUM workflow - our workflow involves the following: 1. DELETE operation: we delete records matching a specific predicate to mark th...

2 More Replies
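The "retain the most recent 14 days" requirement from this thread boils down to computing one cutoff and keeping the DELETE predicate and the VACUUM horizon in agreement. The sketch below uses only the stdlib for the date arithmetic; the table and column names in the commented SQL are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Cutoff for "keep only the most recent N days" (N = 14 per the thread).
def retention_cutoff(now: datetime, days: int = 14) -> datetime:
    return now - timedelta(days=days)

now = datetime(2024, 11, 20, tzinfo=timezone.utc)
cutoff = retention_cutoff(now)
print(cutoff.date())  # 2024-11-06

# On Databricks the matching pair of statements would look like
# (silver_table / event_ts are placeholder names):
#   DELETE FROM silver_table WHERE event_ts < '<cutoff>';
#   VACUUM silver_table RETAIN 336 HOURS;   -- 14 days * 24
```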
sathya08
by New Contributor III
  • 5775 Views
  • 9 replies
  • 4 kudos

Resolved! Trigger queries to SQL warehouse from Databricks notebook

Hello, I am trying to explore triggering SQL queries from a Databricks notebook to a serverless SQL warehouse, along with the nest-asyncio module. Both of the above are very new to me, and I need help with them. For triggering the API from the notebook, I am using...

Latest Reply
sathya08
New Contributor III
  • 4 kudos

Thank you, it really helped.

8 More Replies
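For context on the nest-asyncio part of this thread: notebooks typically already run an event loop, which is why `nest_asyncio.apply()` (a third-party module) is needed before calling `asyncio.run()` there. The stdlib sketch below shows only the concurrency shape of "fire several queries at once"; the actual SQL-warehouse HTTP call is simulated with `asyncio.sleep`.

```python
import asyncio

# Minimal shape of firing several queries concurrently with asyncio.
async def run_query(name: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for the HTTP call to the warehouse
    return f"{name}: done"

async def main() -> list:
    # gather preserves submission order in its result list.
    return await asyncio.gather(*(run_query(f"q{i}") for i in range(3)))

results = asyncio.run(main())
print(results)  # ['q0: done', 'q1: done', 'q2: done']
```

In a notebook, the `asyncio.run(main())` line is the one that requires `nest_asyncio.apply()` first, since a loop is already running.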