Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Zeruno
by New Contributor II
  • 1330 Views
  • 1 reply
  • 1 kudos

How to use DLT Expectations for uniqueness checks on a dataset?

I am using dlt through Python to build a DLT pipeline. One of the things I would like to do is check that each incoming row does not exist in the target table; I want to be sure that each row is unique. I am confused because it seems like this is not p...

Latest Reply
Mauro
Databricks Partner
  • 1 kudos

I have the same doubt about the implementation of the uniqueness rule.

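A point worth noting for this thread: DLT expectations validate one row at a time, so uniqueness across rows cannot be expressed as an expectation directly. A common pattern is to deduplicate in the query itself and reserve the expectation for row-level checks. The sketch below is a pipeline configuration fragment that only runs inside a DLT pipeline (where `spark` is predefined); the table name `source_table` and key column `id` are hypothetical.

```python
import dlt

# Sketch only: expectations are row-level, so deduplicate in the query
# and use the expectation for a row-level check (non-null key).
# "source_table" and "id" are hypothetical names.
@dlt.table(name="silver_unique")
@dlt.expect_or_drop("id_not_null", "id IS NOT NULL")
def silver_unique():
    return (
        spark.readStream.table("source_table")
             .dropDuplicates(["id"])   # dedups within each micro-batch
    )
```

For deduplicating against rows already in the target table, `APPLY CHANGES INTO` (or a `MERGE`) is the usual tool rather than an expectation.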
Tiwarisk
by New Contributor III
  • 2485 Views
  • 6 replies
  • 0 kudos

Dynamic IP address in databricks

Every time I run a script in Databricks that fetches data from a SQL server (in a different Azure resource group), I get this error: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot open server 'proddatabase' requested by the login. Clie...

Latest Reply
ameet9257
Contributor
  • 0 kudos

@Tiwarisk, if your Databricks workspace is inside the secure VNet, then whitelist the private VNet address range.

5 More Replies
genevive_mdonça
by Databricks Employee
  • 5064 Views
  • 4 replies
  • 4 kudos

Spark Optimization

Optimizing Shuffle Partition Size in Spark for Large Joins: I am working on a Spark join between two tables of sizes 300 GB and 5 GB, respectively. After analyzing the Spark UI, I noticed the following: - The average shuffle write partition size for th...

Latest Reply
Lakshay
Databricks Employee
  • 4 kudos

Have you tried using spark.sql.files.maxPartitionBytes=209715200 (200 MB)?

3 More Replies
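For readers landing on this thread: the usual back-of-the-envelope sizing is total shuffle bytes divided by a target partition size. The 300 GB figure comes from the question; the ~200 MB-per-partition target is a common rule of thumb, not an official number.

```python
# Back-of-the-envelope sizing for spark.sql.shuffle.partitions.
def suggested_shuffle_partitions(total_shuffle_bytes: int,
                                 target_partition_bytes: int = 200 * 1024**2) -> int:
    # Round up so no partition exceeds the target size.
    return -(-total_shuffle_bytes // target_partition_bytes)

total = 300 * 1024**3  # ~300 GB of shuffle write, as in the question
print(suggested_shuffle_partitions(total))  # 1536 partitions of ~200 MB
```

With adaptive query execution (`spark.sql.adaptive.enabled`) Spark can coalesce small shuffle partitions automatically, so manual tuning like this mostly matters when AQE is off or the skew is extreme.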
guangyi
by Contributor III
  • 2952 Views
  • 4 replies
  • 0 kudos

Resolved! What is the correct way to measure the performance of a Databrick notebook?

Here is my code for converting one column field of a data frame to a time data type:
col_value = df.select(df.columns[0]).first()[0]
start_time = time.time()
col_value = datetime.strftime(col_value, "%Y-%m-%d %H:%M:%S") \
    if isinstance(co...

Latest Reply
Lakshay
Databricks Employee
  • 0 kudos

How many columns do you have?

3 More Replies
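A general note for measuring notebook performance, as asked above: take the best of several runs to reduce noise, and remember that Spark transformations are lazy, so you must time an action (e.g. `.count()` or `.collect()`), not the transformation itself, or you measure almost nothing. A minimal stdlib sketch:

```python
import time

# Minimal timing harness: run the operation several times and keep the
# best of N to reduce noise from JIT warm-up, caching, and scheduling.
def best_of(n: int, fn, *args, **kwargs) -> float:
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

# Example with a plain Python workload; in a notebook you would pass a
# function that triggers a Spark action.
elapsed = best_of(5, sum, range(100_000))
print(f"{elapsed:.6f}s")
```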
Vetrivel
by Databricks Partner
  • 5189 Views
  • 7 replies
  • 2 kudos

Connection Challenges with Azure Databricks and SQL Server On VM in Serverless compute

We have established an Azure Databricks workspace within our central subscription, which hosts all common platform resources. Additionally, we have a SQL Server running on a virtual machine in a separate sandbox subscription, containing data that nee...

Latest Reply
Vetrivel
Databricks Partner
  • 2 kudos

@Mo I have tried it and got the below error: Private access to resource type 'Microsoft.Compute/virtualMachines' is not supported with group id 'sqlserver'. I suppose it is supported only when the destinations are Blob, ADLS, or Azure SQL.

6 More Replies
Erik_L
by Contributor II
  • 7626 Views
  • 4 replies
  • 4 kudos

Resolved! Support for Parquet brotli compression or a work around

Spark 3.3.1 supports the brotli compression codec, but when I use it to read parquet files from S3, I get: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI. Example code: df = (spark.read.format("parquet") .option("compression", "brotli")...

Latest Reply
Erik_L
Contributor II
  • 4 kudos

Given the new information I appended, I looked into the Delta caching and found I can disable it: .option("spark.databricks.io.cache.enabled", False). This works as a workaround while I read these files in to save them locally in DBFS, but does it have perfo...

3 More Replies
Mystagon
by New Contributor III
  • 5507 Views
  • 4 replies
  • 3 kudos

Performance Issues with Unity Catalog

Hey, I need some help / suggestions troubleshooting this. I have two Databricks workspaces, Common and Lakehouse. The major differences between them are: - Lakehouse is using Unity Catalog - Lakehouse is using External Locations, whereas cre...

Latest Reply
arjun_kr
Databricks Employee
  • 3 kudos

"Listing directories in Common is at least 4-8 times faster than in the Lakehouse environment." Are you able to replicate the issue using a simple dbutils list operation (dbutils.fs.ls), or by performing a sample file copy (say, a 100 MB file) using dbutils.f...

3 More Replies
MikeGo
by Contributor II
  • 6025 Views
  • 2 replies
  • 1 kudos

Resolved! How to disable all cache

Hi, I'm trying to test some SQL performance. I run the below first: spark.conf.set('spark.databricks.io.cache.enabled', False). However, the 2nd run of the same query is still way faster than the first run. Is there a way to make the query start from a clean...

Latest Reply
MikeGo
Contributor II
  • 1 kudos

Thanks @VZLA. How do I run spark.sparkContext.getPersistentRDDs.values.foreach(_.unpersist()) from a Databricks notebook?

1 More Replies
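For this thread, the Python-side equivalent of the Scala snippet above is `spark.catalog.clearCache()`. The fragment below is a configuration sketch that assumes a Databricks notebook where `spark` is predefined; it is not runnable outside that environment.

```python
# Configuration sketch for getting closer to a cold-cache benchmark run.
spark.conf.set("spark.databricks.io.cache.enabled", "false")  # disk (IO) cache off
spark.catalog.clearCache()  # drop all cached DataFrames/tables for this session
```

Note that even with both caches cleared, the OS page cache and cloud-storage-side caching can still make a second run faster; a truly cold run usually requires a fresh cluster.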
farbodr
by New Contributor II
  • 6857 Views
  • 5 replies
  • 1 kudos

Shapley Progressbar

The Shapley progress bar, or the tqdm progress bar in general, doesn't show in notebooks. Do I need to set something special to get this or any other similar widgets to work?

Latest Reply
richk7
New Contributor II
  • 1 kudos

I think you're looking for tqdm.notebook:

from time import sleep
from tqdm.notebook import tqdm

for _ in tqdm(range(20)):
    sleep(5)

4 More Replies
JacobLi_LN
by New Contributor II
  • 4923 Views
  • 1 reply
  • 1 kudos

Resolved! Where can I find those delta table log files?

I created a Delta table with the SQL command CREATE TABLE and inserted several records into it with INSERT statements. It can now be seen in the catalog. But I want to understand how Delta works, and I would like to see where those log files are stored. Even...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

To locate the log files for your Delta table, note that Delta Lake stores its transaction log files in a specific directory within the table's storage location. These log files maintain the ACID properties and enable feature...

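To make the reply above concrete: the transaction log lives under `<table-location>/_delta_log/` as zero-padded JSON files (`00000000000000000000.json`, ...), where each line is one action. On Databricks, `DESCRIBE DETAIL <table>` returns the storage location and `dbutils.fs.ls(f"{location}/_delta_log")` lists the files. The sample line below is hand-written to show the shape of an `add` action, not copied from a real table.

```python
import json

# One line of a Delta commit file, parsed with the stdlib.
# This sample is illustrative; real entries carry more fields
# (partitionValues, modificationTime, stats, ...).
sample_line = (
    '{"add": {"path": "part-00000.snappy.parquet",'
    ' "size": 1024, "dataChange": true}}'
)

action = json.loads(sample_line)
print(list(action))           # ['add']
print(action["add"]["path"])  # part-00000.snappy.parquet
```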
Terraformuser
by New Contributor
  • 1755 Views
  • 1 reply
  • 0 kudos

Azure Databricks - Terraform errors while using workspace level provider

Hello all, I have a question about deploying Azure Databricks with Terraform. Does Databricks have any API call limits? I can deploy an external location and a storage credential, and it's tested and confirmed working. But when I try to deploy 2 additional e...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @Terraformuser, could you try enabling debug while applying your Terraform? That would give you more context on the failure: TF_LOG=DEBUG DATABRICKS_DEBUG_TRUNCATE_BYTES=250000 terraform apply -no-color 2>&1 | tee tf-debug.log

TamD
by Contributor
  • 2650 Views
  • 1 reply
  • 1 kudos

TIME data type

Our business does a LOT of reporting and analysis by time-of-day and clock times, independent of day or date. Databricks does not seem to support the TIME data type, as far as I can see. If I attempt to import data recorded as a time (e.g., 02:59:59.000)...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @TamD, basically it's just as you've written. There is no TIME data type, so you have the 2 options you already mentioned: - use the TIMESTAMP data type and ignore its date part - store it as a string and convert it each time you need it

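The "timestamp with a dummy date" workaround from the reply can be sketched in plain Python: parse a clock time such as `02:59:59.000` and pin it to an arbitrary epoch date so it round-trips through a TIMESTAMP column. The epoch-date choice is an assumption; any fixed date works as long as it is used consistently.

```python
from datetime import datetime

# Parse a clock-time string and attach a fixed dummy date (1970-01-01
# here) so the value fits a TIMESTAMP column; ignore the date part on
# the way back out.
def clock_to_timestamp(value: str) -> datetime:
    t = datetime.strptime(value, "%H:%M:%S.%f").time()
    return datetime.combine(datetime(1970, 1, 1).date(), t)

ts = clock_to_timestamp("02:59:59.000")
print(ts.time())  # 02:59:59
```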
Phani1
by Databricks MVP
  • 2881 Views
  • 2 replies
  • 0 kudos

Code Review tools

Could you kindly recommend any Code Review tools that would be suitable for our Databricks tech stack?

Data Engineering
code review
Latest Reply
Phani1
Databricks MVP
  • 0 kudos

You can explore SonarQube.

1 More Replies
TinasheChinyati
by New Contributor III
  • 4055 Views
  • 3 replies
  • 1 kudos

Resolved! Retention window from DLT created Delta tables

Hi guys, I am working with data ingested from Azure EventHub using Delta Live Tables in Databricks. Our data architecture follows the medallion approach. Our current requirement is to retain only the most recent 14 days of data in the silver layer. To...

Data Engineering
data engineer
Delta Live Tables
Latest Reply
TinasheChinyati
New Contributor III
  • 1 kudos

Hi @MuthuLakshmi, thank you for sharing the configurations. Here is a bit more clarity on our current workflow. DELETE and VACUUM workflow - our workflow involves the following: 1. DELETE operation: we delete records matching a specific predicate to mark th...

2 More Replies
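The "retain the most recent 14 days" requirement from this thread boils down to computing one cutoff and keeping the DELETE predicate and the VACUUM horizon in agreement. The sketch below uses only the stdlib for the date arithmetic; the table and column names in the commented SQL are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Cutoff for "keep only the most recent N days" (N = 14 per the thread).
def retention_cutoff(now: datetime, days: int = 14) -> datetime:
    return now - timedelta(days=days)

now = datetime(2024, 11, 20, tzinfo=timezone.utc)
cutoff = retention_cutoff(now)
print(cutoff.date())  # 2024-11-06

# On Databricks the matching pair of statements would look like
# (silver_table / event_ts are placeholder names):
#   DELETE FROM silver_table WHERE event_ts < '<cutoff>';
#   VACUUM silver_table RETAIN 336 HOURS;   -- 14 days * 24
```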
sathya08
by New Contributor III
  • 5775 Views
  • 9 replies
  • 4 kudos

Resolved! Trigger queries to SQL warehouse from Databricks notebook

Hello, I am trying to explore triggering SQL queries from a Databricks notebook to a serverless SQL warehouse, along with the nest-asyncio module. Both of the above are very new to me, and I need help with them. For triggering the API from the notebook, I am using...

Latest Reply
sathya08
New Contributor III
  • 4 kudos

Thank you, it really helped.

8 More Replies
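For context on the nest-asyncio part of this thread: notebooks typically already run an event loop, which is why `nest_asyncio.apply()` (a third-party module) is needed before calling `asyncio.run()` there. The stdlib sketch below shows only the concurrency shape of "fire several queries at once"; the actual SQL-warehouse HTTP call is simulated with `asyncio.sleep`.

```python
import asyncio

# Minimal shape of firing several queries concurrently with asyncio.
async def run_query(name: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for the HTTP call to the warehouse
    return f"{name}: done"

async def main() -> list:
    # gather preserves submission order in its result list.
    return await asyncio.gather(*(run_query(f"q{i}") for i in range(3)))

results = asyncio.run(main())
print(results)  # ['q0: done', 'q1: done', 'q2: done']
```

In a notebook, the `asyncio.run(main())` line is the one that requires `nest_asyncio.apply()` first, since a loop is already running.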