Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Dhruv-22
by Contributor II
  • 360 Views
  • 2 replies
  • 1 kudos

Resolved! Can't mergeSchema handle int and bigint?

I have a table that has a column of data type 'bigint'. While overwriting it with new data, given that I do full loads, I used 'mergeSchema' to handle schema changes. The new data's datatype was int. I thought mergeSchema could easily handle that, but...

Latest Reply
Chiran-Gajula
New Contributor III
  • 1 kudos

Hi Dhruv, Delta won't automatically upcast unless you explicitly handle it. Cast the column Lob_Pk to LongType (which maps to BIGINT in SQL/Delta). Try the snippet below: from pyspark.sql.functions import col; from pyspark.sql.types import LongType; crm_retail_...
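
For completeness, a minimal sketch of the cast-before-overwrite approach described above, assuming the DataFrame and column names from the truncated snippet (crm_retail_df, Lob_Pk) and a hypothetical source and target table:

from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# Hypothetical source read; the thread's actual DataFrame definition is truncated
crm_retail_df = spark.read.table("staging.crm_retail")

# Upcast the INT column to LongType so it matches the target's BIGINT column
crm_retail_df = crm_retail_df.withColumn("Lob_Pk", col("Lob_Pk").cast(LongType()))

# Full-load overwrite of the target table
(crm_retail_df.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .saveAsTable("main.sales.crm_retail"))  # hypothetical target name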

1 More Replies
Marthinus
by New Contributor III
  • 294 Views
  • 4 replies
  • 2 kudos

Resolved! [Databricks Asset Bundles] Bug: driver_node_type_id not updated

Working with Databricks Asset Bundles (using the new Python-based definition): if you have a job_cluster defined with driver_node_type_id and later update it so that only node_type_id is defined, the driver node type never gets update...

Latest Reply
Chiran-Gajula
New Contributor III
  • 2 kudos

There is no built-in way in Databricks Asset Bundles or Terraform to automatically inherit the value of node_type_id for driver_node_type_id: "You must set both explicitly in your configuration." You can always see your updated resource detail from the ...

3 More Replies
Dhruv-22
by Contributor II
  • 1891 Views
  • 2 replies
  • 0 kudos

Resolved! Understanding least common type in databricks

I was reading the data type rules and came across the least common type. I have a question: what is the least common type of STRING and INT? The referenced link gives the following example, saying the least common type is BIGINT. -- The least common type between...

Latest Reply
Dhruv-22
Contributor II
  • 0 kudos

The question is solved here - link

1 More Replies
Dhruv-22
by Contributor II
  • 340 Views
  • 4 replies
  • 4 kudos

Resolved! Least Common Type is different in Serverless and All Purpose Cluster.

The following statement gives different outputs on different compute. On Databricks Runtime 15.4 LTS, %sql SELECT typeof(coalesce(5, '6')); returns string. On serverless (environment version 4), the same query returns bigint. There are other cas...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 4 kudos

@Dhruv-22 Regarding your 1st question, I'm not sure. You can refer to https://docs.databricks.com/aws/en/sql/language-manual/parameters/ansi_mode#system-default to understand what happens when ANSI mode is disabled.
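
A quick way to compare the two environments is to check the ANSI flag next to the query from the post (the config key is standard Spark; whether it differs between your clusters is the thing to verify):

# Run on each compute and compare the flag and the resulting type
print(spark.conf.get("spark.sql.ansi.enabled"))
print(spark.sql("SELECT typeof(coalesce(5, '6')) AS t").first()["t"])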

3 More Replies
anusha98
by New Contributor II
  • 256 Views
  • 2 replies
  • 3 kudos

Resolved! Regarding: How to use Row_number() in dlt pipelines

We have two streaming tables, customer_info and customer_info_history, which we joined using a full join to create a temp table in PySpark. Now we want to remove the duplicate records from this temp table. Tried using row_number() but facing bel...

Latest Reply
K_Anudeep
Databricks Employee
  • 3 kudos

Hello @anusha98 , You're hitting a real limitation of Structured Streaming: non-time window functions (like row_number() over (...)) aren't allowed on streaming DataFrames. You need to use agg().max() to get the "latest value per key": @dlt.table(name="temp_...
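
A rough sketch of that pattern in a DLT pipeline; the table, key, and timestamp column names below are placeholders (the original snippet is truncated), and depending on the pipeline you may also need the watermark shown:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="temp_customer_dedup")  # placeholder; the reply's table name is cut off at "temp_..."
def temp_customer_dedup():
    src = (dlt.read_stream("customer_info")
           .withWatermark("updated_at", "1 hour"))            # placeholder timestamp column
    # row_number() over a window is not supported on streaming DataFrames,
    # so keep the latest row per key with a grouped max over a struct instead
    return (src.groupBy("customer_id")                         # placeholder key column
               .agg(F.max(F.struct("updated_at", "customer_name")).alias("latest"))
               .select("customer_id", "latest.*"))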

1 More Replies
AmarKap
by New Contributor
  • 151 Views
  • 1 replies
  • 1 kudos

Lakeflow Pipelines: reading an accented file with spark.readStream fails

Trying to read an accented file (French characters), but the spark.readStream function is not working and special characters turn into something strange (e.g. �): spark.readStream .format("cloudfiles") .option("cloudFiles....

Latest Reply
K_Anudeep
Databricks Employee
  • 1 kudos

Hello @AmarKap , When Spark decodes CP1252 bytes as UTF-8/ISO-8859-1, you'll see the replacement character �. Can you read the file as: df = (spark.readStream.format("cloudFiles").option("cloudFiles.format", "text").option("encoding", "windows-1252")...
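
Filling out that truncated snippet as a complete Auto Loader read (the paths and schema location are placeholders; the relevant options are the ones from the reply):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "text")
      .option("encoding", "windows-1252")   # decode CP1252 so the French accents survive
      .option("cloudFiles.schemaLocation", "/Volumes/main/default/checkpoints/schema")  # placeholder
      .load("/Volumes/main/default/raw/french_files/"))                                 # placeholder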

EndreM
by New Contributor III
  • 2251 Views
  • 1 replies
  • 1 kudos

Replay stream to migrate to liquid cluster

The documentation is sparse about how to migrate a partitioned table to liquid clustering, as the ALTER TABLE suggested in the documentation doesn't work when it's a partitioned table. The comments on this forum suggest replaying the stream, and this is wha...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @EndreM , I did some digging internally and I have come up with some helpful tips/tricks to help guide you through this issue: Based on your situation, you're encountering several common challenges when migrating a partitioned table to liqu...
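
For reference, one commonly used fallback when ALTER TABLE ... CLUSTER BY is rejected on a partitioned table is to rebuild it as a new liquid-clustered table and swap names. This is an illustrative sketch, not necessarily what the truncated reply goes on to recommend; all table names and clustering keys are placeholders:

# Rebuild the partitioned table into a liquid-clustered copy, then swap names
spark.sql("""
  CREATE OR REPLACE TABLE main.sales.events_clustered
  CLUSTER BY (event_date, customer_id)      -- choose your clustering keys
  AS SELECT * FROM main.sales.events        -- existing partitioned table
""")
spark.sql("ALTER TABLE main.sales.events RENAME TO main.sales.events_old")
spark.sql("ALTER TABLE main.sales.events_clustered RENAME TO main.sales.events")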

soumiknow
by Contributor II
  • 2369 Views
  • 1 replies
  • 1 kudos

Unable to create databricks group and add permission via terraform

I have the following terraform code to create a databricks group and add permission to a workflow: resource "databricks_group" "dbx_group" { display_name = "ENV_MONITORING_TEAM" } resource "databricks_permissions" "workflow_permission" { job_id ...

Data Engineering
databricks groups
Terraform
Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @soumiknow , I did some digging internally and found something that may help with the Terraform authentication issue: ...

smoortema
by Contributor
  • 320 Views
  • 2 replies
  • 2 kudos

Resolved! How to make FOR cycle and dynamic SQL and variables work together

I am working on a testing notebook where the table that is tested can be given as a widget. I wanted to write it in SQL. The notebook does the following steps in a cycle that should run 10 times: 1. Store the starting version of a delta table in a var...

Latest Reply
smoortema
Contributor
  • 2 kudos

Thank you! I realised that the example I gave was bad. However, what I was missing is that I did not know how to set a variable in SQL scripting. Including the SET command within the SQL string does not work; you have to use EXECUTE IMMEDIATE ...
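
For anyone searching later, a small illustration of that mechanism from a Python cell (the widget name and the count query are just stand-ins for the poster's version-tracking logic):

# Hypothetical widget holding the table under test
table_name = dbutils.widgets.get("table_name")

# SET inside the dynamic SQL string does not populate a session variable,
# but EXECUTE IMMEDIATE ... INTO writes the query result into one
spark.sql("DECLARE OR REPLACE VARIABLE row_cnt BIGINT DEFAULT 0")
spark.sql(f"EXECUTE IMMEDIATE 'SELECT count(*) FROM {table_name}' INTO row_cnt")
print(spark.sql("SELECT row_cnt").first()[0])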

1 More Replies
DatabricksEngi1
by Contributor
  • 403 Views
  • 4 replies
  • 1 kudos

Resolved! Problem in VS Code Extension

Until a few days ago, I was working with Databricks Connect using the VS Code extension, and everything worked perfectly. In my .databrickscfg file, I had authentication configured like this: [name] host: token: When I ran my code, everything worked fi...

Latest Reply
dkushari
Databricks Employee
  • 1 kudos

Hi @DatabricksEngi1 - Please ensure you have a Python Venv set up for each Python version that you use with Databricks Connect. Also, I have given step-by-step ways to debug the issue, clear the cache, etc [Read the files and instructions carefully b...

3 More Replies
manugarri
by New Contributor II
  • 19468 Views
  • 12 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements) but the internal dataset is on HDFS and we use Spark ...

Latest Reply
Shamzaa3Q
New Contributor II
  • 2 kudos

+1 for rapidfuzz, I have used it in production pipelines. Better than just the levenshtein function, as rapidfuzz provides a couple of other algorithms as well. I will warn you not to do what 2024 me attempted, which is to use an LLM to solve this. It soun...
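
A small sketch of the rapidfuzz-on-Spark pattern (broadcast the small client list, fuzzy-match the big table through a UDF; the table, column names, and cutoff are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from rapidfuzz import process, fuzz

client_names = ["Acme Corp", "Globex", "Initech"]       # the ~10k in-memory client list
names_bc = spark.sparkContext.broadcast(client_names)

@F.udf(StringType())
def best_match(name):
    if name is None:
        return None
    hit = process.extractOne(name, names_bc.value, scorer=fuzz.WRatio, score_cutoff=85)
    return hit[0] if hit else None

internal = spark.table("internal.company_names")        # hypothetical large table
matched = internal.withColumn("matched_client", best_match(F.col("company_name")))

On larger data, a pandas_udf that batches calls to rapidfuzz will usually outperform this row-at-a-time UDF.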

11 More Replies
pranaav93
by New Contributor III
  • 362 Views
  • 1 replies
  • 1 kudos

Resolved! TransformWithState is not emitting for live streams

Hi Team, for one of my custom logics I went with the transformWithState processor. However, it is not working for live stream inputs. I have a start date filter on my df_base, and when I give a start date that is not current, the processor computes df_loss ...

Data Engineering
apachespark
pyspark
StatefulStreaming
StructuredStreaming
transformWithState
Latest Reply
pranaav93
New Contributor III
  • 1 kudos

I managed to solve this. The issue was with how I handled the value state in the init method. It was handled as a DataFrame, which caused the state to never materialize or update, therefore emitting nulls. I changed them to a tuple of values and th...
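
In case it helps someone else, a rough sketch of what that fix looks like with the transformWithStateInPandas processor API. The class and method names below are from memory of recent runtimes and the input/output columns are made up, so treat the whole thing as an assumption rather than a verified template; the point is that the ValueState is updated with a plain tuple matching its schema, not with a DataFrame:

import pandas as pd
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, DoubleType

class LossProcessor(StatefulProcessor):
    def init(self, handle: StatefulProcessorHandle) -> None:
        # The state schema is a single double; the stored value is a tuple, not a DataFrame
        schema = StructType([StructField("last_loss", DoubleType(), True)])
        self.loss_state = handle.getValueState("last_loss", schema)

    def handleInputRows(self, key, rows, timer_values):
        total = 0.0
        for pdf in rows:                          # rows is an iterator of pandas DataFrames
            total += float(pdf["loss"].sum())     # made-up input column
        prev = self.loss_state.get()[0] if self.loss_state.exists() else 0.0
        self.loss_state.update((prev + total,))   # tuple matching the state schema
        yield pd.DataFrame({"key": [key[0]], "cumulative_loss": [prev + total]})

    def close(self) -> None:
        pass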

drag7ter
by Contributor
  • 2566 Views
  • 1 replies
  • 0 kudos

Delta sharing view and cached data in DSFF

I've created a view with row-level access based on the CURRENT_RECIPIENT() function in the WHERE clause, and I have 100s of clients as recipients that query this view. The problem is, when I modify this view with CREATE OR REPLACE and new SQL code, and reci...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Have you tried something like this already? Force Cache Invalidation (Recommended) -- After CREATE OR REPLACE VIEW, execute: ALTER SHARE <share_name> REMOVE TABLE <schema>.<view_name>; ALTER SHARE <share_name> ADD TABLE <schema>.<view_name>; Thi...

abhijit007
by New Contributor
  • 2107 Views
  • 1 replies
  • 1 kudos

Resolved! Lakebridge code conversion | Permission issue

Hi, I’ve successfully installed the transpile module from Lakebridge and tried the tool to convert Informatica mappings into PySpark code. However, I’m encountering a PermissionError during execution. I’ve provided the relevant environment details and...

Data Engineering
Lakebridge
Warehouse Migration
Latest Reply
dkushari
Databricks Employee
  • 1 kudos

Hi @abhijit007 - I see that this has been resolved in the 0.10.5 release. Can you please retest and confirm?

raghvendrarm1
by New Contributor
  • 297 Views
  • 2 replies
  • 3 kudos

Resolved! Results from the spark application to driver

I tried to read many articles but am still not clear on this: the executors complete the execution of tasks and have the results with them. 1. Are the results (output data) from all executors transported to the driver in all cases, or do executors persist them if tha...

Latest Reply
K_Anudeep
Databricks Employee
  • 3 kudos

Hello @raghvendrarm1 , below are the answers to your questions: Do executors always send “results” to the driver? No. Only actions that return values (e.g., collect, take, first, count) bring data back to the driver. collect explicitly “returns al...
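
A tiny illustration of the distinction (plain PySpark, nothing specific to this thread):

df = spark.range(10_000_000)

n = df.count()                  # only the final count (a few bytes) returns to the driver
head = df.limit(5).collect()    # these 5 rows are serialized back to the driver
# For writes, executors persist their partitions directly to storage;
# only task/commit metadata goes back to the driver
df.write.mode("overwrite").format("delta").save("/tmp/range_out")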

1 More Replies
