Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

lauraxyz
by Contributor
  • 1987 Views
  • 6 replies
  • 0 kudos

Notebook in path workspace/repos/.internal/**_commits/** was unable to be accessed

I have a workflow job (source is Git) that accesses a notebook and executes it. The job failed with the error: Py4JJavaError: An error occurred while calling o466.run. : com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAI...

Latest Reply
lauraxyz
Contributor
  • 0 kudos

Just some clarification: the caller notebook can be found with no issues, whether the task's source is GIT or WORKSPACE. However, the callee notebook, which is called by the caller notebook with dbutils.notebook.run(), cannot be found if the call...

5 More Replies
JordanYaker
by Contributor
  • 2327 Views
  • 2 replies
  • 0 kudos

Integration options for Databricks Jobs and DataDog?

I know that there is already the Databricks (technically Spark) integration for DataDog. Unfortunately, that integration only covers the cluster execution itself and that means only Cluster Metrics and Spark Jobs and Tasks. I'm looking for somethin...

Latest Reply
greg-0935
New Contributor
  • 0 kudos

Personally, I'm using their Data Jobs Monitoring product (https://docs.datadoghq.com/data_jobs/databricks/), which works great and gives the right insights for both my high-level job execution stats and deeper Spark metrics.

1 More Replies
Dhruv-22
by Contributor
  • 84 Views
  • 2 replies
  • 1 kudos

Resolved! Can't mergeSchema handle int and bigint?

I have a table with a column of data type 'bigint'. While overwriting it with new data (I do full loads), I used 'mergeSchema' to handle schema changes. The new data's data type was int. I thought mergeSchema could easily handle that, but...

Latest Reply
Chiran-Gajula
New Contributor
  • 1 kudos

Hi Dhruv, Delta won't automatically upcast unless you explicitly handle it. Cast the column Lob_Pk to LongType (which maps to BIGINT in SQL/Delta). Try the snippet below:
from pyspark.sql.functions import col
from pyspark.sql.types import LongType
crm_retail_...

1 More Replies
Marthinus
by New Contributor III
  • 113 Views
  • 4 replies
  • 1 kudos

[Databricks Asset Bundles] Bug: driver_node_type_id not updated

Working with Databricks Asset Bundles (using the new Python-based definition), if you have a job_cluster defined using driver_node_type_id and then update it so that only node_type_id is defined, the driver node_type never gets update...

Latest Reply
Chiran-Gajula
New Contributor
  • 1 kudos

There is no built-in way in Databricks Asset Bundles or Terraform to automatically inherit the value of driver_node_type_id for node_type_id. "You must set both explicitly in your configuration." You can always see your updated detail resource from the ...

3 More Replies
Dhruv-22
by Contributor
  • 1687 Views
  • 2 replies
  • 0 kudos

Resolved! Understanding least common type in databricks

I was reading the data type rules and came across the least common type. I have a question: what is the least common type of STRING and INT? The linked page gives the following example, saying the least common type is BIGINT.
-- The least common type between...

Latest Reply
Dhruv-22
Contributor
  • 0 kudos

The question is solved here - link

1 More Replies
Dhruv-22
by Contributor
  • 122 Views
  • 4 replies
  • 4 kudos

Resolved! Least Common Type is different in Serverless and All Purpose Cluster.

The following statement gives different outputs on different compute.
In Databricks, 15.4 LTS:
%sql
SELECT typeof(coalesce(5, '6'));
-- Output
string
In Serverless, environment version 4:
%sql
SELECT typeof(coalesce(5, '6'));
-- Output
bigint
There are other cas...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 4 kudos

@Dhruv-22 Regarding your first question, I'm not sure. You can refer to https://docs.databricks.com/aws/en/sql/language-manual/parameters/ansi_mode#system-default to understand what happens when ANSI mode is disabled.

3 More Replies
anusha98
by New Contributor
  • 86 Views
  • 2 replies
  • 3 kudos

Regarding : How to use Row_number() in dlt pipelines

We have two streaming tables, customer_info and customer_info_history. We joined them with a full join to create a temp table in PySpark, and now we want to eliminate the de-duped records from this temp table. We tried using row_number() but are facing bel...

Latest Reply
K_Anudeep
Databricks Employee
  • 3 kudos

Hello @anusha98 , You’re hitting a real limitation of Structured Streaming: non-time window functions (like row_number() over (...)) aren’t allowed on streaming DFs. You need to use agg().max() to get the “latest value per key” @dlt.table(name="temp_...
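
The "latest value per key" idea that replaces row_number() can be illustrated outside Spark. The following is a plain-Python sketch of the aggregation logic only, not DLT or Structured Streaming code; the (key, timestamp, value) row layout and the function name `latest_per_key` are assumptions made for illustration.

```python
# Plain-Python illustration of "keep the latest row per key".
# The row layout (key, timestamp, value) is hypothetical.
def latest_per_key(rows):
    """Return {key: (timestamp, value)} holding the newest row for each key."""
    latest = {}
    for key, ts, value in rows:
        # Replace the stored row only when a strictly newer timestamp arrives.
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return latest


rows = [
    ("c1", 1, "a"),
    ("c2", 5, "x"),
    ("c1", 3, "b"),  # newer than the first "c1" row, so it wins
]
result = latest_per_key(rows)
print(result)
```

Unlike a ranking window, this needs only one running maximum per key, which is why a max-style aggregation fits incremental processing.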

1 More Replies
AmarKap
by New Contributor
  • 76 Views
  • 1 replies
  • 1 kudos

Lakeflow Pipelines Trying to Read accented file with spark.readStream but failure

Trying to read an accented file (French characters), but the spark.readStream function is not working and special characters turn into something strange (e.g. �):
spark.readStream
    .format("cloudfiles")
    .option("cloudFiles....

Latest Reply
K_Anudeep
Databricks Employee
  • 1 kudos

Hello @AmarKap, when Spark decodes CP1252 bytes as UTF-8/ISO-8859-1, you’ll see the replacement character (�). Can you read the file as: df = (spark.readStream.format("cloudFiles").option("cloudFiles.format", "text").option("encoding", "windows-1252")...
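
The byte-level effect described in this reply can be reproduced in plain Python, without Spark. This is a minimal sketch: the sample word and the helper name `decode_bytes` are assumptions for illustration, and it shows only how a CP1252 byte fails under UTF-8 decoding, not how Auto Loader behaves.

```python
# Illustration only: the sample text and helper name are hypothetical.
def decode_bytes(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, substituting U+FFFD for undecodable sequences."""
    return raw.decode(encoding, errors="replace")


raw = "café".encode("cp1252")        # 'é' is stored as the single byte 0xE9

wrong = decode_bytes(raw, "utf-8")   # a lone 0xE9 is not valid UTF-8
right = decode_bytes(raw, "cp1252")  # matches the encoding the file was written in

print(repr(wrong))
print(repr(right))
```

Once the replacement character has been written, the original byte is lost, so the encoding has to be set at read time rather than repaired afterwards.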

Gustavo_Az
by Contributor
  • 1642 Views
  • 1 replies
  • 0 kudos

Doubt with range_join hints optimization, using INSERT INTO REPLACE WHERE

Hello, I'm optimizing a big notebook and have repeatedly encountered the tip from Databricks that says "Unused range join hints". Using the documentation for reference, I have been able to suppress that warning in almost all cells, but some of them rema...

Latest Reply
Prajapathy_NKR
New Contributor
  • 0 kudos

Hi @Gustavo_Az, try using EXPLAIN to understand what's happening: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html

EndreM
by New Contributor III
  • 2174 Views
  • 1 replies
  • 1 kudos

Replay stream to migrate to liquid cluster

The documentation is sparse about how to migrate a partitioned table to liquid clustering, as the ALTER TABLE suggested in the documentation doesn't work when it's a partitioned table. The comments on this forum suggest replaying the stream. And this is wha...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @EndreM , I did some digging internally and I have come up with some helpful tips/tricks to help guide you through this issue: Based on your situation, you're encountering several common challenges when migrating a partitioned table to liqu...

soumiknow
by Contributor II
  • 2041 Views
  • 1 replies
  • 1 kudos

Unable to create databricks group and add permission via terraform

I have the following Terraform code to create a Databricks group and add permission to a workflow:
resource "databricks_group" "dbx_group" {
  display_name = "ENV_MONITORING_TEAM"
}
resource "databricks_permissions" "workflow_permission" {
  job_id ...

Data Engineering
databricks groups
Terraform
Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Greetings @soumiknow , I did some digging internally and found something that may help: Based on the information gathered, I can now draft a comprehensive response to this Databricks Community question about the Terraform authentication issue. ## Dra...

smoortema
by New Contributor III
  • 191 Views
  • 2 replies
  • 2 kudos

Resolved! How to make FOR cycle and dynamic SQL and variables work together

I am working on a testing notebook where the table that is tested can be given as a widget. I wanted to write it in SQL. The notebook does the following steps in a cycle that should run 10 times:
1. Store the starting version of a delta table in a var...

Latest Reply
smoortema
New Contributor III
  • 2 kudos

Thank you! I realised that the example I gave was bad. However, what I was missing was how to set a variable in SQL scripting. Including the SET command within the SQL string does not work; you have to use the EXECUTE IMMEDIATE ......

1 More Replies
DatabricksEngi1
by New Contributor III
  • 190 Views
  • 4 replies
  • 1 kudos

Resolved! Problem in VS Code Extension

Until a few days ago, I was working with Databricks Connect using the VS Code extension, and everything worked perfectly. In my .databrickscfg file, I had authentication configured like this:
[name]
host:
token:
When I ran my code, everything worked fi...
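
When debugging a profile like the one above, a quick sanity check is to confirm that the profile section exists and that its keys are non-empty. The following is a rough sketch using Python's standard `configparser`; the helper name `missing_profile_keys`, the sample host, and the placeholder token are all hypothetical, and the check says nothing about whether the credentials are actually valid.

```python
import configparser


def missing_profile_keys(cfg_text, profile, required=("host", "token")):
    """Return the required keys that are absent or empty in the given profile."""
    # Disable interpolation so '%' characters in values are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section(profile):
        return list(required)
    section = parser[profile]
    return [key for key in required if not section.get(key, "").strip()]


# Hypothetical file contents; real values would come from ~/.databrickscfg.
SAMPLE = """
[name]
host = https://example.cloud.databricks.com
token = dapi-placeholder
"""

print(missing_profile_keys(SAMPLE, "name"))
print(missing_profile_keys(SAMPLE, "other"))
```

`configparser` accepts both `key = value` and `key: value`, so either spelling of the profile is read the same way.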

Latest Reply
dkushari
Databricks Employee
  • 1 kudos

Hi @DatabricksEngi1 - Please ensure you have a Python Venv set up for each Python version that you use with Databricks Connect. Also, I have given step-by-step ways to debug the issue, clear the cache, etc [Read the files and instructions carefully b...

3 More Replies
manugarri
by New Contributor II
  • 18810 Views
  • 12 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements) but the internal dataset is on hdfs and we use Spark ...

Latest Reply
Shamzaa3Q
New Contributor II
  • 2 kudos

+1 for rapidfuzz, I have used it in production pipelines. It is better than just a Levenshtein function, as rapidfuzz provides a couple of other algorithms as well. I will warn you not to do what 2024 me attempted, which is to use an LLM to solve this. It soun...
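
For a dependency-free feel of what fuzzy name matching looks like, here is a small sketch that swaps in the standard library's `difflib` in place of rapidfuzz. It is not the approach from this thread: the helper name `best_match`, the sample company names, and the 0.6 cutoff are assumptions, and it covers only the small in-memory list, not the Spark side.

```python
import difflib


def best_match(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if none reaches cutoff."""
    # Compare on lowercased, trimmed text, but return the original spelling.
    normalized = {c.lower().strip(): c for c in candidates}
    hits = difflib.get_close_matches(
        name.lower().strip(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Hypothetical internal names.
internal = ["ACME Corporation", "Globex", "Initech LLC"]

print(best_match("Acme Corp", internal))
print(best_match("Zzzz", internal))
```

The cutoff trades recall for precision: raising it rejects more weak matches, and the right value depends on how noisy the client names are.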

11 More Replies
pranaav93
by New Contributor III
  • 89 Views
  • 1 replies
  • 1 kudos

Resolved! TransformWithState is not emitting for live streams

Hi Team, for one piece of custom logic I went with a transformWithState processor. However, it is not working for live stream inputs. I have a start date filter on my df_base, and when I give a start date that is not current, the processor computes df_loss ...

Data Engineering
apachespark
pyspark
StatefulStreaming
StructuredStreaming
transformWithState
Latest Reply
pranaav93
New Contributor III
  • 1 kudos

I managed to solve this. The issue was with how I handled the value state in the def init method. It was handled as a DataFrame, which caused the state to never materialize or update, therefore emitting nulls. I changed them to a tuple of values and th...

