Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

jeremy98
by Honored Contributor
  • 2640 Views
  • 5 replies
  • 1 kudos

Set serverless compute environment for a task of a job

Hi Community, I want to set the environment of a task inside a job using DABs, but I got this error. I can achieve my goal if I manually set the task's environment to 2, because I need to use Python 3.11. How can I do it through DABs?

Latest Reply
jeremy98
Honored Contributor
  • 1 kudos

Hi, it seems that this could be set for spark_python_task:

resources:
  jobs:
    New_Job_Jan_29_2025_at_11_48_AM:
      name: New Job Jan 29, 2025 at 11:48 AM
      tasks:
        - task_key: test-py-version2
          spark_python_task:
            pyth...

4 More Replies
panganibana
by New Contributor II
  • 1070 Views
  • 1 reply
  • 0 kudos

Resolved! Inconsistency on Dataframe queried from External Data Source

We have a Catalog pointing to an External Data Source (Google BigQuery). 1) In a notebook, create a cell where it runs a query to populate a Dataframe. Display results. 2) Create another cell below and display the same Dataframe. 3) I get different resu...

Data Engineering
externaldata
Latest Reply
crystal548
New Contributor III
  • 0 kudos

@panganibana wrote: We have a Catalog pointing to an External Data Source (Google BigQuery). 1) In a notebook, create a cell where it runs a query to populate a Dataframe. Display results. 2) Create another cell below and display the same Dataframe. 3) I...

markbaas
by New Contributor III
  • 12359 Views
  • 9 replies
  • 0 kudos

DBFS_DOWN

I have an Azure Databricks workspace with Unity Catalog set up, using VNet and private endpoints. Serverless works great; however, the regular clusters have problems showing large results: Failed to store the result. Try rerunning the command. Failed ...

Latest Reply
markbaas
New Contributor III
  • 0 kudos

The DBFS (dbstorage) resource in the managed Azure resource group needs private endpoints to your virtual network. You can create those manually or through IaC (Bicep/Terraform).

8 More Replies
sdes10
by New Contributor II
  • 2884 Views
  • 3 replies
  • 0 kudos

DLT apply_as_deletes not working on existing data with full refresh

I have an existing DLT pipeline that works on a modified medallion architecture. Data is sent from Debezium to Kafka and lands in a bronze table. From the bronze table, it goes to a silver table where it is schematized, and finally to a gold table where I ...

Latest Reply
sdes10
New Contributor II
  • 0 kudos

@Sidhant07 How do I use skipChangeCommits? The idea is that I have bronze, silver and gold tables already built. Now I am enabling deletes on the gold table in the apply_changes API. The silver table now has an operation column (values c, u, r, d). I di...
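
For readers with the same question: `skipChangeCommits` is a Delta streaming read option that tells the stream to ignore commits that update or delete existing rows. The sketch below shows one way to set it on a streaming read. The function name and the table name passed to it are hypothetical, and whether this fits the pipeline described in this thread is an assumption, since the full posts are not shown here.

```python
def read_skipping_change_commits(spark, table_name):
    """Stream from a Delta table, ignoring commits that update or delete rows.

    `spark` is an existing SparkSession; `table_name` is whatever upstream
    table the stream should read (a hypothetical example: "LIVE.silver").
    """
    return (
        spark.readStream
        .option("skipChangeCommits", "true")  # skip update/delete commits
        .table(table_name)
    )
```

In a DLT Python pipeline, a function like this would typically be called from inside a table definition that returns the resulting streaming DataFrame.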

2 More Replies
Abdurrahman
by New Contributor II
  • 1642 Views
  • 3 replies
  • 0 kudos

How can I save a large Spark table (~88.3Mn rows) to a Delta Lake table?

I am trying to add a column to an existing Delta Lake table and save the result as a new table. The Spark driver is getting overloaded. I have a Databricks notebook to work with (with decent compute as well, g5.12xlarge) and have...

Latest Reply
Amit_Dass
New Contributor III
  • 0 kudos

Hi @Abdurrahman, in addition to what Sidhant07 said: I assume you are adding this new column and may be using it in queries, so use both OPTIMIZE and ZORDER. ZORDER (Highly Recommended): Even more important than just OPTIMIZE for adding columns eff...

2 More Replies
clentin
by Contributor
  • 4282 Views
  • 6 replies
  • 0 kudos

Import Py File

How do I import a .py file in a Databricks environment? Any help will be appreciated.
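
One common approach, sketched here under the assumption that the .py file sits in a folder the notebook can read, is to add that folder to `sys.path` and import the module by name. The temporary folder and the module name `demo_helpers` are stand-ins for illustration only; in a workspace you would point at the real folder instead.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the folder that holds your .py file.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "demo_helpers.py").write_text(
    "def greet(name):\n    return f'Hello, {name}'\n"
)

# Make the folder importable, then import the module by its file name.
if str(module_dir) not in sys.path:
    sys.path.append(str(module_dir))

demo_helpers = importlib.import_module("demo_helpers")
print(demo_helpers.greet("Databricks"))  # prints "Hello, Databricks"
```

If the module is edited after the first import, `importlib.reload(demo_helpers)` picks up the change without restarting the Python process.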

Latest Reply
fifata
New Contributor II
  • 0 kudos

@filipniziol @tejaswi24 Sorry to bring this up again, but I'm facing a somewhat similar problem. We have a Databricks Repo that is a copy of a GitHub repository. The GitHub repository contains only .py files, but when copied to Databricks, they all get converted to ...

5 More Replies
Splush_
by New Contributor III
  • 11839 Views
  • 5 replies
  • 6 kudos

Cannot cast Decimal to Double

Hey, I'm trying to save the contents of a database table to a Databricks Delta table. The schema coming from the database returns the number fields as decimal(38, 10). At least one of the values is too large for this data type. So I try to convert it usi...
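
For context, decimal(38, 10) reserves 10 of its 38 digits for the fraction, which leaves at most 28 digits before the decimal point. The sketch below, in plain Python, shows one way to flag values outside that range before attempting a cast. The helper name `fits_decimal` is hypothetical, and the check looks only at the integer part, so it ignores any carry caused by rounding the fraction.

```python
from decimal import Decimal


def fits_decimal(value: Decimal, precision: int = 38, scale: int = 10) -> bool:
    """Return True if the integer part of value fits in DECIMAL(precision, scale)."""
    max_integer_digits = precision - scale
    integer_part = abs(int(value))  # int() truncates toward zero
    digits = len(str(integer_part)) if integer_part else 0
    return digits <= max_integer_digits


# 28 integer digits fit in decimal(38, 10); 29 do not.
print(fits_decimal(Decimal("1" + "0" * 27)))  # True
print(fits_decimal(Decimal("1" + "0" * 28)))  # False
```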

Latest Reply
Splush_
New Contributor III
  • 6 kudos

Hey guys, thanks a lot for your help. Since this has already taken days, I have asked the application owners of the database to delete these values for me. Apparently they are weights in grams for whatever products, so the problematic rows are heav...

4 More Replies
susanne
by Databricks Partner
  • 1873 Views
  • 2 replies
  • 2 kudos

Resolved! Views in DLT with Private Preview feature Direct Publish

Hi everyone, I am building a DLT pipeline in which I am using the Direct Publish feature, which is still in Private Preview as of now. While it works well to create streaming tables and write them to a schema other than the DLT default schema, I ge...

Latest Reply
susanne
Databricks Partner
  • 2 kudos

Hi Sidhan, thanks a lot for your reply, it works very well to write materialized views to a different schema than the default schema. Thanks for your guidance! Best regards, Susanne

1 More Replies
AlexVB
by New Contributor III
  • 4916 Views
  • 2 replies
  • 0 kudos

Catalogue global UDFs

The current UDF implementation stores UDFs in a catalogue.schema location. This requires referencing that UDF location in every call, for example `select my_catalogue.my_schema.my_udf()`, or executing the SQL from that schema. In Snowflake, UDFs are globally a...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi @AlexVB, the current UDF implementation in Databricks requires referencing the UDF location with select my_catalogue.my_schema.my_udf() or executing SQL from that schema because Databricks organizes database objects using a three-tier hierarchy: ...

1 More Replies
messiah
by Databricks Partner
  • 3669 Views
  • 3 replies
  • 0 kudos

Unable to Read Data from S3 in Databricks (AWS Free Trial)

Hey Community, I recently signed up for a Databricks free trial on AWS and created a workspace using the quickstart method. After setting up my cluster and opening a notebook, I tried to read a Parquet file from S3 using: spark.read.parquet("s3://<bu...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi @messiah, this occurs because the AWS credentials or IAM roles needed to access the S3 bucket are missing. Can you please check the AWS credentials, IAM roles and IAM permissions: make sure the IAM role associated with the instance profile has...

2 More Replies
subhas_hati
by New Contributor
  • 7405 Views
  • 1 reply
  • 1 kudos

Partition Size:

Hi, I have chosen the default partition size of 128 MB. I am reading a 3.8 GB file and checking the partition size using df.rdd.getNumPartitions() as given below. I find the partition size is 159 MB. Why does the partition size differ after reading the file?...

Latest Reply
Sidhant07
Databricks Employee
  • 1 kudos

Hi @subhas_hati, the partition size of a 3.8 GB file read into a DataFrame differs from the default partition size of 128 MB, resulting in a partition size of 159 MB, due to the influence of the spark.sql.files.openCostInBytes configuration. • spark....
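
To make the interaction of these settings concrete, here is a rough sketch of the split-size arithmetic Spark applies when planning a file scan, written as plain Python. The function name is hypothetical; the parameters stand in for `spark.sql.files.maxPartitionBytes` (128 MB by default), `spark.sql.files.openCostInBytes` (4 MB by default) and the cluster's default parallelism. It is an approximation for reasoning about partition sizes, not a reproduction of the numbers in this thread.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_in_bytes=4 * MB):
    """Approximate the split size Spark targets when planning a file scan."""
    # Each file is padded with the open cost before averaging over the cores.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# A single 100 MB file on 8 cores: (100 MB + 4 MB) / 8 = 13 MB per split.
print(max_split_bytes([100 * MB], default_parallelism=8) // MB)  # 13
```

The final partition sizes also depend on how splits are packed together and on whether the file format is splittable, so observed sizes can differ from this target.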

Avinash_Narala
by Databricks Partner
  • 917 Views
  • 1 reply
  • 0 kudos

Python UDF to PySpark UDF conversion

Hi, I want to convert my Python UDF to a PySpark UDF. Is there any guide or article that suggests best practices and helps avoid miscalculations?
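
As a starting point, a minimal sketch of the usual pattern: keep the logic in a plain Python function that handles missing input, then wrap it with `pyspark.sql.functions.udf` and an explicit return type. The function names and the grams-to-kilograms logic are hypothetical examples, not taken from this thread.

```python
def grams_to_kilograms(grams):
    """Plain Python logic; returns None for missing input instead of raising."""
    if grams is None:
        return None
    return float(grams) / 1000.0


def as_spark_udf(func):
    """Wrap a Python function as a PySpark UDF with an explicit return type."""
    # Imported lazily so the plain function stays testable without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    return udf(func, DoubleType())
```

On a cluster, usage would look something like `df.withColumn("kg", as_spark_udf(grams_to_kilograms)("grams"))`, where `df` and the column names are assumptions. Returning a Python `float` that matches the declared `DoubleType` matters, because a mismatch between the returned value and the declared type can yield nulls rather than an error.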

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi, can you please share some more details? Let me know if this helps: https://docs.databricks.com/en/udf/unity-catalog.html

norbitek
by New Contributor II
  • 1891 Views
  • 1 reply
  • 0 kudos

Identity column and impact on performance

Hi, I want to define an identity column in a Delta table. Based on the documentation: "Declaring an identity column on a Delta table disables concurrent transactions. Only use identity columns in use cases where concurrent writes to the target table are not ...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

The use of an identity column in a Delta table affects the execution of a MERGE statement by disabling concurrent transactions. This constraint means that when performing operations such as upserting or deleting data, the identity column enforces tha...

JulianKrüger
by New Contributor
  • 2304 Views
  • 1 reply
  • 0 kudos

Limited concurrently running DLTs within a pipeline

Hi Champions! We are currently experiencing a few unexplained limitations when executing pipelines with > 50 DLT tables. It looks like there is some calculation in place in the background to determine the maximum number of concurrently running ...

Data Engineering
dlt
pipeline
Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi @JulianKrüger, • The "num_task_slots" parameter in Databricks Delta Live Tables (DLT) pipelines is related to the concurrency of tasks within a pipeline. It determines the number of concurrent tasks that can be executed. However, this parameter d...

sparkplug
by Contributor
  • 1842 Views
  • 4 replies
  • 1 kudos

Databricks logging of SQL queries to DBFS

Hi, our costs have suddenly spiked due to the logging of a lot of SQL query outputs to DBFS. We haven't made any changes to enable this. How can we disable this feature?

Latest Reply
sparkplug
Contributor
  • 1 kudos

I don't get any output when running the following. I have the destination set to DBFS, but that was only supposed to be for cluster logs, not for storing query execution outputs in DBFS. Any idea if this is expected behavior? spark.conf.get("sp...

3 More Replies