Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Splush_
by Visitor
  • 14 Views
  • 0 replies
  • 0 kudos

Cannot cast Decimal to Double

Hey, I'm trying to save the contents of a database table to a Databricks Delta table. The schema coming straight from the database returns the number fields as decimal(38, 10). At least one of the values is too large for this data type, so I try to convert it usi...

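A minimal PySpark sketch of the cast described above, assuming hypothetical table and column names (the excerpt is truncated, so none of these identifiers come from the thread):

from pyspark.sql import functions as F

# Read the source table; the name is a placeholder.
df = spark.read.table("source_db.amounts")

# Cast the decimal(38, 10) column to double: double trades exact decimal
# precision for a much wider numeric range.
df_cast = df.withColumn("amount", F.col("amount").cast("double"))

df_cast.write.format("delta").mode("overwrite").saveAsTable("bronze.amounts")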
mrkure
by New Contributor
  • 49 Views
  • 2 replies
  • 0 kudos

Databricks connect, set spark config

Hi, I am using Databricks Connect to compute with a Databricks cluster. I need to set some Spark configurations, namely spark.files.ignoreCorruptFiles. In my experience, setting a Spark configuration in Databricks Connect for the current session has...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Have you tried setting it up in your code as:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("YourAppName") \
    .config("spark.files.ignoreCorruptFiles", "true") \
    .getOrCreate()
# Yo...

1 More Replies
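A hedged completion of the truncated snippet in the reply above; "YourAppName" is a placeholder, and whether a builder-time config is honored over Databricks Connect can depend on the cluster setup:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("YourAppName")
    .config("spark.files.ignoreCorruptFiles", "true")
    .getOrCreate()
)

# For an already-running session, the same setting can usually be applied at runtime:
spark.conf.set("spark.files.ignoreCorruptFiles", "true")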
Avinash_Narala
by Valued Contributor II
  • 150 Views
  • 7 replies
  • 3 kudos

Resolved! SQL Server to Databricks Migration

Hi, I want to build a Python function to migrate SQL Server tables to Databricks. Are there any guides or best practices on how to do so? It would be really helpful if there are any. Regards, Avinash N

Latest Reply
filipniziol
Contributor III
  • 3 kudos

Hi @Avinash_Narala, if it is lift and shift, then try this:
1. Set up Lakehouse Federation to SQL Server
2. Use CTAS statements to copy each table into Unity Catalog: CREATE TABLE catalog_name.schema_name.table_name AS SELECT * FROM sql_server_catalog_...

6 More Replies
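A hedged sketch of looping that CTAS pattern from PySpark; the federated catalog, schema, and target names are placeholders, not values from the thread:

# List the tables exposed through the Lakehouse Federation catalog (names are assumptions).
tables = [row.tableName for row in spark.sql("SHOW TABLES IN sql_server_catalog.dbo").collect()]

for t in tables:
    # Copy each table into Unity Catalog with a CTAS statement.
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS my_catalog.bronze.{t} "
        f"AS SELECT * FROM sql_server_catalog.dbo.{t}"
    )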
boitumelodikoko
by Contributor
  • 149 Views
  • 2 replies
  • 0 kudos

[RETRIES_EXCEEDED] Error When Displaying DataFrame in Databricks Using Serverless Compute

Hi Databricks Community, I am encountering an issue when trying to display a DataFrame in a Python notebook using serverless compute. The operation seems to fail after several retries, and I get the following error message: [RETRIES_EXCEEDED] The maxim...

Latest Reply
boitumelodikoko
Contributor
  • 0 kudos

Hi @NandiniN, thank you for your response and insights. I appreciate you taking the time to help me troubleshoot this issue. To provide more context: DataFrame details: df_10hz contains high-frequency sensor data, and I am attempting to update its name c...

1 More Replies
SwathiChidurala
by New Contributor II
  • 4330 Views
  • 2 replies
  • 3 kudos

Resolved! deltaformat

Hi, I am a student learning Databricks. In the code below I tried to write data in Delta format to a gold layer. I authenticated using the service principal method to read, write, and execute data, and I assigned the Storage Blob Contributor role, but...

Latest Reply
Avinash_Narala
Valued Contributor II
  • 3 kudos

Hi @SwathiChidurala, the error is because you don't have the folder trip_zone inside the gold folder, so you can try removing trip_zone from the location, or adding the folder trip_zone inside the gold folder in ADLS and then trying again. If th...

1 More Replies
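A small sketch of the reply's suggestion as a notebook cell; the storage account, container, and the DataFrame df are placeholders for the poster's own values:

# Target location in ADLS; account and container names are assumptions.
gold_path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/gold/trip_zone"

# Create the missing trip_zone folder first (or drop the suffix from the write path).
dbutils.fs.mkdirs(gold_path)

df.write.format("delta").mode("overwrite").save(gold_path)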
garciargs
by New Contributor III
  • 65 Views
  • 1 replies
  • 2 kudos

DLT multiple source table to single silver table generating unexpected result

Hi, I've been trying this all day long. I'm building a POC of a pipeline that would be used in my everyday ETL. I have two initial tables, vendas and produtos, which are as follows: vendas_raw: venda_id, produto_id, data_venda, quantidade, valor_total, dth_in...

Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

When dealing with Change Data Capture (CDC) in Delta Live Tables, it's crucial to handle out-of-order data correctly. You can use the APPLY CHANGES API to manage this. The APPLY CHANGES API ensures that the most recent data is used by specifying a co...

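A minimal sketch of the APPLY CHANGES approach the reply describes, using the Python DLT API; the target table name and the sequencing column are assumptions (the question's timestamp column is truncated):

import dlt
from pyspark.sql.functions import col

# Declare the silver streaming table that APPLY CHANGES will maintain.
dlt.create_streaming_table("vendas_silver")

dlt.apply_changes(
    target="vendas_silver",
    source="vendas_raw",
    keys=["venda_id"],
    sequence_by=col("data_venda"),  # assumed ordering column
    stored_as_scd_type=1,
)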
kazinahian
by New Contributor III
  • 3388 Views
  • 1 replies
  • 1 kudos

How can I create a new calculated field in Databricks using PySpark?

Hello, great people. I am new to Databricks and learning PySpark. How can I create a new column called "sub_total", where I group by "category", "subcategory", and "monthly" sales value? I appreciate your empathetic solution.

Data Engineering
calculation
Latest Reply
Miguel_Suarez
Databricks Employee
  • 1 kudos

Hi @kazinahian, I believe what you're looking for is the .withColumn() DataFrame method in PySpark. It will allow you to create a new column with aggregations on other columns: https://docs.databricks.com/en/pyspark/basics.html#create-columns Best

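A short sketch of both routes, assuming a source DataFrame named sales_df with a numeric sales column (neither name appears in the thread):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Option A: grouped sub-totals, one row per category/subcategory/month.
sub_totals = (
    sales_df
    .groupBy("category", "subcategory", "monthly")
    .agg(F.sum("sales").alias("sub_total"))
)

# Option B: keep every row and attach sub_total via .withColumn() over a window,
# matching the reply's hint.
w = Window.partitionBy("category", "subcategory", "monthly")
with_sub_total = sales_df.withColumn("sub_total", F.sum("sales").over(w))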
HoussemBL
by New Contributor III
  • 93 Views
  • 2 replies
  • 0 kudos

External tables in DLT pipelines

Hello community, I have implemented a DLT pipeline. In the "Destination" setting of the pipeline I have specified a Unity Catalog with a target schema of type external referring to an S3 destination. My DLT pipeline works well. Yet, I noticed that all str...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hello @HoussemBL, you can use the code example below:

import dlt

@dlt.create_streaming_table(name="your_table_name", path="s3://your-bucket/your-path/", schema="schema-definition")
def your_table_function():
    return (
        spark.readStream.format("your_format").op...

1 More Replies
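A hedged variant of that idea using the @dlt.table decorator with an explicit path parameter; the bucket paths and the Auto Loader source format are placeholders, not values from the thread:

import dlt

@dlt.table(
    name="my_external_table",
    path="s3://your-bucket/your-path/",  # external storage location for the table's data files
)
def my_external_table():
    # Example source: Auto Loader reading JSON from a landing prefix (an assumption).
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://your-bucket/landing/")
    )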
santhoshKumarV
by New Contributor II
  • 159 Views
  • 2 replies
  • 2 kudos

Code coverage on Databricks notebook

I have a scenario where my application code (a Scala package) and notebook code [Scala] under the /resources folder are being maintained. I am trying to find the easiest way to perform code coverage on my notebooks; does Databricks provide any option for it....

Latest Reply
santhoshKumarV
New Contributor II
  • 2 kudos

An important thing I missed adding in the post is that we maintain notebook code as .scala files under resources and keep it in GitHub. Files (.scala) from resources get deployed as notebooks using a GitHub Action. With my approach of moving them under the package, I will ...

1 More Replies
matthiasn
by New Contributor II
  • 407 Views
  • 6 replies
  • 0 kudos

Resolved! Use temporary table credentials to access data in Databricks

Hi everybody, I tested the temporary table credentials API. It works great, as long as I use the credentials outside of Databricks (e.g. in a local DuckDB instance). But as soon as I try to use the short-lived credentials (Azure SAS for me) in Databric...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Hello Matthias, many thanks for sharing this valuable information, it is great to hear your issue got resolved.

5 More Replies
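For reference, a sketch of requesting the short-lived credentials the poster mentions via the Unity Catalog REST API; the endpoint shape is given as I understand it, and the host, token, and table_id are placeholders:

import requests

host = "https://<workspace-host>"
resp = requests.post(
    f"{host}/api/2.0/unity-catalog/temporary-table-credentials",
    headers={"Authorization": "Bearer <token>"},
    json={"table_id": "<table-uuid>", "operation": "READ"},
)
resp.raise_for_status()
creds = resp.json()  # short-lived storage credentials (e.g. an Azure SAS) plus the table URL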
KosmaS
by New Contributor III
  • 902 Views
  • 3 replies
  • 1 kudos

Skewness / Salting with countDistinct

Hey everyone, I experience data skewness for:

df = (source_df
    .unionByName(source_df.withColumn("region", lit("Country")))
    .groupBy("zip_code", "region", "device_type")
    .agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").a...

Attachment: Screenshot 2024-08-05 at 17.24.08.png
Latest Reply
Avinash_Narala
Valued Contributor II
  • 1 kudos

You can make use of the Databricks native feature "Liquid Clustering": cluster by the columns you want to use in your grouping statements, and it will handle the performance issue caused by data skewness. For more information, please visit: https://docs.dat...

2 More Replies
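A brief sketch of applying that suggestion to an existing Delta table; the table name is a placeholder, and the clustering columns echo the question's grouping keys:

# Enable liquid clustering on the grouping columns (table name is an assumption).
spark.sql("""
    ALTER TABLE my_catalog.my_schema.device_events
    CLUSTER BY (zip_code, region, device_type)
""")

# OPTIMIZE rewrites existing files according to the new clustering keys.
spark.sql("OPTIMIZE my_catalog.my_schema.device_events")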
analytics_eng
by New Contributor II
  • 286 Views
  • 2 replies
  • 0 kudos

Connection reset by peer logging when importing custom package

Hi! I'm trying to import a custom package I published to Azure Artifacts, but I keep seeing the INFO logging below, which I don't want to display. The package was installed correctly on the cluster, and it imports successfully, but the log still appe...

Latest Reply
analytics_eng
New Contributor II
  • 0 kudos

Thanks for the suggestions. I investigated all of the above, but they didn't provide a solution. What did work was using another logging package within my custom package: Loguru. I'm not sure why this helped.

1 More Replies