Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by smpa01 (New Contributor III)
  • 50 Views
  • 1 reply
  • 1 kudos

Resolved! tbl name as parameter marker

I am getting an error here when I do this:
// this works fine
declare sqlStr = 'select col1 from catalog.schema.tbl LIMIT (?)';
declare arg1 = 500;
EXECUTE IMMEDIATE sqlStr USING arg1;
// this does not
declare sqlStr = 'select col1 from (?) LIMIT (?)';...

Latest Reply
LRALVA
Contributor III
  • 1 kudos

@smpa01 In SQL EXECUTE IMMEDIATE, you can only parameterize values, not identifiers like table names, column names, or database names. That is, placeholders (?) can only replace constant values, not object names (tables, schemas, columns, etc.). SELECT...
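A minimal PySpark sketch of that distinction, assuming a runtime with named parameter markers and the IDENTIFIER clause (recent DBR versions): the LIMIT value binds as an ordinary parameter, while the table name must be routed through IDENTIFIER(); "catalog.schema.tbl" is the thread's placeholder name.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The value parameter binds via a marker; the table name resolves through
# IDENTIFIER(), since markers alone cannot stand in for object names.
df = spark.sql(
    "SELECT col1 FROM IDENTIFIER(:tbl) LIMIT :n",
    args={"tbl": "catalog.schema.tbl", "n": 500},
)
df.show()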

by 397973 (New Contributor III)
  • 57 Views
  • 1 reply
  • 0 kudos

Several unavoidable for loops are slowing this PySpark code. Is it possible to improve it?

Hi. I have a PySpark notebook that takes 25 minutes to run, as opposed to one minute on on-prem Linux + Pandas. How can I speed it up? It's not a volume issue. The input is around 30k rows. Output is the same because there's no filtering or aggregation...

Latest Reply
LRALVA
Contributor III
  • 0 kudos

@397973 Spark is optimized for 100s of GB or millions of rows, NOT small in-memory lookups with heavy control flow (unless engineered carefully). That's why Pandas is much faster for your specific case now. Pre-load and broadcast all mappings: instead of...
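The truncated reply points at broadcasting the lookup mappings. A minimal PySpark sketch of that pattern, with tiny made-up frames standing in for the real data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the thread's data: a ~30k-row input and a
# small lookup mapping that the per-row loop used to consult.
rows = spark.createDataFrame([(1, "A"), (2, "B"), (3, "A")], ["id", "code"])
mapping = spark.createDataFrame([("A", "Alpha"), ("B", "Beta")], ["code", "label"])

# One broadcast join replaces the Python-side loop: the small mapping is
# shipped to every executor, so the main frame is never shuffled.
result = rows.join(F.broadcast(mapping), on="code", how="left")
result.show()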

by minhhung0507 (Contributor III)
  • 643 Views
  • 15 replies
  • 3 kudos

API for Restarting Individual Failed Tasks within a Job?

Hi everyone, I'm exploring ways to streamline my workflow in Databricks and could really use some expert advice. In my current setup, I have a job (named job_silver) with multiple tasks (e.g., task 1, task 2, task 3). When one of these tasks fails—say...

Latest Reply
aayrm5
Valued Contributor III
  • 3 kudos

Hey @minhhung0507 - quick question - what is the cluster type you're using to run your workflow? I'm using a shared, interactive cluster, so I'm passing the parameter {'existing_cluster_id': task['existing_cluster_id']} in the payload. This parameter ...
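For reference, a hedged sketch of the repair-run call this thread converges on: POST /api/2.1/jobs/runs/repair re-runs only the named failed tasks of an existing run. Host, token, run id, and task key below are placeholders.

import requests

HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "run_id": 123456,           # the run that contains the failed task
        "rerun_tasks": ["task_2"],  # only this task is re-executed
    },
)
resp.raise_for_status()
print(resp.json())  # the response carries a repair_id for the new attempt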

14 More Replies
by daan_dw (New Contributor)
  • 226 Views
  • 1 reply
  • 0 kudos

Writing files using multithreading to dbfs

Hello, I am reading in XML files from AWS S3 and storing them on dbfs:/ using multithreaded code. The code itself seems to be fine, as for the first ±100,000 files it works without issues and I can see the data arriving on DBFS. However, it will always...

[Attachment: Screenshot 2025-04-11 at 16.14.04.png]
Latest Reply
SP_6721
New Contributor III
  • 0 kudos

Hi @daan_dw I think this issue mainly comes from using multithreading to handle XML files while interacting with both S3 and DBFS. When the thread count gets too high, it likely causes race conditions. To avoid this: try reducing the number of threads....
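A small illustration of the bounded-concurrency suggestion, assuming a hypothetical copy_one helper that does the S3-to-DBFS work for a single file:

import concurrent.futures as cf

def copy_one(path: str) -> str:
    # ... fetch the XML from S3 and write it under dbfs:/ ...
    return path

paths = [f"s3://bucket/raw/file_{i}.xml" for i in range(1_000)]  # placeholders

# A small fixed pool caps how many S3 reads / DBFS writes are in flight
# at once, instead of letting thousands of threads pile up.
with cf.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(copy_one, p) for p in paths]
    for fut in cf.as_completed(futures):
        fut.result()  # re-raises per-file errors instead of failing silently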

by Mano99 (New Contributor II)
  • 258 Views
  • 2 replies
  • 0 kudos

Resolved! Databricks external table maximum row size

Hi Databricks Team/Community, We have created a Databricks external table on top of ADLS Gen2, both Parquet and Delta tables. We are loading a nested JSON structure into a table. A few columns will have huge nested JSON data. I'm getting results too large...

Latest Reply
dennis65
New Contributor II
  • 0 kudos

@Mano99 wrote: Hi Databricks Team/Community, We have created a Databricks external table on top of ADLS Gen2, both Parquet and Delta tables. We are loading a nested JSON structure into a table. A few columns will have huge nested JSON data. I'm getting...

1 More Reply
by smpa01 (New Contributor III)
  • 176 Views
  • 1 reply
  • 0 kudos

Resolved! global temp view issue

I am following doc1 and doc2 but I am getting an error. I was under the impression from the documentation that this is doable in pure SQL. What am I doing wrong? I know how to do this in Python using the DataFrame API, and I am not looking for that soluti...

[Attachment: smpa01_0-1745253862295.png]
Latest Reply
smpa01
New Contributor III
  • 0 kudos

It was just missing a ';'.
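For anyone landing here, a sketch of the flow in question, one statement per call; in a single SQL cell the two statements must be separated by the ';' that was missing. View and column names are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the global temp view, then query it through the global_temp schema.
spark.sql("CREATE OR REPLACE GLOBAL TEMPORARY VIEW my_view AS SELECT 1 AS col1")
spark.sql("SELECT col1 FROM global_temp.my_view").show()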

by minhhung0507 (Contributor III)
  • 139 Views
  • 1 reply
  • 0 kudos

Handling Hanging Pipelines in Real-Time Environments: Leveraging Databricks’ Idle Event Monitoring

Hi everyone, I’m running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. While most pipelines are running smoothly, I’ve noticed that a few of them occasionally get “stuck” or hang for several hours w...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

May I ask why you use thread pools? With jobs you can define multiple tasks which do the same. I'm asking because thread pools and Spark resource management can interfere with each other.
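To make the suggestion concrete, a hedged sketch of the same fan-out expressed as parallel job tasks via the Jobs 2.1 create API; paths and names are placeholders, and cluster settings are omitted for brevity.

import requests

HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "realtime_pipelines",
    "tasks": [  # tasks with no depends_on between them run in parallel
        {"task_key": "pipeline_a", "notebook_task": {"notebook_path": "/pipelines/a"}},
        {"task_key": "pipeline_b", "notebook_task": {"notebook_path": "/pipelines/b"}},
    ],
}
resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()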

by Dnirmania (Contributor)
  • 662 Views
  • 4 replies
  • 0 kudos

Read file from AWS S3 using Azure Databricks

Hi Team, I am currently working on a project to read CSV files from an AWS S3 bucket using an Azure Databricks notebook. My ultimate goal is to set up an Auto Loader in Azure Databricks that reads new files from S3 and loads the data incrementally. However...

[Attachment: Dnirmania_0-1744106993274.png]
Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 0 kudos

No, it is very easy. Follow this guide and it will work: https://github.com/aviral-bhardwaj/MyPoCs/blob/main/SparkPOC/ETLProjectsAWS-S3toDatabricks.ipynb
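A hedged sketch of the Auto Loader setup being attempted, with placeholder bucket, credentials, and checkpoint path. Real keys belong in a secret scope, and some clusters need these set in the cluster config with a spark.hadoop. prefix instead.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AWS credentials must be supplied explicitly on Azure (placeholders here).
spark.conf.set("fs.s3a.access.key", "<aws-access-key-id>")
spark.conf.set("fs.s3a.secret.key", "<aws-secret-access-key>")

stream = (
    spark.readStream.format("cloudFiles")  # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "dbfs:/checkpoints/s3_csv/_schema")
    .load("s3a://my-bucket/incoming/")
)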

3 More Replies
by mrstevegross (Contributor III)
  • 541 Views
  • 3 replies
  • 0 kudos

Graviton & containers?

Currently, DBR does not permit a user to run a containerized job on Graviton machines (per these docs). In our case, we're running containerized jobs on a pool. We are exploring adopting Graviton, but--per those docs--DBR won't let us do that. Are t...

Latest Reply
Isi
Contributor III
  • 0 kudos

Hey @mrstevegross Steve, I have found these docs from Databricks about environments; as you can see, it is in public preview... If you find my previous answer helpful, feel free to mark it as the solution so it can help others as well. Thanks! Isi

2 More Replies
by vishaldevarajan (New Contributor II)
  • 474 Views
  • 3 replies
  • 0 kudos

Unable to read Excel files in Azure Databricks (UC-enabled workspace)

Hello, After adding the Maven library com.crealytics:spark-excel_2.12:0.13.5 to the artifact allowlist, I installed it at the Azure Databricks cluster level (shared, Unity Catalog enabled, runtime 15.4). Then I tried to create a df for the exc...

Labels: Data Engineering, Azure Databricks, Excel File
Latest Reply
BigRoux
Databricks Employee
  • 0 kudos

I did a little more digging and found further information: Unity Catalog does not natively support reading Excel files directly. Based on the provided context, there are a few key points to consider. Third-party libraries: reading Excel files in D...
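For reference, a hedged sketch of reading an Excel file with the spark-excel library this thread installs; the storage path and sheet address are placeholders, and on a shared UC cluster the custom data source may still be blocked, which is the limitation the reply describes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("dataAddress", "'Sheet1'!A1")  # sheet name and top-left cell
    .option("inferSchema", "true")
    .load("abfss://container@account.dfs.core.windows.net/raw/report.xlsx")
)
df.show()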

2 More Replies
by Tommabip (New Contributor II)
  • 289 Views
  • 3 replies
  • 2 kudos

Resolved! Databricks Cluster Policies

Hi, I'm trying to create a Terraform script that does the following:
- create a policy where I specify env variables and libraries
- create a cluster that inherits from that policy and uses the env variables specified in the policy.
I saw in the docume...

Latest Reply
BigRoux
Databricks Employee
  • 2 kudos

You're correct in observing this discrepancy. When a cluster policy is defined and applied through the Databricks UI, fixed environment variables (`spark_env_vars`) specified in the policy automatically propagate to clusters created under that policy...
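A hedged sketch of such a policy, created here through the Python SDK with an illustrative name and variable; the same definition JSON is what Terraform's databricks_cluster_policy resource takes in its definition field.

import json
from databricks.sdk import WorkspaceClient

# Auth is assumed to come from the environment (e.g. DATABRICKS_HOST/TOKEN).
w = WorkspaceClient()

# A "fixed" policy entry pins the env var on every cluster created under
# the policy, which is the propagation behavior the reply describes.
definition = {
    "spark_env_vars.MY_ENV": {"type": "fixed", "value": "prod"},
}
policy = w.cluster_policies.create(
    name="env-pinned-policy",
    definition=json.dumps(definition),
)
print(policy.policy_id)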

2 More Replies
by valde (New Contributor)
  • 144 Views
  • 1 reply
  • 0 kudos

Window function vs. groupBy + map

Let's say we have an RDD like this: RDD(id: Int, measure: Int, date: LocalDate). Let's say we want to apply some function that compares 2 consecutive measures by date, outputs a number, and we want to get the sum of those numbers by id. The function is b...

Latest Reply
Renu_
Contributor
  • 0 kudos

Hi @valde, those two approaches give the same result, but they don’t work the same way under the hood. Spark SQL uses optimized window functions that handle things like shuffling and memory more efficiently, often making it faster and lighter. On the o...
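A small PySpark sketch of the window-function formulation, with made-up data: lag() compares consecutive measures per id ordered by date, then a groupBy sums the per-row results.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10, "2024-01-01"), (1, 13, "2024-01-02"), (2, 7, "2024-01-01")],
    ["id", "measure", "date"],
)

# Compare each measure with the previous one within the same id.
w = Window.partitionBy("id").orderBy("date")
deltas = df.withColumn("delta", F.col("measure") - F.lag("measure").over(w))

# Then aggregate the comparison results per id.
result = deltas.groupBy("id").agg(F.sum("delta").alias("total"))
result.show()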
