Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

PragyaS
by New Contributor
  • 1771 Views
  • 2 replies
  • 1 kudos

Resolved! Decimal Automatic Rounding on sum

I am facing an issue: I have implemented code that performs a sum on decimal values, with precision set as Decimal(19,2). When calculating over big data I get a different value than the one from my .NET utility application. e.g. From...
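As a rough illustration of the scenario described in the truncated post (this is not the poster's code; the column name and values are made up), the sketch below shows how PySpark widens the decimal type when summing a Decimal(19,2) column, and how an explicit cast makes the result type visible, which can be a useful starting point when totals disagree with another tool.

```python
# Minimal sketch: summing a Decimal(19,2) column in PySpark (illustrative data).
from decimal import Decimal
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("amount", DecimalType(19, 2))])
df = spark.createDataFrame(
    [(Decimal("19999999999999999.99"),), (Decimal("0.01"),)], schema
)

# Spark widens the result type of SUM on decimals; inspect it explicitly.
summed = df.agg(F.sum("amount").alias("total"))
summed.printSchema()
summed.show(truncate=False)

# Casting to a wider decimal before aggregating makes the result type explicit.
widened = df.select(F.col("amount").cast(DecimalType(38, 2)).alias("amount"))
widened.agg(F.sum("amount")).show(truncate=False)
```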

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Pragya Sharma, we haven't heard from you since the last response from @Werner Stinckens. Kindly share the requested information with us so we can provide the necessary solution. Thanks and regards

1 More Reply
Satty
by New Contributor
  • 5939 Views
  • 1 reply
  • 0 kudos

Solution for ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

Whenever I try to load multiple files into a single dataframe for processing (the overall file size is more than 15 GB in the dataframe at the end of the loop), my code crashes every time with the error below... ConnectException error: Thi...

Latest Reply
pvignesh92
Honored Contributor
  • 0 kudos

@Satish Agarwal It seems your system memory is not sufficient to load the 15 GB of files. I believe you are using a Python Pandas DataFrame to load them rather than Spark. Is there any particular reason you cannot use Spark for this?
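As a rough sketch of the reply's suggestion (file paths and the column name are hypothetical), this is one way to load many large files into a single Spark DataFrame so the data stays distributed across the cluster instead of being collected into driver memory as a pandas frame:

```python
# Minimal sketch: reading multiple large CSV files into one distributed DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark accepts a list of paths (or a glob) and reads them lazily in parallel.
paths = ["/mnt/raw/file1.csv", "/mnt/raw/file2.csv", "/mnt/raw/file3.csv"]  # hypothetical paths
df = (
    spark.read
    .option("header", "true")
    .csv(paths)
)

# Keep transformations distributed; only bring small aggregates back to the driver.
df.groupBy("some_key").count().show()  # "some_key" is a placeholder column
```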

Erik_L
by Contributor II
  • 6513 Views
  • 2 replies
  • 2 kudos

Joining a big amount of data causes "Out of disk space error", how to ingest?

What I am trying to do:
df = None
# For all of the IDs that are valid
for id in ids:
    # Get the parts of the data from different sources
    df_1 = spark.read.parquet(url_for_id)
    df_2 = spark.read.parquet(url_for_id)
    ...
    # Join together the pa...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Erik Louie: There are several strategies you can use to handle large joins like this in Spark. Use a broadcast join: if one of your dataframes is relatively small (less than 10-20 GB), you can use a broadcast join to avoid shuffling data. A bro...
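As a hedged illustration of the broadcast-join strategy mentioned in the reply (the paths and join key below are hypothetical), the small side is replicated to every executor so the large side never has to shuffle:

```python
# Minimal sketch: broadcast join of a large table with a small lookup table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

large_df = spark.read.parquet("/mnt/data/large_events")   # hypothetical path
small_df = spark.read.parquet("/mnt/data/small_lookup")   # hypothetical path

# Broadcast hint: Spark ships small_df to all executors instead of shuffling large_df.
joined = large_df.join(F.broadcast(small_df), on="id", how="left")  # "id" is a placeholder key

joined.write.mode("overwrite").parquet("/mnt/data/joined_output")   # hypothetical target
```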

1 More Reply
Rishabh-Pandey
by Esteemed Contributor
  • 1214 Views
  • 2 replies
  • 5 kudos

"Hey everyone, it seems like there's some confusion about enhanced autoscaling in Databricks lately. If you're feeling lost or unsure abo...

"Hey everyone, it seems like there's some confusion about enhanced autoscaling in Databricks lately. If you're feeling lost or unsure about how it works, don't worry - you're not"Enhanced autoscaling is a feature in Databricks that enables dynamic sc...

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 5 kudos

Very informative, thanks for sharing!

1 More Reply
spartakos
by New Contributor
  • 691 Views
  • 0 replies
  • 0 kudos

Big data ingest into Delta Lake

I have a feature table in BQ that I want to ingest into Delta Lake. This feature table in BQ has 100 TB of data. The table can be partitioned by DATE. What best practices and approaches can I take to ingest this 100 TB? In particular, what can I do to ...
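One possible approach, sketched under the assumption that the Spark BigQuery connector is available and that the project, dataset, table, and target path below are placeholders: read one DATE partition per run and append it to a date-partitioned Delta table, so each job stays bounded.

```python
# Minimal sketch: incremental BigQuery -> Delta Lake ingest, one DATE partition per run.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

ingest_date = "2023-01-01"  # would normally come from a job parameter

bq_df = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.feature_table")  # hypothetical table
    .load()
    .filter(F.col("DATE") == ingest_date)
)

(
    bq_df.write.format("delta")
    .mode("append")
    .partitionBy("DATE")
    .save("/mnt/delta/feature_table")  # hypothetical target path
)
```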

PJ
by New Contributor III
  • 1692 Views
  • 3 replies
  • 3 kudos

Resolved! How should you optimize <1GB delta tables?

I have seen the following documentation that details how you can work with the OPTIMIZE function to improve storage and querying efficiency. However, most of the documentation focuses on big data, 10 GB or larger. I am working with a ~7 million row ...

Latest Reply
PJ
New Contributor III
  • 3 kudos

Thank you @Hubert Dudek!! So I gather from your response that it's totally fine to have a delta table that lives in a single file of roughly 211 MB. And I can use OPTIMIZE in conjunction with ZORDER to filter on a frequently filtered, high-cardina...
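For reference, a minimal sketch of the OPTIMIZE/ZORDER call discussed above (table and column names are hypothetical; `spark` is the notebook's predefined SparkSession); on a table this small it typically rewrites the data into one file ordered by the chosen column.

```python
# Minimal sketch: compacting and Z-ordering a small Delta table from a notebook.
spark.sql("""
    OPTIMIZE my_catalog.my_schema.my_small_table
    ZORDER BY (customer_id)
""")

# Optional: inspect the resulting file layout and size.
spark.sql("DESCRIBE DETAIL my_catalog.my_schema.my_small_table").show(truncate=False)
```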

2 More Replies
ninjadev999
by New Contributor II
  • 5807 Views
  • 7 replies
  • 1 kudos

Resolved! Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks

I'm reading a huge CSV file containing 39,795,158 records and writing it into a MSSQL server, on Azure Databricks. The Databricks notebook is running on a cluster with 56 GB memory, 16 cores, and 12 workers. This is my code in Python and PySpark: from ...
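Since the original code is truncated, here is a hedged sketch of one common shape for this kind of job: a JDBC write to SQL Server with batching and a bounded number of parallel connections (the connection details, path, and table name are placeholders, not the poster's actual setup).

```python
# Minimal sketch: writing a large DataFrame to SQL Server over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("/mnt/raw/huge_file.csv")  # hypothetical path

(
    df.repartition(12)  # at most 12 concurrent JDBC connections, one per partition
    .write.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")  # placeholder
    .option("dbtable", "dbo.target_table")   # placeholder
    .option("user", "my_user")               # placeholder
    .option("password", "my_password")       # placeholder
    .option("batchsize", 10000)              # rows per JDBC batch insert
    .mode("append")
    .save()
)
```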

Latest Reply
User16764241763
Honored Contributor
  • 1 kudos

Hi, if you are using an Azure SQL DB Managed Instance, could you please file a support request with the Azure team? This is to review any timeouts or perf issues on the backend. Also, it seems like the timeout is coming from SQL Server, which is closing the conn...

6 More Replies
User15787040559
by New Contributor III
  • 3337 Views
  • 1 reply
  • 0 kudos

How many records does Spark use to infer the schema? The entire file, or just the first "X" records?

It depends. If you specify the schema it will be zero; otherwise it will do a full file scan, which doesn't work well when processing big data at a large scale. CSV files DataFrame Reader: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

As indicated, there are ways to manage the amount of data being sampled for inferring the schema. However, as a best practice for production workloads, it's always best to define the schema explicitly for consistency, repeatability, and robustness of the pipe...
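A minimal sketch contrasting the two options discussed in this thread, assuming a CSV source (the path and columns are hypothetical): sampling-limited inference versus an explicit schema, which skips the inference scan entirely.

```python
# Minimal sketch: schema inference with sampling vs. an explicit schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Option 1: infer the schema, but only sample a fraction of rows to limit the scan.
inferred_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("samplingRatio", 0.1)   # inspect roughly 10% of rows when inferring types
    .csv("/mnt/raw/events.csv")     # hypothetical path
)

# Option 2 (recommended for production): supply the schema, so no records are scanned for inference.
schema = StructType([
    StructField("event_id", IntegerType()),
    StructField("event_name", StringType()),
])
explicit_df = spark.read.option("header", "true").schema(schema).csv("/mnt/raw/events.csv")
```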
