Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

PragyaS
by New Contributor
  • 1771 Views
  • 2 replies
  • 1 kudos

Resolved! Decimal Automatic Rounding on sum

I am facing an issue: I have implemented code that performs a sum on decimal values, with precision set as Decimal(19,2). When calculating over big data I get a different value than the one from my .NET utility application. e.g. From...
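As a rough illustration of the scenario described in the truncated post (this is not the poster's code; the column name and values are made up), the sketch below shows how PySpark widens the decimal type when summing a Decimal(19,2) column, and how an explicit cast makes the result type visible, which can be a useful starting point when totals disagree with another tool.

```python
# Minimal sketch: summing a Decimal(19,2) column in PySpark (illustrative data).
from decimal import Decimal
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("amount", DecimalType(19, 2))])
df = spark.createDataFrame(
    [(Decimal("19999999999999999.99"),), (Decimal("0.01"),)], schema
)

# Spark widens the result type of SUM on decimals; inspect it explicitly.
summed = df.agg(F.sum("amount").alias("total"))
summed.printSchema()
summed.show(truncate=False)

# Casting to a wider decimal before aggregating makes the result type explicit.
widened = df.select(F.col("amount").cast(DecimalType(38, 2)).alias("amount"))
widened.agg(F.sum("amount")).show(truncate=False)
```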

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Pragya Sharma, we haven't heard from you since the last response from @Werner Stinckens. Kindly share the requested information with us so we can provide the necessary solution. Thanks and regards

1 More Reply
Satty
by New Contributor
  • 5939 Views
  • 1 reply
  • 0 kudos

Solution for ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

Whenever I try to load multiple files into a single dataframe for processing (the overall file size is more than 15 GB in the dataframe at the end of the loop), my code crashes every time with the error below... ConnectException error: Thi...

Latest Reply
pvignesh92
Honored Contributor
  • 0 kudos

@Satish Agarwal It seems your system memory is not sufficient to load the 15 GB of files. I believe you are using a Python Pandas DataFrame to load them rather than Spark. Is there any particular reason you cannot use Spark for this?
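As a rough sketch of the reply's suggestion (file paths and the column name are hypothetical), this is one way to load many large files into a single Spark DataFrame so the data stays distributed across the cluster instead of being collected into driver memory as a pandas frame:

```python
# Minimal sketch: reading multiple large CSV files into one distributed DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark accepts a list of paths (or a glob) and reads them lazily in parallel.
paths = ["/mnt/raw/file1.csv", "/mnt/raw/file2.csv", "/mnt/raw/file3.csv"]  # hypothetical paths
df = (
    spark.read
    .option("header", "true")
    .csv(paths)
)

# Keep transformations distributed; only bring small aggregates back to the driver.
df.groupBy("some_key").count().show()  # "some_key" is a placeholder column
```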

Erik_L
by Contributor II
  • 6513 Views
  • 2 replies
  • 2 kudos

Joining a big amount of data causes "Out of disk space error", how to ingest?

What I am trying to do:
df = None
# For all of the IDs that are valid
for id in ids:
    # Get the parts of the data from different sources
    df_1 = spark.read.parquet(url_for_id)
    df_2 = spark.read.parquet(url_for_id)
    ...
    # Join together the pa...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Erik Louie: There are several strategies you can use to handle large joins like this in Spark. Use a broadcast join: if one of your dataframes is relatively small (less than 10-20 GB), you can use a broadcast join to avoid shuffling data. A bro...
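As a hedged illustration of the broadcast-join strategy mentioned in the reply (the paths and join key below are hypothetical), the small side is replicated to every executor so the large side never has to shuffle:

```python
# Minimal sketch: broadcast join of a large table with a small lookup table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

large_df = spark.read.parquet("/mnt/data/large_events")   # hypothetical path
small_df = spark.read.parquet("/mnt/data/small_lookup")   # hypothetical path

# Broadcast hint: Spark ships small_df to all executors instead of shuffling large_df.
joined = large_df.join(F.broadcast(small_df), on="id", how="left")  # "id" is a placeholder key

joined.write.mode("overwrite").parquet("/mnt/data/joined_output")   # hypothetical target
```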

1 More Reply
Rishabh-Pandey
by Esteemed Contributor
  • 1214 Views
  • 2 replies
  • 5 kudos

"Hey everyone, it seems like there's some confusion about enhanced autoscaling in Databricks lately. If you're feeling lost or unsure abo...

"Hey everyone, it seems like there's some confusion about enhanced autoscaling in Databricks lately. If you're feeling lost or unsure about how it works, don't worry - you're not"Enhanced autoscaling is a feature in Databricks that enables dynamic sc...

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 5 kudos

Very informative, thanks for sharing!

1 More Reply
spartakos
by New Contributor
  • 691 Views
  • 0 replies
  • 0 kudos

Big data ingest into Delta Lake

I have a feature table in BQ that I want to ingest into Delta Lake. This feature table in BQ has 100 TB of data. The table can be partitioned by DATE. What best practices and approaches can I take to ingest this 100 TB? In particular, what can I do to ...
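One possible approach, sketched under the assumption that the Spark BigQuery connector is available and that the project, dataset, table, and target path below are placeholders: read one DATE partition per run and append it to a date-partitioned Delta table, so each job stays bounded.

```python
# Minimal sketch: incremental BigQuery -> Delta Lake ingest, one DATE partition per run.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

ingest_date = "2023-01-01"  # would normally come from a job parameter

bq_df = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.feature_table")  # hypothetical table
    .load()
    .filter(F.col("DATE") == ingest_date)
)

(
    bq_df.write.format("delta")
    .mode("append")
    .partitionBy("DATE")
    .save("/mnt/delta/feature_table")  # hypothetical target path
)
```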

PJ
by New Contributor III
  • 1692 Views
  • 3 replies
  • 3 kudos

Resolved! How should you optimize <1GB delta tables?

I have seen the following documentation that details how you can work with the OPTIMIZE function to improve storage and querying efficiency. However, most of the documentation focuses on big data, 10 GB or larger. I am working with a ~7 million row ...

Latest Reply
PJ
New Contributor III
  • 3 kudos

Thank you @Hubert Dudek!! So I gather from your response that it's totally fine to have a delta table that lives in a single file of roughly 211 MB. And I can use OPTIMIZE in conjunction with ZORDER to filter on a frequently filtered, high-cardina...
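For reference, a minimal sketch of the OPTIMIZE/ZORDER call discussed above (table and column names are hypothetical; `spark` is the notebook's predefined SparkSession); on a table this small it typically rewrites the data into one file ordered by the chosen column.

```python
# Minimal sketch: compacting and Z-ordering a small Delta table from a notebook.
spark.sql("""
    OPTIMIZE my_catalog.my_schema.my_small_table
    ZORDER BY (customer_id)
""")

# Optional: inspect the resulting file layout and size.
spark.sql("DESCRIBE DETAIL my_catalog.my_schema.my_small_table").show(truncate=False)
```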

2 More Replies
ninjadev999
by New Contributor II
  • 5807 Views
  • 7 replies
  • 1 kudos

Resolved! Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks

I'm reading a huge CSV file containing 39,795,158 records and writing it into a MSSQL server, on Azure Databricks. The Databricks notebook is running on a cluster with 56 GB memory, 16 cores, and 12 workers. This is my code in Python and PySpark: from ...
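Since the original code is truncated, here is a hedged sketch of one common shape for this kind of job: a JDBC write to SQL Server with batching and a bounded number of parallel connections (the connection details, path, and table name are placeholders, not the poster's actual setup).

```python
# Minimal sketch: writing a large DataFrame to SQL Server over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("/mnt/raw/huge_file.csv")  # hypothetical path

(
    df.repartition(12)  # at most 12 concurrent JDBC connections, one per partition
    .write.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")  # placeholder
    .option("dbtable", "dbo.target_table")   # placeholder
    .option("user", "my_user")               # placeholder
    .option("password", "my_password")       # placeholder
    .option("batchsize", 10000)              # rows per JDBC batch insert
    .mode("append")
    .save()
)
```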

Latest Reply
User16764241763
Honored Contributor
  • 1 kudos

Hi, if you are using an Azure SQL DB Managed Instance, could you please file a support request with the Azure team? This is to review any timeouts or perf issues on the backend. Also, it seems like the timeout is coming from SQL Server, which is closing the conn...

6 More Replies
User15787040559
by New Contributor III
  • 3337 Views
  • 1 reply
  • 0 kudos

How many records does Spark use to infer the schema? The entire file, or just the first "X" records?

It depends. If you specify the schema it will be zero; otherwise it will do a full file scan, which doesn't work well when processing big data at a large scale. CSV files DataFrame Reader: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

As indicated, there are ways to manage the amount of data being sampled for inferring the schema. However, as a best practice for production workloads, it's always best to define the schema explicitly for consistency, repeatability, and robustness of the pipe...
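A minimal sketch contrasting the two options discussed in this thread, assuming a CSV source (the path and columns are hypothetical): sampling-limited inference versus an explicit schema, which skips the inference scan entirely.

```python
# Minimal sketch: schema inference with sampling vs. an explicit schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Option 1: infer the schema, but only sample a fraction of rows to limit the scan.
inferred_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("samplingRatio", 0.1)   # inspect roughly 10% of rows when inferring types
    .csv("/mnt/raw/events.csv")     # hypothetical path
)

# Option 2 (recommended for production): supply the schema, so no records are scanned for inference.
schema = StructType([
    StructField("event_id", IntegerType()),
    StructField("event_name", StringType()),
])
explicit_df = spark.read.option("header", "true").schema(schema).csv("/mnt/raw/events.csv")
```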
