Data Engineering

Delta Table with 130 columns taking time

Development
New Contributor III

Hi All,

We are facing an unusual issue while loading data into a Delta table using Spark SQL. We have one Delta table with around 135 columns that is also PARTITIONED BY a column. We are trying to load about 15 million rows into it, but the data still has not loaded even though the command has been running for the last 5 hours. Another table with around 15 columns and a volume of about 25 million rows processes fine, and the command completes within 5-10 minutes. Can anyone please help me understand the issue?

Thanks.

8 REPLIES

User16752247014
New Contributor II

@rakesh saini, PARTITIONED BY works best with medium-cardinality data and tables larger than roughly 100 GB; anything that doesn't fit those two criteria won't be a great candidate for partitioning. Instead, you should run OPTIMIZE, which speeds up your operations using Z-ordering. I'd also recommend that you check out the documentation on optimizing performance using file management.
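For reference, a minimal sketch of what that could look like from a Scala notebook cell, assuming the notebook's predefined spark session, a hypothetical Delta table named events, and a commonly filtered column event_date (neither name comes from this thread):

```scala
// Compact small files and co-locate rows on a frequently filtered column.
// Table and column names here are hypothetical placeholders.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```

The Z-order columns should be high-cardinality columns you actually filter on, and they cannot be the table's partition columns.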

Thanks @George Chirapurath for the reply.

We are facing this issue when we load the data into Delta for the first time.

Hi @rakesh saini, since you note this is a problem when you're loading into Delta, can you provide more detail on the source data you are trying to load, such as the data format (JSON, CSV, etc.)? Typically, a hanging job is due to the read and transform stages, not the write stage.

Other useful information that would help us assist:

  • A screenshot of the Explain Plan and/or the DAG in the Spark UI (a sketch of how to capture the plan follows this list)
  • A screenshot of the cluster metrics, e.g. from the Ganglia UI in Databricks; perhaps there is a memory or CPU bottleneck
  • The specs of your Spark cluster: node types, number of workers, etc.
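If it helps, a minimal sketch of how that plan can be captured in Scala, assuming the DataFrame being loaded is called df (a hypothetical name):

```scala
// Print the formatted physical plan (Spark 3.x) for the DataFrame being
// written; this is the "Explain Plan" requested above.
df.explain("formatted")

// On older Spark versions, the extended plan works as well.
df.explain(true)
```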

Kaniz
Community Manager

Hi @rakesh saini, just a friendly follow-up. Do you still need help, or did @Parker Temple's and @George Chirapurath's responses help you find the solution? Please let us know.

Development
New Contributor III

Hi @Kaniz Fatma, thanks for the follow-up.

Yes, I am still facing the same issue. As @Parker Temple mentioned the cluster configuration (memory, number of worker nodes, etc.), I will try to upgrade my ADB cluster first and then reload the data. Currently I am using a cluster with 16 GB of memory and 3 worker nodes.

Kaniz
Community Manager

Hi @rakesh saini, thank you for the reply. Please keep us updated until you find the best answer to your problem. Remember, we are here to serve you.

Development
New Contributor III

@Kaniz Fatma @Parker Temple I found the root cause: it is a serialization issue. We are using a UDF to derive a column on the DataFrame, and when we try to load the data into the Delta table or write it to a Parquet file we hit a serialization error. Can you please suggest the best way to create UDFs in Scala with an explicit return type, or an alternative to UDFs, ideally with an example?
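For reference, a minimal sketch of the two usual options, assuming the derived column is computed from an existing column named amount (a hypothetical name): a typed Scala UDF declared in a small top-level object, and the often-preferred alternative of built-in column functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}

// Keeping the function in a small top-level object avoids accidentally
// capturing a non-serializable enclosing class in the UDF's closure.
object Derivations extends Serializable {
  val bucketize: Double => String = amount => if (amount > 1000.0) "HIGH" else "LOW"
}

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(1500.0, 200.0).toDF("amount") // hypothetical input column

    // Option 1: a typed Scala UDF; the return type (String) is fixed by the
    // function's signature, so no explicit DataType is needed.
    val bucketizeUdf = udf(Derivations.bucketize)
    val withUdf = df.withColumn("bucket", bucketizeUdf(col("amount")))

    // Option 2 (usually preferable): express the same logic with built-in
    // column functions, so no closure is serialized and Catalyst can optimize it.
    val withoutUdf = df.withColumn(
      "bucket",
      when(col("amount") > 1000.0, "HIGH").otherwise("LOW")
    )

    withUdf.show()
    withoutUdf.show()
    spark.stop()
  }
}
```

If the original UDF refers to members of a non-serializable class (the usual cause of a "Task not serializable" error), moving the function into a standalone object as above, or replacing the UDF with built-in functions, typically resolves it; the built-in route also tends to be faster because Catalyst can optimize it, whereas a Scala UDF is a black box to the optimizer.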

Kaniz
Community Manager

Hi @rakesh saini, thank you for the update.
