Databricks snowflake dataframe.toPandas() taking more space and time

brij
New Contributor III

I have two tables with exactly the same rows and schema. One table resides in an Azure SQL Server database and the other in a Snowflake database. We have some existing code that we want to migrate from Azure SQL to Snowflake, but when we try to create a pandas DataFrame from the Snowflake table it takes 20 to 30 minutes (3.8 million records), while the same task against Azure SQL finishes in about 5 minutes. Also, to perform toPandas() against Azure SQL we were using a really small single-node cluster, but for Snowflake we have to use a cluster with 56 GB of memory (otherwise we get an OOM error).
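For context, a minimal sketch of the read path being compared, assuming the Snowflake Spark connector is used via format("snowflake"); the connection options and table name below are placeholders, not the poster's actual configuration:

# Sketch only: all option values and the table name are hypothetical placeholders.
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Read the full table through the Snowflake Spark connector ...
sdf = (spark.read
       .format("snowflake")
       .options(**sf_options)
       .option("dbtable", "<table_name>")
       .load())

# ... then convert to pandas on the driver; this is the step that is slow and memory hungry.
pdf = sdf.toPandas()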


Kaniz
Community Manager

Hi @brij! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise I will follow up with my team and get back to you soon. Thanks.

brij
New Contributor III

Hi @Kaniz Fatma, thanks a lot. I know this could be because of how Snowflake stores the data, but I also wanted to check here whether someone else has faced it.

Ryan_Chynoweth
Honored Contributor III

Just so I understand your issue, can you provide some more information on these questions?

  • You are reading the same data from both Azure SQL and Snowflake into a Spark DataFrame, and at that point you are converting the Spark DataFrame to a pandas DataFrame using the `toPandas()` method?
  • The process described above is taking 4-6 times longer with Snowflake than with Azure SQL?
  • If the answer to the two questions above is "yes", then I would be curious whether the number of partitions for both DataFrames is the same, as that could impact the process (see the sketch after this reply).
  • If the answer to the two questions above is "yes", then it is likely the performance of reading data from the databases that is taking so long. This is likely an issue on the Snowflake side not performing as expected when reading data.

My last question is: why are you converting 3.8 million rows to a pandas DataFrame? I would recommend keeping it as a Spark DataFrame, as pandas starts to reach its limits in processing speed once row counts hit the low millions.
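A quick way to compare the partition counts asked about above (a sketch only; azsql_df and snowflake_df are hypothetical names for the two source DataFrames):

# Hypothetical DataFrame names; substitute whatever variables hold the two reads.
print("Azure SQL partitions:", azsql_df.rdd.getNumPartitions())
print("Snowflake partitions:", snowflake_df.rdd.getNumPartitions())

# If one side is heavily over-partitioned, evening them out before toPandas()
# makes the comparison fairer, e.g.:
snowflake_df = snowflake_df.coalesce(azsql_df.rdd.getNumPartitions())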

brij
New Contributor III

Hi Ryan, the answer to the first two questions is yes. The partition count for Azure SQL is 1 and for Snowflake it is 46. Also, it's not only about time: with Snowflake we can see a spike in memory consumption as well (in general). While working with Snowflake we require more memory, and I don't know why; that's what I am trying to understand. Regarding your suggestion to keep it as a Spark DataFrame: yes, we are keeping that as our last option because it will require a lot of code changes, and we already have code which works fine with Azure SQL (3.8 million records).

Thanks for your reply.

Brijan
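As a side note on the toPandas() cost itself: Spark can use Apache Arrow to speed up the conversion, but it silently falls back to the slow path for column types it cannot handle. A sketch, assuming a reasonably recent Spark version (the exact config key differs between Spark 2.x and 3.x, and the DataFrame name is hypothetical):

# Spark 3.x key; on Spark 2.x the key is "spark.sql.execution.arrow.enabled".
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Fail loudly instead of silently falling back when Arrow cannot be used.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")

pdf = snowflake_df.toPandas()  # hypothetical DataFrame name from the sketch above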

brij
New Contributor III

I have found one issue with Snowflake, and it might be related: it could be because of how Snowflake stores all of its numeric data types.

https://community.snowflake.com/s/question/0D50Z00008ZddGDSAZ/we-snowflake-convert-smallint-int-data...
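A quick way to check whether this is what's happening (a sketch; snowflake_df is a hypothetical name for the DataFrame read from Snowflake): printing the schema shows whether integer columns are surfacing as wide decimals, which pandas then holds as Python objects.

# Hypothetical DataFrame name for the Snowflake read.
snowflake_df.printSchema()
# Columns declared as SMALLINT/INT in Snowflake may show up here as decimal(38,0).

# After toPandas(), such columns end up with dtype "object" (Python Decimal values)
# instead of a native numeric dtype, which is both slower and more memory hungry.
pdf = snowflake_df.limit(1000).toPandas()
print(pdf.dtypes)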

Ryan_Chynoweth
Honored Contributor III

Are you reading the entire table from Snowflake into a DataFrame? I would test how fast the query runs natively in Snowflake; if you are reading the entire table, a select * would do.

This could also be a performance hit from the Spark connector for Snowflake. The connector was built by Snowflake and it works great, but you could be seeing it throttle the read.

It seems to me that it is a Snowflake problem, because your code for Azure SQL is performant.
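One way to narrow this down is to push the query through the connector rather than pulling the whole table via dbtable, so Snowflake does the work before data reaches Spark. A sketch, reusing the hypothetical sf_options and table name placeholder from the earlier sketch; the "query" option is supported by the Snowflake Spark connector:

# Push a query down to Snowflake instead of reading the full table.
sdf = (spark.read
       .format("snowflake")
       .options(**sf_options)  # hypothetical connection options from the earlier sketch
       .option("query", "select * from <table_name>")
       .load())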

brij
New Contributor III

I was able to find the issue. It was happening because pandas doesn't understand Spark's decimal data type, so the decimal columns come through as object dtype, and that's why the rest of the Python code was taking so much time.

And after explicitly casting the type it is even faster 😎 😃. Thanks, everyone, for the input.

Here is the code I am using to handle all the decimal columns. Let us know if there is a better way:

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

def fix_decimal_type(df):
    # Collect the columns whose Spark type is a DecimalType
    decimal_columns = [c for c in df.columns if "decimal" in df.schema[c].dataType.simpleString()]
    # Cast each decimal column to double so pandas gets a native float dtype instead of object
    for c in decimal_columns:
        df = df.withColumn(c, col(c).cast(DoubleType()))
    return df
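For completeness, a usage sketch (the DataFrame name is hypothetical): cast the decimals first, then convert.

# Hypothetical DataFrame read from Snowflake; cast decimals before converting.
snowflake_df = fix_decimal_type(snowflake_df)
pdf = snowflake_df.toPandas()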

Anonymous
Not applicable

@Brijan Elwadhi​ - That's wonderful. Thank you for sharing your solution. 🙂
