cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks snowflake dataframe.toPandas() taking more space and time

brij
New Contributor III

I have 2 exactly same table(rows and schema). One table recides in AZSQL server data base and other one is in snowflake database. Now we have some existing code which we want to migrate from azsql to snowflake but when we are trying to create a pandas dataframe from the snowflake its takes 20 to 30 mins to do it(3.8 millon records) and in AZSQL same task, we are able to do it in 5 mins. Also to perfome toPandas for AZSQL we were using really small single node cluster and for snowflake we have to use a cluster with 56 gb of memory(otherwise we get OOM error).

1 ACCEPTED SOLUTION

Accepted Solutions

brij
New Contributor III

I was able to find the issue. it was happening because, python doesn't understand spark's decimal data type so, spark treats it as object and that's why rest of the python code was taking so much of time.

And after explicitly type casting it its even faster ๐Ÿ˜Ž ๐Ÿ˜ƒ . thanks guys for input.

here is the code, i am using to handle all the decimal column, Let us know if there is a better way :

def fix_decimal_type(df):
    decimal_columns = list(filter(None,[(col_ if "decimal" in unseen_df_sp.schema[col_].dataType.simpleString()else '') for col_ in df.columns]))
 
    for col_ in decimal_columns:
          df=df.withColumn(col_, col(col_).cast(DoubleType()))
    return df

View solution in original post

7 REPLIES 7

brij
New Contributor III

Hi @Kaniz Fatmaโ€‹ , Thanks a lot, I know this could be because of how snowflake stores the data but also wanted to check here if someone have faced it.

Ryan_Chynoweth
Esteemed Contributor

Just so I understand your issue, can you provide some more information to these questions?

  • You are reading the same data from both Azure SQL and Snowflake into a Spark Dataframe. At that point you are converting the spark dataframe to a Pandas dataframe using the `toPandas()` method?
  • The process described above is taking 4-6 times longer when using Snowflake than when compared to Azure SQL?
  • If the answer to the two questions above is "yes", then I would be curious if the number of partitions for both Dataframes are the same, as that could impact the process.
  • If the answer to the two questions above is "yes", then it is likely the performance of reading data from the databases which is taking so long. This is likely an issue on the Snowflake side not performing as expected when reading data.

My last question, is why are you converting 3.8 million rows to a pandas Dataframe? I would recommend keeping it as a Spark Dataframe as Pandas does start to reach it limitations of processing speed when rows reach the low millions.

brij
New Contributor III

Hi Ryan, Answer to the first 2 question is yes. and partitions for the AZSQL is 1 and for snowflake is 46. Also It's not only about time, with snowflake, We can see a spike in memory consumption as well(in general). While working with snowflake we require more memory, i don't know why? And that's what I am trying to understand. Regarding your suggestion for keeping it in spark data frame, Yes we are keeping it as our last option because it will require lot of code changes, we already have code which works fine with AZSQL(3.8 million records.

thanks for your reply.

Brijan

brij
New Contributor III

I have found one issue with snowflake and this might be related, it could be because of how snowflake stores all of its numeric data type.

https://community.snowflake.com/s/question/0D50Z00008ZddGDSAZ/we-snowflake-convert-smallint-int-data...

Ryan_Chynoweth
Esteemed Contributor

Are you reading the entire table from Snowflake into a Dataframe? I would test to see how fast the query runs in Snowflake natively. If you are reading the entire table then select * would do.

This could also be a performance hit from the Spark connector for Snowflake. This connector was built by Snowflake and it works great but you could be seeing a throttle by it.

It seems to me that it is a Snowflake problem because your code for Azure SQL is performant.

brij
New Contributor III

I was able to find the issue. it was happening because, python doesn't understand spark's decimal data type so, spark treats it as object and that's why rest of the python code was taking so much of time.

And after explicitly type casting it its even faster ๐Ÿ˜Ž ๐Ÿ˜ƒ . thanks guys for input.

here is the code, i am using to handle all the decimal column, Let us know if there is a better way :

def fix_decimal_type(df):
    decimal_columns = list(filter(None,[(col_ if "decimal" in unseen_df_sp.schema[col_].dataType.simpleString()else '') for col_ in df.columns]))
 
    for col_ in decimal_columns:
          df=df.withColumn(col_, col(col_).cast(DoubleType()))
    return df

Anonymous
Not applicable

@Brijan Elwadhiโ€‹ - That's wonderful. Thank you for sharing your solution. ๐Ÿ™‚

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group