
Databricks snowflake dataframe.toPandas() taking more space and time

brij
New Contributor III

I have two tables that are exactly the same (rows and schema). One table resides in an Azure SQL Server database and the other in a Snowflake database. We have some existing code that we want to migrate from Azure SQL to Snowflake, but when we try to create a pandas DataFrame from the Snowflake table it takes 20 to 30 minutes (3.8 million records), whereas the same task against Azure SQL finishes in about 5 minutes. Also, to run toPandas() against Azure SQL we were using a really small single-node cluster, but for Snowflake we have to use a cluster with 56 GB of memory (otherwise we get an OOM error).
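
For context, here is a minimal sketch of the read path in question (connection options are placeholders, not our real configuration; `spark` is the notebook session and the variable names are illustrative):

# Read the table through the Snowflake Spark connector (placeholder options).
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

df_sf = (spark.read
         .format("snowflake")
         .options(**sf_options)
         .option("dbtable", "<table>")
         .load())

# This is the step that takes 20-30 minutes and drives the memory pressure.
pdf = df_sf.toPandas()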


7 REPLIES

brij
New Contributor III

Hi @Kaniz Fatma, thanks a lot. I know this could be because of how Snowflake stores the data, but I also wanted to check here whether someone has faced this before.

Ryan_Chynoweth
Esteemed Contributor

Just so I understand your issue, can you provide some more information on these questions?

  • You are reading the same data from both Azure SQL and Snowflake into a Spark DataFrame, and at that point you are converting the Spark DataFrame to a pandas DataFrame using the `toPandas()` method?
  • The process described above is taking 4-6 times longer with Snowflake than with Azure SQL?
  • If the answer to the two questions above is "yes", then I would be curious whether the number of partitions is the same for both DataFrames, as that could impact the process.
  • If the answer to the two questions above is "yes", then it is likely the read from the database that is taking so long, which points to the Snowflake side not performing as expected.

My last question is: why are you converting 3.8 million rows to a pandas DataFrame? I would recommend keeping it as a Spark DataFrame, as pandas starts to hit its processing-speed limits once row counts reach the low millions.
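
To compare the partition counts I am asking about, something like the following works (df_azsql and df_sf are placeholder names for the two Spark DataFrames, not your actual variables):

print("Azure SQL partitions:", df_azsql.rdd.getNumPartitions())
print("Snowflake partitions:", df_sf.rdd.getNumPartitions())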

brij
New Contributor III

Hi Ryan, the answer to the first two questions is yes. The partition count for Azure SQL is 1 and for Snowflake it is 46. Also, it's not only about time: with Snowflake we see a spike in memory consumption as well (in general). Working with Snowflake requires more memory, and I don't know why; that's what I am trying to understand. Regarding your suggestion to keep it as a Spark DataFrame: yes, we are keeping that as our last option, because it would require a lot of code changes and we already have code that works fine with Azure SQL (3.8 million records).

Thanks for your reply.

Brijan

brij
New Contributor III

I have found one issue with Snowflake that might be related: it could be because of how Snowflake stores all of its numeric data types.

https://community.snowflake.com/s/question/0D50Z00008ZddGDSAZ/we-snowflake-convert-smallint-int-data...
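
One way to check whether this is what is happening, sketched with placeholder names (df_azsql and df_sf for the two Spark reads):

# Compare how the numeric columns come through from each source.
df_azsql.printSchema()   # e.g. an INT column may stay IntegerType here
df_sf.printSchema()      # the same column may arrive as decimal(38,0) from Snowflake

# Decimal columns end up as object dtype in pandas, which is slow and memory hungry.
pdf_sf = df_sf.toPandas()
print(pdf_sf.dtypes)
print(pdf_sf.memory_usage(deep=True).sum() / 1e6, "MB")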

Ryan_Chynoweth
Esteemed Contributor

Are you reading the entire table from Snowflake into a DataFrame? I would test how fast the query runs natively in Snowflake. If you are reading the entire table, a select * would do.

This could also be a performance hit from the Spark connector for Snowflake. The connector was built by Snowflake and it works great, but you could be seeing throttling from it.

It seems to me that it is a Snowflake problem, because your code for Azure SQL is performant.
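
One rough way to separate the database read from the pandas conversion, assuming `df_sf` is the DataFrame read from Snowflake (a sketch, adjust to your code):

import time

t0 = time.time()
n = df_sf.count()            # forces a full scan/read from Snowflake
print(f"read/scan: {time.time() - t0:.1f} s for {n} rows")

t0 = time.time()
pdf = df_sf.toPandas()       # reads again and converts on the driver
print(f"toPandas: {time.time() - t0:.1f} s")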

brij
New Contributor III

I was able to find the issue. It was happening because pandas has no native dtype for Spark's decimal data type, so those columns come through as Python objects, and that's why the rest of the Python code was taking so much time.

After explicitly casting the types, it's even faster 😎 😃. Thanks, everyone, for your input.

Here is the code I am using to handle all the decimal columns. Let me know if there is a better way:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, DoubleType

def fix_decimal_type(df):
    # Collect every column whose Spark type is a decimal
    decimal_columns = [c for c in df.columns
                       if isinstance(df.schema[c].dataType, DecimalType)]

    # Cast each decimal column to double so toPandas() maps it to float64
    for col_ in decimal_columns:
        df = df.withColumn(col_, col(col_).cast(DoubleType()))
    return df
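
For comparison, the same cast can also be expressed as a single select; a possible alternative sketch of the same idea:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

def fix_decimal_type_select(df):
    # One projection that casts every decimal column to double
    # and passes the remaining columns through unchanged.
    return df.select([
        col(c).cast("double").alias(c)
        if isinstance(df.schema[c].dataType, DecimalType)
        else col(c)
        for c in df.columns
    ])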

Anonymous
Not applicable

@Brijan Elwadhi - That's wonderful. Thank you for sharing your solution. 🙂
