09-23-2021 02:11 AM
I have two tables with exactly the same rows and schema. One table resides in an Azure SQL Server database and the other is in a Snowflake database. We have existing code that we want to migrate from Azure SQL to Snowflake, but when we try to create a pandas DataFrame from the Snowflake table it takes 20 to 30 minutes (3.8 million records), while the same task against Azure SQL finishes in about 5 minutes. Also, to perform toPandas() against Azure SQL we were using a really small single-node cluster, but for Snowflake we have to use a cluster with 56 GB of memory (otherwise we get an OOM error).
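For context, the Snowflake read is roughly the sketch below. This is a simplified placeholder, not my exact job: the connection option values and the table name my_table are stand-ins, and I'm going through the Databricks-bundled Spark connector for Snowflake via format("snowflake").

    # Minimal sketch of the slow path: read the table through the Spark
    # connector for Snowflake, then collect it to pandas.
    # All option values and "my_table" are placeholders.
    sf_options = {
        "sfUrl": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "<schema>",
        "sfWarehouse": "<warehouse>",
    }

    sdf = (spark.read
           .format("snowflake")
           .options(**sf_options)
           .option("dbtable", "my_table")
           .load())

    pdf = sdf.toPandas()   # takes 20-30 minutes for ~3.8 million rows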
09-23-2021 02:22 AM
Hi @Kaniz Fatma, thanks a lot. I know this could be because of how Snowflake stores the data, but I also wanted to check here whether someone else has faced it.
09-23-2021 08:53 AM
Just so I understand your issue, can you provide some more information on these questions?
My last question is: why are you converting 3.8 million rows to a pandas DataFrame? I would recommend keeping it as a Spark DataFrame, as pandas starts to reach its limits on processing speed once row counts get into the low millions.
09-23-2021 09:10 AM
Hi Ryan, the answer to the first two questions is yes, and the partition count for the Azure SQL read is 1 while for Snowflake it is 46 (see the quick check below). It's also not only about time: with Snowflake we see a spike in memory consumption as well. Working with Snowflake requires more memory, and I don't know why; that's what I am trying to understand. Regarding your suggestion to keep it in a Spark DataFrame, yes, we are keeping that as our last option because it would require a lot of code changes; we already have code which works fine with Azure SQL (3.8 million records).
Thanks for your reply.
Brijan
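(For reference, this is roughly how I checked the partition counts above; azsql_df and snowflake_df are placeholder names for the two DataFrames, and the coalesce line is only an idea I'm considering, not something I've benchmarked.)

    # Quick check of the partition counts mentioned above.
    print(azsql_df.rdd.getNumPartitions())      # 1 for the Azure SQL read
    print(snowflake_df.rdd.getNumPartitions())  # 46 for the Snowflake read

    # Idea (not benchmarked): reduce partitions before collecting to pandas.
    # snowflake_df = snowflake_df.coalesce(8)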
09-27-2021 12:41 AM
I have found one issue with Snowflake that might be related: it could be because of how Snowflake stores all of its numeric data types.
09-27-2021 11:33 AM
Are you reading the entire table from Snowflake into a DataFrame? I would test how fast the query runs in Snowflake natively; if you are reading the entire table, then a select * would do (a sketch of pushing that query down is below).
This could also be a performance hit from the Spark connector for Snowflake. The connector was built by Snowflake and generally works great, but you could be seeing it throttle the read.
It seems to me that it is a Snowflake problem because your code for Azure SQL is performant.
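As a rough sketch of what I mean (the connection option values and table name are placeholders), you could push the same select * through the connector's query option and compare its wall-clock time with running that SQL natively in Snowflake:

    # Placeholder connection options for the Spark connector for Snowflake.
    sf_options = {"sfUrl": "<account>.snowflakecomputing.com", "sfUser": "<user>",
                  "sfPassword": "<password>", "sfDatabase": "<database>",
                  "sfSchema": "<schema>", "sfWarehouse": "<warehouse>"}

    sdf = (spark.read
           .format("snowflake")
           .options(**sf_options)
           .option("query", "select * from my_table")
           .load())

    sdf.count()   # forces the read so the timing can be compared with the native query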
10-01-2021 03:48 AM
I was able to find the issue. It was happening because pandas doesn't understand Spark's decimal data type, so the values come through as plain Python objects, and that's why the rest of the Python code was taking so much time.
After explicitly casting the type, it's even faster. Thanks, everyone, for the input.
Here is the code I am using to handle all the decimal columns. Let me know if there is a better way:
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

def fix_decimal_type(df):
    # cast every decimal column to double so pandas gets float64 instead of objects
    decimal_columns = [c for c in df.columns
                       if "decimal" in df.schema[c].dataType.simpleString()]
    for c in decimal_columns:
        df = df.withColumn(c, col(c).cast(DoubleType()))
    return df
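For completeness, usage looks like the sketch below (snowflake_df is a placeholder name for the DataFrame read from Snowflake); the single-select variant and the Arrow setting are just ideas I've seen suggested and haven't benchmarked here.

    from pyspark.sql.functions import col
    from pyspark.sql.types import DoubleType

    # Usage sketch: "snowflake_df" is a placeholder for the DataFrame read from Snowflake.
    clean_df = fix_decimal_type(snowflake_df)

    # Possible variation (untested here): the same casts in a single select
    # instead of chained withColumn calls.
    clean_df_alt = snowflake_df.select([
        col(c).cast(DoubleType()).alias(c)
        if "decimal" in snowflake_df.schema[c].dataType.simpleString() else col(c)
        for c in snowflake_df.columns
    ])

    # Arrow can also speed up the Spark-to-pandas conversion
    # (on older Spark versions the key is "spark.sql.execution.arrow.enabled").
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pdf = clean_df.toPandas()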
10-01-2021 12:17 PM
@Brijan Elwadhi - That's wonderful. Thank you for sharing your solution.