- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ09-23-2021 02:11 AM
I have 2 exactly same table(rows and schema). One table recides in AZSQL server data base and other one is in snowflake database. Now we have some existing code which we want to migrate from azsql to snowflake but when we are trying to create a pandas dataframe from the snowflake its takes 20 to 30 mins to do it(3.8 millon records) and in AZSQL same task, we are able to do it in 5 mins. Also to perfome toPandas for AZSQL we were using really small single node cluster and for snowflake we have to use a cluster with 56 gb of memory(otherwise we get OOM error).
- Labels:
-
Pandas dataframe
-
Pyspark
-
Space
-
Time
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ10-01-2021 03:48 AM
I was able to find the issue. it was happening because, python doesn't understand spark's decimal data type so, spark treats it as object and that's why rest of the python code was taking so much of time.
And after explicitly type casting it its even faster ๐ ๐ . thanks guys for input.
here is the code, i am using to handle all the decimal column, Let us know if there is a better way :
def fix_decimal_type(df):
decimal_columns = list(filter(None,[(col_ if "decimal" in unseen_df_sp.schema[col_].dataType.simpleString()else '') for col_ in df.columns]))
for col_ in decimal_columns:
df=df.withColumn(col_, col(col_).cast(DoubleType()))
return df
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ09-23-2021 02:22 AM
Hi @Kaniz Fatmaโ , Thanks a lot, I know this could be because of how snowflake stores the data but also wanted to check here if someone have faced it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ09-23-2021 08:53 AM
Just so I understand your issue, can you provide some more information to these questions?
- You are reading the same data from both Azure SQL and Snowflake into a Spark Dataframe. At that point you are converting the spark dataframe to a Pandas dataframe using the `toPandas()` method?
- The process described above is taking 4-6 times longer when using Snowflake than when compared to Azure SQL?
- If the answer to the two questions above is "yes", then I would be curious if the number of partitions for both Dataframes are the same, as that could impact the process.
- If the answer to the two questions above is "yes", then it is likely the performance of reading data from the databases which is taking so long. This is likely an issue on the Snowflake side not performing as expected when reading data.
My last question, is why are you converting 3.8 million rows to a pandas Dataframe? I would recommend keeping it as a Spark Dataframe as Pandas does start to reach it limitations of processing speed when rows reach the low millions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ09-23-2021 09:10 AM
Hi Ryan, Answer to the first 2 question is yes. and partitions for the AZSQL is 1 and for snowflake is 46. Also It's not only about time, with snowflake, We can see a spike in memory consumption as well(in general). While working with snowflake we require more memory, i don't know why? And that's what I am trying to understand. Regarding your suggestion for keeping it in spark data frame, Yes we are keeping it as our last option because it will require lot of code changes, we already have code which works fine with AZSQL(3.8 million records.
thanks for your reply.
Brijan
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ09-27-2021 12:41 AM
I have found one issue with snowflake and this might be related, it could be because of how snowflake stores all of its numeric data type.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ09-27-2021 11:33 AM
Are you reading the entire table from Snowflake into a Dataframe? I would test to see how fast the query runs in Snowflake natively. If you are reading the entire table then select * would do.
This could also be a performance hit from the Spark connector for Snowflake. This connector was built by Snowflake and it works great but you could be seeing a throttle by it.
It seems to me that it is a Snowflake problem because your code for Azure SQL is performant.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ10-01-2021 03:48 AM
I was able to find the issue. it was happening because, python doesn't understand spark's decimal data type so, spark treats it as object and that's why rest of the python code was taking so much of time.
And after explicitly type casting it its even faster ๐ ๐ . thanks guys for input.
here is the code, i am using to handle all the decimal column, Let us know if there is a better way :
def fix_decimal_type(df):
decimal_columns = list(filter(None,[(col_ if "decimal" in unseen_df_sp.schema[col_].dataType.simpleString()else '') for col_ in df.columns]))
for col_ in decimal_columns:
df=df.withColumn(col_, col(col_).cast(DoubleType()))
return df
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ10-01-2021 12:17 PM
@Brijan Elwadhiโ - That's wonderful. Thank you for sharing your solution. ๐