01-06-2023 01:48 PM
I have a dataframe that inexplicably takes forever to write to an MS SQL Server even though other dataframes, even much larger ones, write nearly instantly.
I'm using this code:
my_dataframe.write.format("jdbc") 
    .option("url",sqlsUrl) 
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") 
    .option("dbtable", table_name )
    .option("user", username) 
    .option("password", password) 
    .save()that'll take hours to run. It's a ~400k row dataset, BABY size. Space on disk is ~150mb. I have dataframes several orders of magnitude larger that write in maybe 15 seconds using the same code.
I repartitioned it using 32 partitions and each node took 4-8 (and one outlier at 25 minutes) minutes to run. for 12,000 rows/5mb of data. This is pandas.to_sql level performance.
This is the schema of the dataframe:
StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,TimestampType,true)and when I write to csv and read it back it all becomes StringType and the write is nearly instant.
On the SQL server these are the data types that it ends up being in the slow query:
datetime
datetime
datetime
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
decimal
decimal
decimal
decimal
decimal
int
nvarchar
int
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
datetimehelppp
01-06-2023 02:40 PM
You can debug network issues (is a private link used for connection to MS SQL - avoid public internet) and additionally check performance on MS SQL side. For example, I am using Azure MS SQL and can see their overload, network traffic etc. If you append data, maybe there are indexes. I don't know your architecture, so is hard to guess. CSV is done locally by Spark so it is usually pretty fast.
01-07-2023 08:05 AM
right
01-08-2023 07:35 PM
I'll check, but just to clarify about the csv thing, I meant to say that it is faster if I write to csv, then read from csv, then save to the on-site ms sql server after. I'm saying that doing that additional step makes the loading to ms sql server go quicker for some reason.
01-29-2023 10:28 PM
I meet the same problem and I don't know how to write dataFrame to MS sql server quickly
07-24-2024 04:00 AM
Had a similar issue. I can do 1-4 million rows in 1 minute via SSIS ETL on SQL server. Table is 15 fields long. Looking at your code it seems you have many fields but nothing like 300-400 fields which can affect performance. You can check SQL Server Profiler where it gives you the query time for each SQL executed but if the command by databricks issued against SQL Server is not BULKUPLOAD it will take longer for sure.
08-23-2024 08:12 AM
Can you provide an example for us?
 
					
				
				
			
		
 
					
				
				
			
		
Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!
Sign Up Now