Why is writing to MSSQL Server 12.0 so slow directly from Spark, but nearly instant when I write to a CSV and read it back?

jonathan-dufaul
Valued Contributor

I have a dataframe that inexplicably takes forever to write to an MS SQL Server even though other dataframes, even much larger ones, write nearly instantly.

I'm using this code:

(my_dataframe.write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save())

That'll take hours to run. It's a ~400k row dataset, BABY size, about 150 MB on disk. I have dataframes several orders of magnitude larger that write in maybe 15 seconds using the same code.

I repartitioned it into 32 partitions, and each task took 4-8 minutes to run (with one outlier at 25 minutes) for roughly 12,000 rows / 5 MB of data per partition. This is pandas.to_sql-level performance.
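For reference, the repartitioned write looks roughly like this (the batchsize option is just a standard Spark JDBC write knob I could experiment with, not something I've confirmed makes a difference):

# same dataframe and connection variables as above; 32 partitions as described
(my_dataframe
    .repartition(32)
    .write
    .format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    # batchsize controls how many rows go into each JDBC batch insert (default 1000);
    # raising it is only an experiment, not a confirmed fix
    .option("batchsize", 10000)
    .save())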

This is the schema of the dataframe:

StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,TimestampType,true)

When I write it to CSV and read it back, every column becomes StringType and the write to SQL Server is nearly instant.

On the SQL Server side, these are the data types the table ends up with in the slow case:

datetime
datetime
datetime
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
decimal
decimal
decimal
decimal
decimal
int
nvarchar
int
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
datetime

helppp

4 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can debug network issues (is a private link used for the connection to MS SQL? avoid the public internet) and additionally check performance on the MS SQL side. For example, I am using Azure MS SQL and can see its load, network traffic, etc. If you are appending data, maybe there are indexes slowing the inserts down. I don't know your architecture, so it is hard to guess. The CSV write is done locally by Spark, so it is usually pretty fast.

Aviral-Bhardwaj
Esteemed Contributor III

right

jonathan-dufaul
Valued Contributor

I'll check, but just to clarify about the csv thing: I meant that it is faster if I write to CSV, then read from CSV, and then save to the on-site MS SQL Server afterward. Doing that additional step makes the load into MS SQL Server go quicker, for some reason.
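Concretely, the workaround is roughly this (the CSV path below is just a placeholder):

# 1. dump the dataframe to CSV somewhere temporary (path is a placeholder)
my_dataframe.write.mode("overwrite").option("header", True).csv("/tmp/my_dataframe_csv")

# 2. read it back; with no schema and no inferSchema, every column is StringType
stringified = spark.read.option("header", True).csv("/tmp/my_dataframe_csv")

# 3. the exact same JDBC write, which now finishes quickly
(stringified.write
    .format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save())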

yueyue_tang
New Contributor II

I'm running into the same problem and don't know how to write a dataframe to MS SQL Server quickly.
