Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Why is writing to MSSQL Server 12.0 so slow directly from Spark, but nearly instant when I write to a CSV and read it back?

jonathan-dufaul
Valued Contributor

I have a dataframe that inexplicably takes forever to write to an MS SQL Server even though other dataframes, even much larger ones, write nearly instantly.

I'm using this code:

(my_dataframe.write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save())

That takes hours to run. It's a ~400k-row dataset, BABY size, ~150 MB on disk. I have dataframes several orders of magnitude larger that write in maybe 15 seconds using the same code.

I repartitioned it into 32 partitions, and each task took 4-8 minutes (with one outlier at 25 minutes) to write roughly 12,000 rows / 5 MB of data. This is pandas .to_sql level performance.
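For reference, a minimal tuning sketch of the same write, assuming the same sqlsUrl / table_name / username / password variables as above; batchsize is a standard Spark JDBC write option, and the values shown are just starting points, not magic numbers:

(my_dataframe
    .repartition(8)  # fewer, larger partitions = fewer concurrent connections to SQL Server
    .write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .option("batchsize", 10000)  # rows per JDBC batch; the default is 1000
    .mode("append")
    .save())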

This is the schema of the dataframe:

StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,TimestampType,true)

When I write it to CSV and read it back, every column becomes StringType and the write to SQL Server is nearly instant.
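As a sketch, that roundtrip looks like this (the scratch path is hypothetical):

# Write out to CSV, read it back (every column comes back as StringType),
# then push the stringly-typed frame to SQL Server.
tmp_path = "/tmp/my_dataframe_roundtrip"  # hypothetical scratch location
my_dataframe.write.mode("overwrite").option("header", True).csv(tmp_path)

roundtripped = spark.read.option("header", True).csv(tmp_path)

(roundtripped.write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save())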

On the SQL Server side, these are the data types the columns end up with in the slow write:

datetime
datetime
datetime
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
decimal
decimal
decimal
decimal
decimal
int
nvarchar
int
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
datetime

helppp
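One hedged experiment, given that every StringType above lands as nvarchar: Spark's createTableColumnTypes option lets you pin narrower SQL types when Spark creates the table. The column names below are hypothetical stand-ins for the redacted ones:

(my_dataframe.write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    # "code" and "description" are hypothetical names; list the real columns here.
    # Only applies when Spark creates the table, not on appends to an existing one.
    .option("createTableColumnTypes", "code VARCHAR(32), description VARCHAR(256)")
    .save())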

6 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can debug network issues (is a private link used for the connection to MS SQL? avoid the public internet) and also check performance on the MS SQL side. For example, I use Azure MS SQL and can see its load, network traffic, etc. If you are appending data, maybe there are indexes slowing down the inserts. I don't know your architecture, so it is hard to guess. The CSV write is done locally by Spark, so it is usually pretty fast.

Aviral-Bhardwaj
Esteemed Contributor III

right

@AviralBhardwaj

I'll check, but just to clarify the CSV thing: I meant that it's faster if I write to CSV, then read it back from CSV, and then save to the on-site MS SQL Server. Adding that extra step somehow makes the load into MS SQL Server go quicker.

yueyue_tang
New Contributor II

I'm hitting the same problem and don't know how to write a DataFrame to MS SQL Server quickly.

plondon
New Contributor II

Had a similar issue. I can do 1-4 million rows in 1 minute via SSIS ETL on SQL Server, with a table 15 fields wide. Looking at your code, you have many fields, but nothing like the 300-400 fields that would really hurt performance. You can check SQL Server Profiler, which gives you the execution time of each SQL statement; but if the command Databricks issues against SQL Server is not a bulk insert, it will take longer for sure.

Can you provide an example for us?
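For comparison, here is a minimal sketch of a bulk-copy write using Microsoft's Apache Spark connector for SQL Server (format com.microsoft.sqlserver.jdbc.spark), assuming that connector library is installed on the cluster:

(my_dataframe.write
    .format("com.microsoft.sqlserver.jdbc.spark")  # uses bulk insert under the hood
    .mode("append")
    .option("url", sqlsUrl)
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .option("tableLock", "true")  # connector option: a table lock speeds up bulk loads
    .option("batchsize", 10000)
    .save())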
