cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Why is writing to MSSQL Server 12.0 so slow directly from spark but nearly instant when I write to a csv and read it back

jonathan-dufaul
Valued Contributor

I have a dataframe that inexplicably takes forever to write to an MS SQL Server even though other dataframes, even much larger ones, write nearly instantly.

I'm using this code:

my_dataframe.write.format("jdbc") 
    .option("url",sqlsUrl) 
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") 
    .option("dbtable", table_name )
    .option("user", username) 
    .option("password", password) 
    .save()

that'll take hours to run. It's a ~400k row dataset, BABY size. Space on disk is ~150mb. I have dataframes several orders of magnitude larger that write in maybe 15 seconds using the same code.

I repartitioned it using 32 partitions and each node took 4-8 (and one outlier at 25 minutes) minutes to run. for 12,000 rows/5mb of data. This is pandas.to_sql level performance.

This is the schema of the dataframe:

StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,TimestampType,true)

and when I write to csv and read it back it all becomes StringType and the write is nearly instant.

On the SQL server these are the data types that it ends up being in the slow query:

datetime
datetime
datetime
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
decimal
decimal
decimal
decimal
decimal
int
nvarchar
int
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
datetime

helppp

4 REPLIES 4

Hubert-Dudek
Esteemed Contributor III

You can debug network issues (is a private link used for connection to MS SQL - avoid public internet) and additionally check performance on MS SQL side. For example, I am using Azure MS SQL and can see their overload, network traffic etc. If you append data, maybe there are indexes. I don't know your architecture, so is hard to guess. CSV is done locally by Spark so it is usually pretty fast.

Aviral-Bhardwaj
Esteemed Contributor III

right

I'll check, but just to clarify about the csv thing, I meant to say that it is faster if I write to csv, then read from csv, then save to the on-site ms sql server after. I'm saying that doing that additional step makes the loading to ms sql server go quicker for some reason.

yueyue_tang
New Contributor II

I meet the same problem and I don't know how to write dataFrame to MS sql server quickly​

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!