Why is writing to MSSQL Server 12.0 so slow directly from Spark, but nearly instant when I write to a CSV and read it back?

jonathan-dufaul
Valued Contributor

I have a dataframe that inexplicably takes forever to write to an MS SQL Server even though other dataframes, even much larger ones, write nearly instantly.

I'm using this code:

(my_dataframe.write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save())

That'll take hours to run. It's a ~400k row dataset, BABY size, about 150 MB on disk. I have dataframes several orders of magnitude larger that write in maybe 15 seconds using the same code.

I repartitioned it into 32 partitions, and each task took 4-8 minutes to run (with one outlier at 25 minutes) for roughly 12,000 rows / 5 MB of data per partition. This is pandas.to_sql-level performance.
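For reference, the repartitioned write looks roughly like this (the batchsize option is just a standard Spark JDBC write knob I could experiment with, not something I've confirmed makes a difference):

# same dataframe and connection variables as above; 32 partitions as described
(my_dataframe
    .repartition(32)
    .write
    .format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    # batchsize controls how many rows go into each JDBC batch insert (default 1000);
    # raising it is only an experiment, not a confirmed fix
    .option("batchsize", 10000)
    .save())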

This is the schema of the dataframe:

StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,TimestampType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,DateType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,DecimalType(13,2),true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,DecimalType(10,6),true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,StringType,true)
StructField(redacted,IntegerType,true)
StructField(redacted,TimestampType,true)

When I write it to CSV and read it back, every column becomes StringType and the write to SQL Server is nearly instant.

On the SQL Server side, these are the data types the table ends up with in the slow case:

datetime
datetime
datetime
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
int
nvarchar
date
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
decimal
decimal
decimal
decimal
decimal
int
nvarchar
int
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
decimal
decimal
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
nvarchar
int
datetime

helppp

4 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can debug network issues (is a private link used for the connection to MS SQL? avoid the public internet) and additionally check performance on the MS SQL side. For example, I am using Azure MS SQL and can see its load, network traffic, etc. If you are appending data, maybe there are indexes slowing the inserts down. I don't know your architecture, so it is hard to guess. The CSV write is done locally by Spark, so it is usually pretty fast.

Aviral-Bhardwaj
Esteemed Contributor III

right

jonathan-dufaul
Valued Contributor

I'll check, but just to clarify about the csv thing: I meant that it is faster if I write to CSV, then read from CSV, and then save to the on-site MS SQL Server afterward. Doing that additional step makes the load into MS SQL Server go quicker, for some reason.
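Concretely, the workaround is roughly this (the CSV path below is just a placeholder):

# 1. dump the dataframe to CSV somewhere temporary (path is a placeholder)
my_dataframe.write.mode("overwrite").option("header", True).csv("/tmp/my_dataframe_csv")

# 2. read it back; with no schema and no inferSchema, every column is StringType
stringified = spark.read.option("header", True).csv("/tmp/my_dataframe_csv")

# 3. the exact same JDBC write, which now finishes quickly
(stringified.write
    .format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save())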

yueyue_tang
New Contributor II

I'm running into the same problem and don't know how to write a dataframe to MS SQL Server quickly.
