Data Engineering

Silent failure in DataFrameWriter when loading data to Redshift

kishorekumar
New Contributor

Context:

I'm using DataFrameWriter to load a Dataset into Redshift. DataFrameWriter writes the Dataset to S3, then loads the data from S3 into Redshift by issuing a Redshift COPY command.

Issue:

We frequently observe that the data is present in S3 but has not been loaded into Redshift by DataFrameWriter, and in these situations the DataFrameWriter.save() operation succeeds without any error.

The latest DataFrameWriter documentation says that data could be missing or incomplete due to S3's eventually consistent listing operations. However, S3's official documentation says that eventual consistency applies only to bucket deletions and to versioned buckets, so I don't see how this supports the claim in the DataFrameWriter documentation that data could go missing.

Documentations blurb

DataFrameWriter Documentation (https://docs.databricks.com/external-data/amazon-redshift.html): "When reading from and writing to Redshift, the data source reads and writes data in S3. Both Spark and Redshift produce partitioned output and store it in multiple files in S3. According to the Amazon S3 Data Consistency Model documentation, S3 bucket listing operations are eventually-consistent, so the files must go to special lengths to avoid missing or incomplete data due to this source of eventual-consistency."

S3: S3 Data Consistency Model documentation (https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel)

Ask:

  1. Is it expected behaviour of DataFrameWriter that the DataFrameWriter.save() operation can succeed without any error even though the data was not loaded into Redshift?
  2. How can we detect and handle this situation?

Code:

// dataframe is a Dataset<Row>.
DataFrameWriter<Row> writer = dataframe.write()
        .mode(SaveMode.Append)
        .format("io.github.spark_redshift_community.spark.redshift")
        .option("url", this.config.getUrl())
        .option("user", username)
        .option("password", password)
        .option("tempdir", this.config.getTempS3Dir())
        .option("aws_iam_role", iamRole)
        .option("ssl", "true")
        .option("sslfactory", "org.redshift.ssl.NonValidatingFactory")
        .option("tempformat", DEFAULT_TEMP_FORMAT)
        .option("dbtable", tableName);
writer.save();
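Since save() itself does not confirm that the COPY landed, one defensive option is to verify row counts yourself after the write returns: record dataframe.count() and the table's row count before calling save(), then check the table's row count again afterwards over a direct JDBC connection. A minimal sketch; the class and method names here (RedshiftLoadCheck, countRows, verifyLoad) are hypothetical helpers, not part of the connector API:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical helper for post-load verification; not part of the
// spark-redshift connector.
public class RedshiftLoadCheck {

    // Count rows currently in the target table via a direct JDBC query.
    public static long countRows(Connection conn, String tableName) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + tableName)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    // Compare the number of rows that appeared in the table after save()
    // (countAfter - countBefore) against the number of rows in the Dataset.
    // For SaveMode.Append the delta should be at least the Dataset size.
    public static boolean verifyLoad(long expectedRows, long observedDelta) {
        return observedDelta >= expectedRows;
    }
}
```

Usage would look like: capture `long expected = dataframe.count()` and `long before = countRows(conn, tableName)` before `writer.save()`, then `long after = countRows(conn, tableName)` afterwards, and fail the job (or retry the load) when `verifyLoad(expected, after - before)` returns false, rather than trusting that save() returning without an exception means the COPY succeeded.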

1 REPLY

Anonymous
Not applicable

Hi @Kishorekumar Somasundaram

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.
