Hi @rt-slowth,
If the amount of data loaded from Amazon Redshift into Databricks decreases unexpectedly, the cause is usually a change in the data that the Redshift table exposes to Spark. There are several possible reasons.
Here are some possible causes you can investigate:
- Data updates or deletions: Check whether any data was updated or deleted in Redshift. A change in the Redshift source table directly reduces the amount of data exposed to the Spark cluster; a quick row-count check is sketched after this list.
- Redshift Spectrum limitations: Check whether the Redshift table is a Spectrum external table, as Spectrum's format limitations may affect how much data is available to Spark. For example, with Parquet data, the apparent data size can drop significantly if the data is heavily compressed or contains many empty or null values.
- Query optimizations: Check whether any optimizations to the Spark queries that read the Redshift data (for example, added filters or column pruning pushed down to Redshift) reduced the amount of data pulled.
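To rule out a source-side change, compare the row count Spark actually sees against what you expect. Here is a minimal sketch using the Databricks Redshift connector; the JDBC URL, table name, S3 staging directory, and IAM role are placeholders you would replace with your own values:

```python
# Load the Redshift table through the Databricks Redshift connector and
# count the rows Spark receives. All connection options are placeholders.
df = (
    spark.read.format("redshift")
    .option("url", "jdbc:redshift://<cluster>.redshift.amazonaws.com:5439/dev")  # placeholder
    .option("dbtable", "public.my_table")                                        # placeholder
    .option("tempdir", "s3a://<bucket>/redshift-temp/")                          # S3 staging dir
    .option("aws_iam_role", "arn:aws:iam::<account>:role/<role>")                # placeholder
    .load()
)

print(f"Rows visible to Spark: {df.count()}")
```

If this count matches a `SELECT COUNT(*)` run directly in Redshift, the source data itself changed; if it is lower, look at the query or connector configuration.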
To check the variables in a widget or the code used in each run, you can use the Databricks notebook context through the `dbutils.widgets` API and `dbutils.notebook.entry_point.getDbutils()`.
- Using `dbutils.widgets`: Use `dbutils.widgets` to create and set widget variables. For example, to set a widget variable `myVar` to the value `myValue`:

```python
dbutils.widgets.text("myVar", "myValue")
```
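To inspect what a given run actually used, you can read the widget value back and pull run metadata from the notebook context. A short sketch, assuming it runs inside a Databricks notebook:

```python
# Read the widget value back -- useful for logging which parameter
# values a particular run actually used.
my_var = dbutils.widgets.get("myVar")
print(f"myVar = {my_var}")

# The notebook context exposes run metadata such as the notebook path.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
print(ctx.notebookPath().get())
```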