Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Adding a new field to a Delta Live Table

Madalian
New Contributor III

Hi Experts,

In the Bronze layer, I have all the Delta merge files in Parquet format.

I am converting these files into Delta Live Tables in the Silver layer. While doing so, I am unable to add a current-timestamp column.

Here is the script:

from pyspark.sql.functions import *
from pyspark.sql.types import *
import dlt

currentdate = spark.sql("select string(current_timestamp)").collect()[0][0]

@dlt.table
def t_stream_silver_00_XXXX():
    return (
        spark.readStream
            .format("delta")
            .option("cloudFiles.format", "parquet")
            .option(key="SILVER_LOADING_TIME", value=currentdate)
            .load("abfss://bronze-prod-XXXXX@adlsg2SAPPPPpsi001.dfs.core.windows.net/PROD/XXXX/Delta/00_XXXX/")
    )

It did not throw any error, but at the same time no column is being added. Can you please help?

1 REPLY

Kaniz
Community Manager

Hi @Madalian, let’s address the issue you’re facing while converting your Parquet files from the Bronze layer to Delta Live Tables in the Silver layer.

  1. Column Addition: You’re trying to add a current-timestamp column named SILVER_LOADING_TIME to your Delta table. No error is thrown, yet the column never appears, because .option() sets reader configuration rather than adding a column; Spark readers silently ignore options they don’t recognize.

  2. Best Practices for the Bronze Layer: In the Medallion architecture, the Bronze layer stores raw ingested data largely as-is. Derived columns such as load timestamps are typically added in the Silver layer, so SILVER_LOADING_TIME belongs exactly where you are putting it.

  3. Troubleshooting the Issue: Let’s focus on resolving the issue by adding the timestamp column. Here are a few steps to consider:

    • Check the Data: First, verify that your Parquet files in the Bronze layer indeed contain the data you expect. Ensure that the relevant timestamp information is present.
    • Column Name Casing: Spark resolves column names case-insensitively by default (unless spark.sql.caseSensitive is enabled), while Delta Lake preserves the original casing. Make sure the name you’re using (SILVER_LOADING_TIME) doesn’t collide with an existing column in your Parquet files.
    • Column Data Type: Decide whether SILVER_LOADING_TIME should be a timestamp or a string. As written, your script casts the value with string(current_timestamp), so currentdate is a string, not a timestamp.
  4. Debugging Approach: To debug further, consider the following steps:

    • Inspect Schema: Load the Parquet files directly (outside the Delta table creation) and inspect the schema. Ensure that the timestamp column is present and has the correct data type.
    • Explicit Column Addition: Instead of the option(key = "SILVER_LOADING_TIME", value = currentdate) approach (which only sets a reader option), explicitly add the timestamp column with withColumn when building the DataFrame.
  5. Example Code Modification: Here’s an adjusted version of your script that explicitly adds the timestamp column:

    from pyspark.sql.functions import lit
    import dlt
    
    # Capture the load time once on the driver when the pipeline starts
    currentdate = spark.sql("select current_timestamp()").collect()[0][0]
    
    @dlt.table
    def t_stream_silver_00_XXXX():
        return (
            spark.readStream
                .format("delta")  # the Bronze source is Delta; cloudFiles options apply only to Auto Loader
                .load("abfss://bronze-prod-XXXXX@adlsg2SAPPPPpsi001.dfs.core.windows.net/PROD/XXXX/Delta/00_XXXX/")
                .withColumn("SILVER_LOADING_TIME", lit(currentdate))
        )
    
  6. Further Investigation: If the issue persists, consider checking the logs or monitoring the execution to identify any additional clues. Also, ensure that the Delta Lake version you’re using supports the features you’re trying to utilize.

Remember that the Bronze layer primarily focuses on quick change data capture (CDC) and historical archiving of raw data.

Feel free to adapt the code snippet above and explore other debugging options to resolve the issue.

If you encounter any specific errors or need further assistance, feel free to ask! 😊

 