<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to print out logs during DLT pipeline run in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/120957#M46289</link>
    <description>&lt;P&gt;One option is to emit logs to stdout/stderr:&lt;/P&gt;&lt;P&gt;The sample code below worked on a Unity Catalog (UC) DLT cluster running&lt;FONT color="#0000FF"&gt;&lt;EM&gt;&amp;nbsp;dlt:16.4.0-delta-pipelines-photon-dlt-release-dp-2025.20-rc0-commit-fcedf0a-image-be34de2&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
import logging
import sys

from pyspark.sql.functions import col
from utilities import utils  # project-specific helper module (provides distance_km)

# Configure Python logging to stdout
logger = logging.getLogger("DLTLogger")
logger.setLevel(logging.INFO)

handler = logging.StreamHandler(sys.stdout) # Change to sys.stderr for stderr
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)

# Avoid duplicate handlers if rerun in notebook
if not logger.handlers:
  logger.addHandler(handler)

@dlt.table
def sample_trips_dlt_logging_test():
  logger.info("dlt_logging## Reading sample trips data from Delta table.")
  df = spark.read.table("samples.nyctaxi.trips")

  logger.info(f"dlt_logging## Schema of the sample trips data: {df.schema.simpleString()}")
  # Note: df.count() is a Spark action, so it runs a separate job just to produce this log line
  logger.info(f"dlt_logging## Number of rows read: {df.count()}")

  df = df.withColumn("trip_distance_km", utils.distance_km(col("trip_distance")))
  logger.info("dlt_logging## Added trip_distance_km column to the sample trips data.")

  return df&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 04 Jun 2025 17:14:25 GMT</pubDate>
    <dc:creator>User16871418122</dc:creator>
    <dc:date>2025-06-04T17:14:25Z</dc:date>
    <item>
      <title>How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/82303#M36603</link>
      <description>&lt;P&gt;I'm trying to debug my DLT pipeline and need some log output at runtime. How can I do something like print('something') during a DLT run?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 05:03:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/82303#M36603</guid>
      <dc:creator>ruoyuqian</dc:creator>
      <dc:date>2024-08-08T05:03:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/89178#M37716</link>
      <description>&lt;P&gt;I have the same question. This will help the debug process.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 09 Sep 2024 12:26:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/89178#M37716</guid>
      <dc:creator>kranthi2</dc:creator>
      <dc:date>2024-09-09T12:26:16Z</dc:date>
    </item>
    <item>
      <title>Re: How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/89383#M37776</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/114079"&gt;@ruoyuqian&lt;/a&gt;,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/11193"&gt;@kranthi2&lt;/a&gt;,&lt;/P&gt;&lt;H3&gt;&lt;FONT size="4"&gt;Why print() Statements Won’t Work in DLT:&lt;/FONT&gt;&lt;/H3&gt;&lt;P&gt;In Databricks Delta Live Tables (DLT), print()&amp;nbsp;statements do not work as expected for logging. DLT runs as a managed pipeline, and its execution environment differs from regular Databricks notebooks. Output from print() statements is not captured and displayed in the same way, which makes it ineffective for debugging during pipeline runs.&lt;/P&gt;&lt;H3&gt;&lt;FONT size="4"&gt;Alternative Solution: Using Log4j to Log to the Driver Log&lt;/FONT&gt;&lt;/H3&gt;&lt;P&gt;To log information during a DLT pipeline run, you can get a Log4j logger through the Spark JVM gateway (spark._jvm) and write to the driver logs. Here is an example of how to set this up within a DLT pipeline:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
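from pyspark.sql.functions import col

# Get a Log4j logger through the py4j JVM gateway.
# Note: replies later in this thread report that this call fails with
# PY4J_BLOCKED_API on serverless DLT.
log4jLogger = spark._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)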

@dlt.table(
    comment="This is the raw data from the sample source table."
)
def read_source_data():
    # Log the start of reading data
    logger.info("Reading data from the source table.")
    
    # Read data from the source table
    df = spark.table("sample_source")
    
    # Log the schema and number of rows read
    logger.info(f"Schema of the source table: {df.schema.simpleString()}")
    logger.info(f"Number of rows read: {df.count()}")
    
    return df

@dlt.table(
    comment="This table contains transformed data."
)
def transform_data():
    logger.info("Transforming data from the source table.")
    
    # Read the raw data and apply a transformation
    df = dlt.read("read_source_data").withColumn("value_doubled", col("value") * 2)
    
    # Log transformation completion
    logger.info(f"Transformation completed. Output schema: {df.schema.simpleString()}")
    
    return df&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After running the DLT pipeline, navigate to the driver log:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1726005018859.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11058i92579C6FC8E438DD/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1726005018859.png" alt="filipniziol_0-1726005018859.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Download the log file:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_1-1726005083772.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11059iFE1A1CDCEEBCD32E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_1-1726005083772.png" alt="filipniziol_1-1726005083772.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can find your log messages by filtering for "INFO __main__:":&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_2-1726005297744.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11060iB6248219762A1479/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_2-1726005297744.png" alt="filipniziol_2-1726005297744.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;H3&gt;&lt;FONT size="4"&gt;Logging to Cloud Storage:&lt;/FONT&gt;&lt;/H3&gt;&lt;P&gt;For more persistent or remote access to logs, you can configure the logger to write directly to a cloud storage location such as AWS S3, Azure Blob Storage, or Google Cloud Storage. This is useful for capturing logs in a centralized location, especially for production pipelines.&lt;BR /&gt;&lt;BR /&gt;You need a connection to the cloud storage, and then you add a custom handler to the logger. For Azure Blob Storage, the code would look like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import logging
from azure.storage.blob import BlobServiceClient

# Azure Storage configuration
storage_account_name = "my_storage_account"
container_name = "logs"
blob_name = "dlt-logs.log"
connection_string = "DefaultEndpointsProtocol=https;AccountName=my_storage_account;AccountKey=&amp;lt;your-storage-account-key&amp;gt;;EndpointSuffix=core.windows.net"

# Initialize BlobServiceClient 
blob_client = BlobServiceClient.from_connection_string(connection_string).get_blob_client(container=container_name, blob=blob_name)

# Log handler
class AzureBlobHandler(logging.Handler):
    def __init__(self, blob_client):
        super().__init__()
        self.blob_client = blob_client

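    def emit(self, record):
        msg = self.format(record) + "\n"
        try:
            # Append rather than overwrite: upload_blob(..., overwrite=True) would
            # replace the blob on every call and keep only the last message.
            # Assumes the blob is, or can be created as, an append blob.
            if not self.blob_client.exists():
                self.blob_client.create_append_blob()
            self.blob_client.append_block(msg.encode("utf-8"))
        except Exception:
            # Do not let a logging failure break the pipeline
            self.handleError(record)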

# Configure the logger
logger = logging.getLogger("DLTLogger")
logger.setLevel(logging.INFO)
azure_blob_handler = AzureBlobHandler(blob_client)
azure_blob_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(azure_blob_handler)

# Example usage remains the same
logger.info("This is an info message logged to Azure Blob Storage.")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Sep 2024 22:06:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/89383#M37776</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-10T22:06:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/89389#M37778</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;A href="https://community.databricks.com/t5/user/viewprofilepage/user-id/114079" target="_blank" rel="noopener"&gt;@ruoyuqian&lt;/A&gt;,&amp;nbsp;&lt;A href="https://community.databricks.com/t5/user/viewprofilepage/user-id/11193" target="_blank" rel="noopener"&gt;@kranthi2&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Why print() Statements Won’t Work in DLT:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;In Databricks Delta Live Tables (DLT), the output of print() statements is not shown; only the pipeline events are visible.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Alternative Solution: Using Log4j to Log to the Driver Log&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;To log information during a DLT pipeline run, you can get a Log4j logger through the Spark JVM gateway (spark._jvm) and write to the driver logs. Here is an example of how to set this up within a DLT pipeline:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
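from pyspark.sql.functions import col

# Get a Log4j logger through the py4j JVM gateway.
# Note: replies later in this thread report that this call fails with
# PY4J_BLOCKED_API on serverless DLT.
log4jLogger = spark._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)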

@dlt.table(
    comment="This is the raw data from the sample source table."
)
def read_source_data():
    # Log the start of reading data
    logger.info("Reading data from the source table.")
    
    # Read data from the source table
    df = spark.table("sample_source")
    
    # Log the schema and number of rows read
    logger.info(f"Schema of the source table: {df.schema.simpleString()}")
    logger.info(f"Number of rows read: {df.count()}")
    
    return df

@dlt.table(
    comment="This table contains transformed data."
)
def transform_data():
    logger.info("Transforming data from the source table.")
    
    # Read the raw data and apply a transformation
    df = dlt.read("read_source_data").withColumn("value_doubled", col("value") * 2)
    
    # Log transformation completion
    logger.info(f"Transformation completed. Output schema: {df.schema.simpleString()}")
    
    return df&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After running the DLT pipeline, navigate to the driver log and download the log file:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1726007640812.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11062i2B7D38B03F04B6A4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1726007640812.png" alt="filipniziol_0-1726007640812.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can find your log messages by filtering for "INFO __main__:":&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_1-1726007686193.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11063i838595B74B79C199/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_1-1726007686193.png" alt="filipniziol_1-1726007686193.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Logging to Cloud Storage:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;For more persistent or remote access to logs, you can configure the logger to write directly to a cloud storage location such as Azure Blob Storage.&lt;BR /&gt;&lt;BR /&gt;You need a connection to the cloud storage, and then you add a custom handler to the logger. The code would look like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import logging
from azure.storage.blob import BlobServiceClient

# Azure Storage configuration
storage_account_name = "my_storage_account"
container_name = "logs"
blob_name = "dlt-logs.log"
connection_string = "DefaultEndpointsProtocol=https;AccountName=my_storage_account;AccountKey=&amp;lt;your-storage-account-key&amp;gt;;EndpointSuffix=core.windows.net"

# Create a client for the target blob
blob_client = BlobServiceClient.from_connection_string(connection_string).get_blob_client(container=container_name, blob=blob_name)

# Custom log handler
class AzureBlobHandler(logging.Handler):
    def __init__(self, blob_client):
        super().__init__()
        self.blob_client = blob_client

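    def emit(self, record):
        msg = self.format(record) + "\n"
        try:
            # Append rather than overwrite: upload_blob(..., overwrite=True) would
            # replace the blob on every call and keep only the last message.
            # Assumes the blob is, or can be created as, an append blob.
            if not self.blob_client.exists():
                self.blob_client.create_append_blob()
            self.blob_client.append_block(msg.encode("utf-8"))
        except Exception:
            # Do not let a logging failure break the pipeline
            self.handleError(record)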

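# Configure the logger and attach the custom handler
logger = logging.getLogger("DLTLogger")
logger.setLevel(logging.INFO)
azure_blob_handler = AzureBlobHandler(blob_client)
azure_blob_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(azure_blob_handler)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>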
      <pubDate>Tue, 10 Sep 2024 22:39:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/89389#M37778</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-10T22:39:57Z</dc:date>
    </item>
    <item>
      <title>Re: How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/113917#M44678</link>
      <description>&lt;P&gt;&lt;SPAN&gt;&amp;gt;&amp;gt; LogManager.getLogger() does not seem to work in a DLT notebook:&lt;BR /&gt;&lt;BR /&gt;DLTError: [PY4J_BLOCKED_API] You are using a Python API that is not supported in the current environment. Please check Databricks documentation for alternatives. An error occurred while calling z:org.apache.log4j.LogManager.getLogger&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 28 Mar 2025 15:51:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/113917#M44678</guid>
      <dc:creator>iooj</dc:creator>
      <dc:date>2025-03-28T15:51:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/120448#M46169</link>
      <description>&lt;P&gt;Can confirm what &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155712"&gt;@iooj&lt;/a&gt; found; I get the same error using serverless DLT version dlt:16.1.8-delta-pipelines-photon-dlt-release-dp-2025.20-rc0-commit-fcedf0a-image-8aadc5c. This did work on non-serverless compute with an older version of DLT. Perhaps&amp;nbsp;Databricks has another way? I'll post here if I find something from support.&lt;/P&gt;</description>
      <pubDate>Wed, 28 May 2025 15:21:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/120448#M46169</guid>
      <dc:creator>_DatabricksUser</dc:creator>
      <dc:date>2025-05-28T15:21:18Z</dc:date>
    </item>
    <item>
      <title>Re: How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/120957#M46289</link>
      <description>&lt;P&gt;One option is to emit logs to stdout/stderr:&lt;/P&gt;&lt;P&gt;The sample code below worked on a Unity Catalog (UC) DLT cluster running&lt;FONT color="#0000FF"&gt;&lt;EM&gt;&amp;nbsp;dlt:16.4.0-delta-pipelines-photon-dlt-release-dp-2025.20-rc0-commit-fcedf0a-image-be34de2&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
import logging
import sys

from pyspark.sql.functions import col
from utilities import utils  # project-specific helper module (provides distance_km)

# Configure Python logging to stdout
logger = logging.getLogger("DLTLogger")
logger.setLevel(logging.INFO)

handler = logging.StreamHandler(sys.stdout) # Change to sys.stderr for stderr
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)

# Avoid duplicate handlers if rerun in notebook
if not logger.handlers:
  logger.addHandler(handler)
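
# Optional sketch (an assumption, not part of the original post): keep DEBUG/INFO
# on stdout and send WARNING and above to stderr, so problems stand out in the
# driver logs. The logger name "DLTLoggerSplit" and the variable names below are
# hypothetical.
import logging
import sys

split_logger = logging.getLogger("DLTLoggerSplit")
split_logger.setLevel(logging.INFO)

if not split_logger.handlers:
  split_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

  stdout_handler = logging.StreamHandler(sys.stdout)
  # A filter callable that returns False drops the record for this handler only
  stdout_handler.addFilter(lambda record: record.levelno in (logging.DEBUG, logging.INFO))
  stdout_handler.setFormatter(split_formatter)

  stderr_handler = logging.StreamHandler(sys.stderr)
  stderr_handler.setLevel(logging.WARNING)
  stderr_handler.setFormatter(split_formatter)

  split_logger.addHandler(stdout_handler)
  split_logger.addHandler(stderr_handler)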

@dlt.table
def sample_trips_dlt_logging_test():
  logger.info("dlt_logging## Reading sample trips data from Delta table.")
  df = spark.read.table("samples.nyctaxi.trips")

  logger.info(f"dlt_logging## Schema of the sample trips data: {df.schema.simpleString()}")
  # Note: df.count() is a Spark action, so it runs a separate job just to produce this log line
  logger.info(f"dlt_logging## Number of rows read: {df.count()}")

  df = df.withColumn("trip_distance_km", utils.distance_km(col("trip_distance")))
  logger.info("dlt_logging## Added trip_distance_km column to the sample trips data.")

  return df&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jun 2025 17:14:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/120957#M46289</guid>
      <dc:creator>User16871418122</dc:creator>
      <dc:date>2025-06-04T17:14:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to print out logs during DLT pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/123786#M47067</link>
      <description>&lt;P&gt;Can confirm what &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/29860"&gt;@User16871418122&lt;/a&gt; reported. That was what Databricks support recommended. There are two caveats with it:&lt;BR /&gt;&lt;BR /&gt;1. Logs will be emitted twice: once during lazy validation and once during execution.&lt;BR /&gt;2. Logs will not necessarily continue to be emitted on subsequent executions. Per Databricks engineering, this is likely to be the case for streaming tables. This implies that logging this way is only effective for debugging initial code and not necessarily for the long term.&lt;BR /&gt;&lt;BR /&gt;The workaround for the above is to use event hooks (&lt;A href="https://docs.databricks.com/aws/en/dlt/event-hooks" target="_blank"&gt;https://docs.databricks.com/aws/en/dlt/event-hooks&lt;/A&gt;). Read the docs, but from what I'm seeing they come with their own caveats that may be more impactful for debug logging:&lt;BR /&gt;&lt;BR /&gt;1. Event hooks run asynchronously to the DLT pipeline execution. Databricks suggests including execution timestamps in the logs to help correlate pipeline events with log messages.&lt;BR /&gt;2. Event hooks only log as long as the DLT cluster is running. In other words, if the DLT cluster finishes before the event hooks finish, the event hooks are terminated prematurely. No workaround for this was provided.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jul 2025 20:10:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-print-out-logs-during-dlt-pipeline-run/m-p/123786#M47067</guid>
      <dc:creator>_DatabricksUser</dc:creator>
      <dc:date>2025-07-02T20:10:41Z</dc:date>
    </item>
  </channel>
</rss>

