DLT: Create Empty Table If Autoloader Fails

DaPo
New Contributor II

Hi all,

I am quite new to Databricks. Overall I have enjoyed the experience so far, but I have now run into a problem for which I could not find an acceptable solution.

Here is my setup: I have a bunch of S3 buckets and need to ingest their data into Databricks, preferably via a DLT pipeline. Each bucket contains a directory with data files and possibly a directory with error logs produced by an application, so the structure looks something like this:

  • s3://<bucket-name>/data/...
  • s3://<bucket-name>/errors/...
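
For reference, reading the data files works with a table along these lines (a sketch only: the options dict below is a placeholder of mine, not my real Auto Loader config):

import dlt

# Auto Loader options used in all snippets below (placeholder values;
# the real config sets the file format, schema hints, etc.).
options = {"cloudFiles.format": "json"}

@dlt.table
def data_table():
    return spark.readStream.format("cloudFiles").options(**options).load("s3://<bucket-name>/data/")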

Creating the table for the data files (sketched above) was not a problem. Now I wanted to do the same for the errors. The problem is that some of the buckets do not contain the /errors directory. Therefore, the straightforward
@dlt.table
def error_table():
    return spark.readStream.format("cloudFiles").options(**options).load("s3://<bucket-name>/errors/")

does not work: the whole pipeline fails on initialization, so even the other tables in the pipeline do not update. What I want instead: if the path does not exist, create an empty table (the schema is known).

What I tried so far:

@dlt.table
def error_table():
    try:
        return spark.readStream.format("cloudFiles").options(**options).load("s3://<bucket-name>/errors")
    except Exception:
        return spark.createDataFrame(data=[], schema=error_schema)


Due to the lazy nature of Spark, this does not work (at least for me): nothing is evaluated while the function body runs, so no exception is raised inside the try block. The missing path only surfaces later, when DLT actually initializes the stream, and by then the except branch can no longer catch it.

@dlt.table
def error_table():
    return spark.readStream.format("cloudFiles").options(**options).option("pathGlobFilter", "/errors*").load("s3://<bucket-name>")

This seems to work, but it apparently scans all the data files as well, presumably because the stream is rooted at the bucket and every file has to be listed before the filter applies. That is quite inefficient and not acceptable: for me it took more than 30 minutes, even when there were no errors at all.
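
The direction I am currently considering is to check for the path eagerly, while DLT builds the pipeline graph, and only then decide what to return. This is only a sketch: error_schema and the helper path_exists are mine, I am assuming dbutils.fs.ls is usable inside a DLT pipeline, and I have not verified how DLT handles a table that is a stream on some updates and a static DataFrame on others:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Known schema of the error logs (field names here are placeholders).
error_schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("message", StringType()),
])

def path_exists(path: str) -> bool:
    # dbutils.fs.ls raises for a missing path. Unlike the try/except
    # around readStream above, this check runs eagerly, at pipeline
    # graph construction time, not lazily at stream start.
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

@dlt.table
def error_table():
    path = "s3://<bucket-name>/errors/"
    if path_exists(path):
        return spark.readStream.format("cloudFiles").options(**options).load(path)
    # Fall back to an empty DataFrame with the known schema, so the table
    # exists and the rest of the pipeline keeps updating.
    return spark.createDataFrame([], schema=error_schema)

Is something along these lines reasonable, or is there a more idiomatic way to handle optional source directories in DLT?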
