Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

DLT Pipeline issue - Failed to read dataset. Dataset is not defined in the pipeline.

Dlt
New Contributor III

Background:

I have created a DLT pipeline in which I create a temporary table. There are 5 such temporary tables. When I executed these in an independent notebook, they all worked fine with DLT.

Now I have merged this notebook (keeping the exact same code) with another bronze-layer notebook.

Each temporary table is in a separate cell. But with this consolidated notebook I am getting the above-mentioned error.

So I am NOT sure what the issue is here: if the DLT code worked independently earlier, why would it fail when combined with other bronze-layer code?

Is this happening due to a feature limitation of DLT, where we can add multiple notebooks but cannot set up a sequence?

Thanks

11 REPLIES

BR_DatabricksAI
Contributor

Please use the Workflows jobs option and associate the respective notebooks with the job in order to run them sequentially.

https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html

This functionality is also available for DLT pipelines, though I have not used it with DLT tables.
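As a rough sketch, a multi-task job with an explicit sequence looks something like this in the Jobs JSON (the task keys and notebook paths below are made up for illustration):

```json
{
  "name": "bronze-then-silver",
  "tasks": [
    {
      "task_key": "bronze",
      "notebook_task": { "notebook_path": "/Repos/example/bronze_notebook" }
    },
    {
      "task_key": "silver",
      "depends_on": [ { "task_key": "bronze" } ],
      "notebook_task": { "notebook_path": "/Repos/example/silver_notebook" }
    }
  ]
}
```

The depends_on field is what enforces the ordering between tasks.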

Wojciech_BUK
Valued Contributor III

You probably messed up your code, or alternatively, the runtime has been upgraded, and it stopped working.

DLT is 100% declarative and never runs in any kind of sequence; instead, it is figuring out dependencies between tables and setting the execution DAG (putting code in separate notebooks is just a way of keeping your code clean).
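That DAG behavior can be illustrated with a toy resolver in plain Python (this is not DLT's implementation, and the table names are made up — it only shows that definition order does not matter, references do):

```python
# Toy illustration of declarative dependency resolution.
# Cell/definition order does not matter -- only the references
# between datasets do, just as in a DLT pipeline.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each dataset to the datasets it reads from, listed in the
# (arbitrary) order the cells happen to appear in the notebook.
definitions = {
    "clean_table": {"temp_table"},   # defined first in the notebook...
    "temp_table": {"bronze_table"},  # ...but it depends on this one
    "bronze_table": set(),           # source table, no dependencies
}

# The resolver orders execution by dependency, not by cell order.
execution_order = list(TopologicalSorter(definitions).static_order())
print(execution_order)  # ['bronze_table', 'temp_table', 'clean_table']
```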

Maybe you can attach your notebook and a screenshot of the error in the DLT pipeline.

There is also another thing you can try:

  1. Create a new DLT pipeline.
  2. Target a new schema.
  3. Put your final bronze notebook.
  4. Run.

If it runs okay, there is a chance that DLT bugged out.

Dlt
New Contributor III

My notebook is like below, and there are 5 such blocks, varying only by table. I had this earlier in a separate notebook and it ran very well. But now I have merged it with the bronze-layer notebook and I ran into issues.

@Dlt.table(
    name="Temp Table",
    table_properties={"quality": "silver"},
    Temporary=True
)
@Dlt.expect_all(rules)
def Temp_Table():
    return (
        spark.sql("SELECT * FROM bronze_layer_table")
        .withColumn("is_bad_data", expr(quarantine_rules))
    )

@Dlt.table(
    name="Clean Table",
    table_properties={"quality": "silver"}
)
def get_clean_data():
    return (
        dlt.read("Temp Table")
        .filter("is_bad_data=false")
    )

@Dlt.table(
    name="Bad Data",
    table_properties={"quality": "silver"}
)
def get_bad_data():
    return (
        dlt.read("Temp Table")
        .filter("is_bad_data=true")
    )

Error:

Failed to resolve flow due to upstream failure. 

Failed to read dataset 'Temp Table'. Dataset is not defined in the pipeline.

Wojciech_BUK
Valued Contributor III

I don't believe your code was working before, at least the version you pasted above that has spaces in the table names; DLT threw me errors that it could not register them.
I made some corrections to the code and it works OK.

I added underscores "_" to the table names, changed the decorators "@Dlt" to "@dlt", and changed "Temporary" to "temporary".

I had to drop your expectations and fake one column with a static value.

[screenshot attached: the corrected pipeline]

import dlt
from pyspark.sql.functions import lit

@dlt.table(
    name="Temp_Table",
    table_properties={"quality": "silver"},
    temporary=True
)
def Temp_Table():
    return (
        spark.sql("SELECT * FROM priv_wojciech_bukowski.dss_gold.dim_dss_date")
        .withColumn("is_bad_data", lit('xxx'))
    )


@dlt.table(
    name="Clean_Table",
    table_properties={"quality": "silver"}
)
def get_clean_data():
    return (
        dlt.read("Temp_Table")
        .filter("is_bad_data='xxx'")
    )

Dlt
New Contributor III

Please note the code had worked earlier when I was running it via a separate notebook; these errors are just typos.

Assuming the code has no syntax issues, what would go wrong with the same code when it is placed below the bronze-layer notebook code, to have just one notebook instead of two?

Wojciech_BUK
Valued Contributor III

Maybe you materialized the table and later substituted it with a temporary table (just a guess).

There were some recent changes so that DLT only has tables and materialized views while still allowing the legacy syntax, so there is a chance that something ran on a certain version of the DLT pipeline and is not working on a new version (and you don't have control over the version).

Again, that is just a guess, as I did not see your code and pipelines before and after the changes, or the artifacts created by DLT.

When you merge code and pipelines there is always a chance something goes wrong, especially in DLT, as you don't have control over objects like in the classic approach 😕

I could not replicate your issue with the code you provided.

Dlt
New Contributor III

Hello,

I dropped all existing objects, deleted the old DLT pipeline, and created a new one with the same name, but the same problem is seen.

I am not sure why it complains about temporary tables; those would be created at runtime. I even tried removing the temporary flag, but the problem remains. Not sure what's wrong here; I am running out of options.

Wojciech_BUK
Valued Contributor III

You can attach your notebook with the code here.
I have not seen your code or the full stack trace.

If I were you, I would find the exact line of code where you get the error, remove that entire dlt table section, and check if the rest works.
Then I would add it back, trying to resolve the errors one by one; maybe you will find a pattern.
But you can also attach your code (as a file), so someone can import it and help you.

Dlt
New Contributor III

The issue is fixed now. I tried using the live qualifier for all the tables I referenced, and then it started working.

Thanks for all your help.
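For anyone hitting the same error: in a DLT pipeline, an unqualified table name inside spark.sql() (e.g. SELECT * FROM bronze_layer_table) is resolved against the catalog, not against the datasets defined in the pipeline, so once the bronze tables moved into the same pipeline the reads no longer linked up; qualifying them as live.bronze_layer_table (or using dlt.read) ties the read back to the pipeline. A toy name resolver in plain Python (not Databricks code; the dataset names are made up) sketches the distinction:

```python
# Toy model of the "live." qualifier: names prefixed with "live." are
# resolved against datasets defined in the current pipeline; anything
# else is treated as an external catalog read. This only sketches the
# behavior -- it is not how Databricks implements it.
pipeline_datasets = {"bronze_layer_table", "temp_table"}

def resolve(name: str) -> str:
    if name.lower().startswith("live."):
        target = name.split(".", 1)[1]
        if target not in pipeline_datasets:
            # Mirrors the error discussed in this thread.
            raise LookupError(f"Dataset '{target}' is not defined in the pipeline.")
        return f"pipeline:{target}"
    # Unqualified: read from the catalog; no pipeline dependency is created.
    return f"catalog:{name}"

print(resolve("live.bronze_layer_table"))  # pipeline:bronze_layer_table
print(resolve("bronze_layer_table"))       # catalog:bronze_layer_table
```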

Dlt
New Contributor III

Hello,

If I refer to the code you created above, then the error is like below, for each of the 5 temp tables:

pyspark.errors.exceptions.AnalysisException: Failed to read dataset 'Temp_Table'. Dataset is not defined in the pipeline.

Below is the high-level flow of my DLT pipeline.

Step 1 - 5 bronze-level tables are created and loaded from JSON files.

Step 2 - 5 temp tables are created from the 5 bronze tables (created in Step 1), with a derived Boolean bad-data flag.

Step 3 - 5 clean and 5 quarantine tables are created by separating good and bad data based on the bad-data flag.

Step 4 - 5 gold-layer tables are created from the 5 clean tables created in Step 3.

Earlier I had a separate notebook for each step, which worked great. But when I combined all of these into one notebook, I ran into issues which I am NOT able to understand. Each table is in a separate cell in all steps.

Wojciech_BUK
Valued Contributor III

I am sorry, but the information you are providing is not helping at all.
Please dump your code here.
