12-27-2023 01:42 AM
Background.
I have created a DLT pipeline in which I am creating a temporary table. There are 5 temporary tables like this. When I executed them in an independent notebook, they all worked fine with DLT.
Now I have merged this notebook (keeping the exact same code) with the other bronze layer notebook.
Each temporary table is in a separate cell. But with this consolidated notebook I am getting the above-mentioned error.
So I am not sure what the issue is here: if DLT code worked independently earlier, why would it fail when combined with the other bronze layer code?
This is happening because of a limitation in DLT where we can add multiple notebooks but cannot set up a sequence.
Thanks
12-27-2023 05:01 AM
Please use the Workflows and Jobs option and associate the respective notebooks with the job tasks in order to run them sequentially.
https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html
This functionality is also available for DLT tables, though I have not used it with DLT tables.
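For illustration, here is a minimal sketch of sequencing two notebooks as dependent job tasks with the Databricks Python SDK; the job name, notebook paths, and cluster ID below are placeholders, not taken from this thread.

# Sketch only: job name, notebook paths, and cluster ID are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="bronze-then-silver",
    tasks=[
        jobs.Task(
            task_key="bronze",
            existing_cluster_id="<cluster-id>",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/path/bronze_notebook"),
        ),
        jobs.Task(
            task_key="silver",
            # "silver" only starts after the "bronze" task succeeds.
            depends_on=[jobs.TaskDependency(task_key="bronze")],
            existing_cluster_id="<cluster-id>",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/path/silver_notebook"),
        ),
    ],
)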
12-27-2023 05:09 AM
You probably messed up your code, or alternatively, the runtime has been upgraded and it stopped working.
DLT is 100% declarative and never runs in any kind of sequence; instead, it figures out the dependencies between tables and builds the execution DAG (putting code in separate notebooks is just a way of keeping your code clean).
Maybe you can attach your notebook and a screenshot of the error in the DLT pipeline.
There is also another thing you can try: delete the existing DLT pipeline and create a fresh one pointing at the same notebooks. If it runs okay, there is a chance that DLT bugged out.
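To illustrate the declarative model, a minimal sketch (the table names, column, and path are made up): DLT builds the execution DAG from references such as dlt.read, so the order of cells or notebooks has no effect.

import dlt
from pyspark.sql.functions import col

# Sketch only: table names, column, and path are made up.
@dlt.table(name="orders_clean")
def orders_clean():
    # The dlt.read("orders_raw") reference below is what makes DLT run
    # orders_raw before orders_clean, regardless of cell or notebook order.
    return dlt.read("orders_raw").filter(col("amount") > 0)

@dlt.table(name="orders_raw")
def orders_raw():
    # Defined after its consumer in the source, yet it still runs first.
    return spark.read.json("/path/to/raw/orders")  # 'spark' is predefined in DLT notebooks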
12-27-2023 05:54 AM
My notebook is like below, and there are 5 such tables, varying only by table name. I had this earlier in a separate notebook and it ran very well. But now that I have merged it with the bronze layer notebook, I am running into issues.
@Dlt.table(
    name="Temp Table",
    table_properties={"quality" : "silver"},
    Temporary=True
)
@Dlt.expect_all(rules)
def Temp_Table():
    return (
        spark.sql("SELECT * FROM bronze_layer_table")
        .withColumn("is_bad_data", expr(quarantine_rules))
    )

@Dlt.table(
    name="Clean Table",
    table_properties={"quality" : "silver"}
)
def get_clean_data():
    return (
        dlt.read("Temp Table")
        .filter("is_bad_data=false")
    )

@Dlt.table(
    name="Bad Data",
    table_properties={"quality" : "silver"}
)
def get_bad_data():
    return (
        dlt.read("Temp Table")
        .filter("is_bad_data=true")
    )
Error
Failed to resolve flow due to upstream failure.
Failed to read dataset 'Temp Table'. Dataset is not defined in the pipeline.
12-27-2023 07:12 AM
I don't believe your code was working before, at least not the version you pasted above with spaces in the table names, as DLT throws errors that it could not register them.
I made some corrections to the code and it works OK.
I added underscores "_" to the table names, changed the decorators from "@Dlt" to "@dlt", and changed "Temporary" to "temporary".
I had to drop your expectations and fake one column with a static value.
import dlt
from pyspark.sql.functions import lit

@dlt.table(
    name="Temp_Table",
    table_properties={"quality" : "silver"},
    temporary=True
)
def Temp_Table():
    return (
        spark.sql("SELECT * FROM priv_wojciech_bukowski.dss_gold.dim_dss_date")
        .withColumn("is_bad_data", lit('xxx'))
    )

@dlt.table(
    name="Clean_Table",
    table_properties={"quality" : "silver"}
)
def get_clean_data():
    return (
        dlt.read("Temp_Table")
        .filter("is_bad_data='xxx'")
    )
12-27-2023 09:42 AM
Please note the code had worked earlier when I was running it via a separate notebook; these errors are just typos.
Assuming the code has no syntax issues, what could go wrong with the same code when it is appended below the bronze layer notebook so that there is just one notebook instead of two?
12-27-2023 09:57 AM
Maybe you materialized the table and later substituted it with a temporary table (just a guess).
There were some changes recently so that DLT now has only streaming tables and materialized views (with the legacy syntax still allowed), so there is a chance that something ran on a certain version of the DLT pipeline and is not working on a new version (and you don't have control over the version).
Again, that is just a guess, as I did not see your code and pipelines before and after the changes, or the artifacts created by DLT.
When you merge code and pipelines there is always a chance something goes wrong, especially in DLT, as you don't have control over objects like in the classic approach.
I could not replicate your issue with the code that you provided.
12-28-2023 12:29 AM
Hello
I dropped all existing objects, deleted the old DLT pipeline, and created a new one with the same name, but the same problem is seen.
Not sure why it complains about temporary tables, as those would be created at runtime. I even tried removing the temporary flag, but I get the same problems. Not sure what's wrong here; I am running out of options.
12-28-2023 01:06 AM
You can attach your notebook with the code here.
I did not see your code or the full trace.
If I were you, I would find the exact line of code where you get the error, remove that entire dlt table section, and check whether the rest works.
Then I would add it back, trying to resolve the errors one by one; maybe you will find a pattern.
But you can also attach your code (as a file), so someone can import it and help you.
01-02-2024 12:36 AM
The issue is fixed now. I used the LIVE qualifier for all the tables I referenced, and then it started working.
Thanks for all your help.
Thanks
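For reference, a minimal sketch of the fix described above, following the earlier snippets in this thread (the quarantine rule below is a placeholder): datasets defined in the same pipeline are referenced through the LIVE schema in SQL, or through dlt.read, so that DLT can resolve them in its dependency graph instead of looking for an external table.

import dlt
from pyspark.sql.functions import expr

# Placeholder standing in for the thread's 'quarantine_rules' expression.
quarantine_rules = "some_column IS NULL"

@dlt.table(
    name="Temp_Table",
    table_properties={"quality": "silver"},
    temporary=True
)
def temp_table():
    return (
        # LIVE.bronze_layer_table marks the bronze table as a dataset defined in
        # this same pipeline rather than an external catalog table.
        spark.sql("SELECT * FROM LIVE.bronze_layer_table")
        .withColumn("is_bad_data", expr(quarantine_rules))
    )

@dlt.table(
    name="Clean_Table",
    table_properties={"quality": "silver"}
)
def clean_table():
    # dlt.read also resolves against datasets defined in the same pipeline.
    return dlt.read("Temp_Table").filter("is_bad_data = false")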
12-28-2023 01:53 AM
Hello,
If I refer to the code you created above, then the error is as below, for each of the 5 temp tables: pyspark.errors.exceptions.AnalysisException: Failed to read dataset 'Temp_Table'. Dataset is not defined in the pipeline.
Below is the high-level flow of my DLT pipeline.
Step 1 - 5 bronze level tables are created and loaded from JSON files.
Step 2 - 5 temp tables are created from the 5 bronze tables (created in Step 1) with a derived Boolean bad-data flag.
Step 3 - 5 clean and 5 quarantine tables are created by separating good and bad data based on the bad-data flag.
Step 4 - 5 gold layer tables are created from the 5 clean tables created in Step 3.
Earlier I had a separate notebook for each step, which worked great. But when I combined all of these into one notebook I ran into issues which I am not able to understand. Each table is in a separate cell in all steps.
12-28-2023 02:51 AM
I am sorry, but the information you are providing is not helping at all.
Please dump your code here.