Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT and Modularity (best practices?)

jeremy1
New Contributor II

I have [very] recently started using DLT for the first time. One of the challenges I have run into is how to include other "modules" within my pipelines. I missed the part of the documentation saying that magic commands (with the exception of %pip) are ignored, and was unpleasantly surprised when running the workflow for the first time.

What is the best practice for including common modules within workflows?

In my particular case, what I would like to do is create a separate module that can dynamically generate a dict of expectations for a specific table, and I definitely do not want to include this in all of my notebooks (DRY). Any ideas, suggestions, or best practices for a newbie on how to accomplish this?

Thanks for the help and guidance!

10 REPLIES

User16764241763
Honored Contributor

Hello @Jeremy Colson, thank you for reaching out to the Databricks Community Forum.

Could you please give this a try if you already have a repo linked in the workspace?

I think Engineering is working on some improvements on this front.

https://docs.databricks.com/repos/index.html


The code snippet below shows a simple example. You can implement your own logic and try to import it in the DLT pipeline.

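import sys

# Make the repo folder importable by adding its absolute path to sys.path.
sys.path.append("/Workspace/Repos/arvind.ravish@databricks.com/arvindravish/dlt_import")

# my_file.py lives in that folder and defines myClass.
from my_file import myClass

newClass = myClass(5)
val = newClass.getVal()
print(val * 5)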
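
The post does not show my_file.py itself. As a minimal sketch, assuming it is an arbitrary .py file (not a notebook) in the dlt_import folder, a class compatible with the calls above might look like the following. The attribute name and the behavior of getVal are guesses for illustration only.

```python
# my_file.py (hypothetical contents; the original post does not show this file)

class myClass:
    """Tiny value holder matching the calls made in the snippet above."""

    def __init__(self, val):
        # Store the value passed to the constructor (assumed behavior).
        self._val = val

    def getVal(self):
        # Return the stored value unchanged (assumed behavior).
        return self._val
```

With this definition, the snippet above would print the stored value multiplied by 5.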
 

Please provide your feedback so we can add any improvements to our product backlog.

Hi Arvind,

I set up the configuration as per your description, but it still fails.

Here are my screenshots:

[three images]

Can you please suggest how to proceed?

Best regards

Ruben

@Ruben Hartenstein notice the difference between the icons in your screenshot (they are notebooks) and the icons in Arvind's post. You need to use this menu option to create an arbitrary file, not a notebook:

[image.png]

hardy1982
New Contributor II

Hi @Greg Galloway,

thank you very much for your reply. I followed your suggestion, but now I face another error when executing the pipeline.

Can you please advise?

Best regards

Ruben

@Ruben Hartenstein I don't see any @dlt.table mentions in your code. I'm assuming that error means the pipeline evaluated your code and didn't find any DLT tables in it. Maybe study a few of the samples as a template?

hardy1982
New Contributor II

@Greg Galloway

Is there no way to run the pipeline without @dlt?

I just want to use Hive tables in my code.

If you don't want any Delta Live Tables, then just use the Jobs tab under the Workflows tab. Or there are plenty of other ways of just running a notebook in whatever orchestration tool you use (e.g. Azure Data Factory, etc.) @Ruben Hartenstein

Kaniz_Fatma
Community Manager

Hi @Jeremy Colson, We haven't heard from you on the last response from @Arvind Ravish, and I was checking back to see if you have a resolution yet. If you have any solution, please share it with the community as it can be helpful to others. Otherwise, we will respond with more details and try to help.

Greg_Galloway
New Contributor III

I like the approach @Arvind Ravish shared since you can't currently use %run in DLT pipelines. However, it took a little testing to be clear on how exactly to make it work.

First, ensure in the Admin Console that the repos feature is configured as follows:

[image.png]

Then create a new arbitrary file named Import.py using the following menu option. (Note: it does not work with a notebook.)

[image: create arbitrary file]

The file should contain code like the following:

MYVAR1 = "hi"
MYVAR2 = 99
MYVAR3 = "hello"
 
def factorial(num):
    fact=1
    for i in range(1,num+1):
        fact = fact*i
    return fact

In the DLT notebook, the following code loads Import.py and executes the Python code in it. Then MYVAR1, MYVAR2, MYVAR3, and the factorial function will be available for reference downstream in the pipeline.

import pyspark.sql.functions as f
 
txt = spark.read.text("file:/Workspace/Repos/FolderName/RepoName/Import.py") 
 
#concatenate all lines of the file into a single string
singlerow = txt.agg(f.concat_ws("\r\n", f.collect_list(txt.value)))
data = "\r\n".join(singlerow.collect()[0])
 
#execute that string of python
exec(data)

This appears to work in both Current and Preview channel DLT pipelines at the moment.
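
For comparison, the same load-and-exec idea can be sketched with plain Python file I/O instead of spark.read.text. This is only a sketch: whether open() can read /Workspace/Repos paths inside a DLT pipeline is an assumption that has not been verified here, and load_module_source is a hypothetical helper name. The demo writes a stand-in Import.py to a temporary folder so the snippet runs anywhere.

```python
import os
import tempfile


def load_module_source(path, namespace):
    """Read a .py file and execute it into the given namespace dict."""
    with open(path, "r", encoding="utf-8") as fh:
        source = fh.read()
    # compile() with the real path gives clearer tracebacks than exec(str).
    exec(compile(source, path, "exec"), namespace)
    return namespace


# Demo: create a stand-in Import.py in a temp folder and load it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Import.py")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write(
            "MYVAR2 = 99\n"
            "def factorial(num):\n"
            "    fact = 1\n"
            "    for i in range(1, num + 1):\n"
            "        fact = fact * i\n"
            "    return fact\n"
        )
    ns = load_module_source(demo_path, {})
```

Passing globals() instead of an empty dict would make the loaded names available directly in the calling notebook, which is what the exec(data) call above does.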

Unfortunately, the os.getcwd() command doesn't appear to be working in DLT pipelines (as it returns /databricks/driver even when the DLT pipeline notebook is in a Repo) so I haven't figured out a way to use a relative path even if your calling notebook is also in Repos. The following currently fails and Azure support case 2211240040000106 has been opened:

import os
import pyspark.sql.functions as f
 
txt = spark.read.text(f"file:{os.getcwd()}/Import.py") 
 
#concatenate all lines of the file into a single string
singlerow = txt.agg(f.concat_ws("\r\n", f.collect_list(txt.value)))
data = "\r\n".join(singlerow.collect()[0])
 
#execute that string of python
exec(data)

I'm also having trouble using the import example from aravish without using a hardcoded path like sys.path.append("/Workspace/Repos/TopFolder/RepoName") when running in a DLT pipeline. The aravish approach is useful if you want to import function definitions but not execute any Python code and not define any variables which will be visible in the calling notebook's Spark session.
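
For the original question (a shared module that produces a dict of expectations per table), an importable file could be sketched as below. The file name expectations.py, the table names, and the rules are all hypothetical; only the shape (expectation name mapped to a SQL constraint string) follows what DLT expectation decorators take.

```python
# expectations.py (hypothetical module name and rule set, for illustration only)

# Rules keyed by table name; each rule maps an expectation name
# to a SQL constraint string.
_RULES = {
    "customers": {
        "valid_id": "customer_id IS NOT NULL",
        "valid_email": "email LIKE '%@%'",
    },
    "orders": {
        "valid_order_id": "order_id IS NOT NULL",
        "positive_amount": "amount > 0",
    },
}


def get_expectations(table_name):
    """Return a copy of the expectation dict for table_name (empty if unknown)."""
    return dict(_RULES.get(table_name, {}))
```

In the DLT notebook, after the sys.path.append call, this could be used as from expectations import get_expectations and then passed to a decorator such as @dlt.expect_all(get_expectations("orders")), assuming the import approach works in your workspace.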

Note: Edited from a previous post where I made a few mistakes.

My setup is two workspaces, one dev and one prod. The Repos folder that the DLT pipelines run from is called dev in the dev workspace and prod in the prod workspace. I use secret scopes to retrieve the appropriate text to build the path for the current environment, then call sys.path.append(some/path/to/my.py) so I can do from my.py import method/class.
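
A rough sketch of that per-environment path setup follows. The helper names, the repo name, and the secret scope and key shown in the comment are hypothetical; in a Databricks notebook the folder name would come from dbutils.secrets.get, which is not available outside Databricks, so the demo passes the value in directly.

```python
import sys


def repo_module_path(env_folder, repo_name):
    """Build the absolute Repos path for the given environment folder."""
    return f"/Workspace/Repos/{env_folder}/{repo_name}"


def add_repo_to_path(env_folder, repo_name):
    """Append the repo folder to sys.path once and return the path used."""
    path = repo_module_path(env_folder, repo_name)
    if path not in sys.path:
        sys.path.append(path)
    return path


# In Databricks the folder name might be read from a secret, for example:
#   env_folder = dbutils.secrets.get(scope="env-config", key="repos-folder")
# (scope and key names are made up for this sketch).
env_folder = "dev"
added = add_repo_to_path(env_folder, "RepoName")
```

Note that sys.path entries should be folders rather than individual .py files, so the sketch appends the repo folder and leaves the import statement to name the module.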
