DLT and Modularity (best practices?)

jeremy1
New Contributor II

I have [very] recently started using DLT for the first time. One of the challenges I have run into is how to include other "modules" within my pipelines. I missed the documentation noting that magic commands (with the exception of %pip) are ignored, and was unpleasantly surprised when running the workflow for the first time.

What is the best practice for including common modules within workflows?

In my particular case, what I would like to do is create a separate module that can dynamically generate a dict of expectations given a specific table... and I definitely do not want to include this in all of my notebooks (DRY). Any ideas/suggestions/best practices for a newbie on how to accomplish this?
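To make the idea concrete, here is a rough sketch of the kind of module I have in mind (the table names and rules are just placeholders):

# expectations.py (hypothetical module)
def get_expectations(table_name: str) -> dict:
    """Return a dict of expectation name -> SQL constraint for the given table."""
    rules = {
        "customers": {
            "valid_id": "customer_id IS NOT NULL",
            "valid_email": "email LIKE '%@%'",
        },
        "orders": {
            "valid_order_id": "order_id IS NOT NULL",
            "positive_amount": "amount > 0",
        },
    }
    return rules.get(table_name, {})

Each notebook would then just call something like @dlt.expect_all(get_expectations("customers")) instead of repeating the rules everywhere.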

Thanks for the help and guidance!

10 REPLIES

User16764241763
Honored Contributor

Hello @Jeremy Colson, thank you for reaching out to the Databricks Community Forum.

Could you please give this a try if you already have a Repo linked in the workspace?

I think Engineering is working on some improvements on this front.

https://docs.databricks.com/repos/index.html


The code snippet below shows a simple example. You can implement your own logic and try to import it in the DLT pipeline.

import sys

# Make the repo folder importable, then import from my_file.py in that folder
sys.path.append("/Workspace/Repos/arvind.ravish@databricks.com/arvindravish/dlt_import")

from my_file import myClass

newClass = myClass(5)
val = newClass.getVal()
print(val * 5)
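For reference, the my_file.py that the import above expects could look something like this (the actual file isn't shown in the thread, so treat this as an assumed counterpart):

# my_file.py (hypothetical contents)
class myClass:
    def __init__(self, val):
        self.val = val

    def getVal(self):
        return self.val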
 

Please provide your feedback so we can add any improvements to our product backlog.

Hi Arvind,

I did the configuration as per your description, but it still fails.

Here are my screenshots:


Can you please suggest how to proceed?

Best regards

Ruben

@Ruben Hartenstein notice the difference in the icons in your screenshot (they are notebooks) vs. the icons in Arvind's post. You need to use the menu option for creating an arbitrary file, not a notebook.


hardy1982
New Contributor II

Hi @Greg Galloway​ ,

thank you very much for your reply. I followed your suggestion, but now I face another error when executing the pipeline.

Can you please advise?

Best regards

Ruben

@Ruben Hartenstein​  I don't see any @dlt.table mentions in your code. I'm assuming that error means the pipeline evaluated your code and didn't find any DLT tables in it. Maybe study a few of the samples as a template?
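For example, a minimal DLT table definition looks roughly like this (the source path and names are placeholders):

import dlt

# Minimal DLT table so the pipeline has something to build; the path is a placeholder
@dlt.table(comment="Example bronze table")
def example_table():
    return spark.read.format("json").load("/path/to/source/data")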

hardy1982
New Contributor II

@Greg Galloway​ 

Is there no way to run the pipeline without @dlt?

I just want to use Hive tables in my code.

If you don't want any Delta Live Tables, then just use the Jobs tab under the Workflows tab. Or there are plenty of other ways of running a notebook in whatever orchestration tool you use (e.g., Azure Data Factory). @Ruben Hartenstein

Kaniz
Community Manager

Hi @Jeremy Colson​, We haven’t heard from you on the last response from @Arvind Ravish​, and I was checking back to see if you have a resolution yet. If you have any solution, please share it with the community as it can be helpful to others. Otherwise, we will respond with more details and try to help.

Greg_Galloway
New Contributor III

I like the approach @Arvind Ravish​ shared since you can't currently use %run in DLT pipelines. However, it took a little testing to be clear on how exactly to make it work.

First, ensure in the Admin Console that the Repos feature is configured as follows:

[screenshot: Admin Console Repos settings]

Then create a new arbitrary file named Import.py using the following menu option. (Note: it does not work with a Notebook.)

[screenshot: create arbitrary file menu option]

The file should contain code like the following:

MYVAR1 = "hi"
MYVAR2 = 99
MYVAR3 = "hello"
 
def factorial(num):
    fact=1
    for i in range(1,num+1):
        fact = fact*i
    return fact

In the DLT notebook, the following code loads Import.py and executes the Python code in it. Then MYVAR1, MYVAR2, MYVAR3, and the factorial function will be available for reference downstream in the pipeline.

import pyspark.sql.functions as f
 
txt = spark.read.text("file:/Workspace/Repos/FolderName/RepoName/Import.py") 
 
#concatenate all lines of the file into a single string
singlerow = txt.agg(f.concat_ws("\r\n", f.collect_list(txt.value)))
data = "\r\n".join(singlerow.collect()[0])
 
#execute that string of python
exec(data)

This appears to work in both Current and Preview channel DLT pipelines at the moment.
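As a rough illustration of what "available downstream" means, a table later in the same notebook could reference those names like this (the table and column names are placeholders):

import dlt
from pyspark.sql.functions import lit

# Hypothetical table that uses names defined by exec'ing Import.py above
@dlt.table(comment="Demonstrates referencing values from Import.py")
def table_using_imported_values():
    return (
        spark.range(factorial(5))                  # factorial() came from Import.py
             .withColumn("greeting", lit(MYVAR1))  # MYVAR1 came from Import.py
    )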

Unfortunately, the os.getcwd() command doesn't appear to work in DLT pipelines (it returns /databricks/driver even when the DLT pipeline notebook is in a Repo), so I haven't figured out a way to use a relative path even when the calling notebook is also in Repos. The following currently fails, and Azure support case 2211240040000106 has been opened:

import os
import pyspark.sql.functions as f
 
txt = spark.read.text(f"file:{os.getcwd()}/Import.py") 
 
#concatenate all lines of the file into a single string
singlerow = txt.agg(f.concat_ws("\r\n", f.collect_list(txt.value)))
data = "\r\n".join(singlerow.collect()[0])
 
#execute that string of python
exec(data)

I'm also having trouble using the import example from aravish without a hardcoded path like sys.path.append("/Workspace/Repos/TopFolder/RepoName") when running in a DLT pipeline. That approach is useful if you only want to import function definitions and don't need to execute top-level Python code or define variables that are visible in the calling notebook's Spark session.

Note: Edited from a previous post where I made a few mistakes.

My setup has two workspaces, one dev and one prod. The Repos folder that the DLT pipelines run from is called dev in dev and prod in prod. I use secret scopes to retrieve the environment-specific path, pass it to sys.path.append(), and then do from my_module import method/class, roughly as sketched below.
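A minimal sketch of the idea, assuming a secret scope named "env" with a key "repo_folder" holding "dev" or "prod" (the scope, key, and module names are placeholders):

import sys

# Resolve the environment-specific repo folder from a secret, then make it importable
repo_folder = dbutils.secrets.get(scope="env", key="repo_folder")  # "dev" or "prod"
sys.path.append(f"/Workspace/Repos/{repo_folder}/RepoName")

from my_module import my_method  # my_module.py lives in that repo folder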
