
How to develop with Databricks Connect smoothly?

Johannes_E
New Contributor II

We are working with Databricks Connect and Visual Studio Code in our project. We mainly want to program in the IDE (VS Code) so that we can take advantage of the IDE over notebooks. Therefore, we write most of the code in .py files and only use notebooks to start a process step of our data pipeline. Another advantage of .py files over notebooks is that we can keep tests in separate Python files and run them with pytest. These tests also run automatically in GitLab inside a Docker image (with PySpark installed) that is used to check whether a merge request can be merged.
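As a rough illustration of that test setup, here is a minimal sketch of a pytest fixture that builds a plain local PySpark session for the CI Docker image. The fixture name, the file layout, and the sample test are assumptions for illustration, not the poster's actual code.

# conftest.py -- a minimal sketch; the local-session fallback is an
# assumption, not the poster's actual setup.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # The CI Docker image has plain pyspark installed, so a local session
    # is enough to exercise DataFrame logic without any cluster.
    session = (
        SparkSession.builder.master("local[1]")
        .appName("merge-request-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

# test_example.py -- a hypothetical test using the fixture above
def test_row_count(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2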

Example of a notebook we use:

# Cell 1
from datalayer_warenflussprognose.get_files_from_sftp import (
    SftpConfig,
    get_files_from_sftp,
)
from datalayer_warenflussprognose.constants import TEMP_FOLDER, TRADEPARTNERS
from datalayer_warenflussprognose.utility import create_temp_folder

# Cell 2
# Set the SFTP configuration

SFTP_CONFIG = SftpConfig(
    sftp_user=dbutils.secrets.get(scope="KeyVault", key="transfer-markant-user"),
    sftp_user_dm=dbutils.secrets.get(scope="KeyVault", key="dm-sftp-username"),
    sftp_pw=dbutils.secrets.get(scope="KeyVault", key="transfer-markant-password"),
    sftp_pw_dm=dbutils.secrets.get(scope="KeyVault", key="dm-sftp-password"),
    ssh_key_dm=dbutils.secrets.get(scope="KeyVault", key="dm-sftp-ssh-key"),
)

# Cell 3
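# Download the raw files from the SFTP server for every trade partner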
for tradepartner in TRADEPARTNERS:
    local_folder = TEMP_FOLDER / tradepartner.name
    get_files_from_sftp(
        sftp_config=SFTP_CONFIG, partner=tradepartner, local_folder=local_folder
    )

In this notebook, a few configuration variables are created and then the function “get_files_from_sftp” is called for each of our trade partners; it retrieves raw data (CSVs / zipped CSVs) from an SFTP server. This requires the SFTP credentials, which are stored in our Azure Key Vault and which we retrieve using “dbutils” (see cell 2). When working with Databricks Connect, there are several problems:

1. You cannot debug a notebook! When right-clicking a notebook in VS Code and choosing "Run on Databricks", the only option is "Run File as Workflow". For us, however, debugging is extremely important: it is the only way to step to the point where an error occurs and fix it right there. If I can only execute the code and get feedback that there is an error somewhere, it is quite difficult to fix.

In contrast, you can debug a .py file with Databricks Connect. When right-clicking a .py file in VS Code and choosing "Run on Databricks", there is NOT only the option "Run File as Workflow" but also others such as "Debug current file with Databricks Connect". So, is there also a way to debug notebooks using Databricks Connect?

2. dbutils is causing us problems: if, for example, you briefly move the code from the notebook into a .py file (so that you can debug it) and start debug mode, the “dbutils” calls fail. The reason is probably that during debugging, Databricks Connect executes all plain Python code locally in the virtual environment on your own PC and only runs the parts Spark needs (e.g. when Spark creates DataFrames) on Databricks. At least that is what I have read. Since “dbutils” is presumably treated as plain Python code, an attempt is made to execute it locally. But dbutils does not exist locally, only in Databricks. It cannot be installed locally because the package is not available on pypi.org. Also, dbutils requires the Databricks File System (DBFS) to run, which is not available locally.
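For context, here is a minimal sketch of the local/remote split described above, assuming databricks-connect 13.x or later with a configured default profile (not part of the original post):

# A minimal sketch, assuming databricks-connect >= 13 and a configured
# default profile.
from databricks.connect import DatabricksSession

# Plain Python like this runs locally in your virtual environment...
greeting = "hello from my laptop"
print(greeting)

# ...while Spark operations are shipped to the cluster: this DataFrame
# is created and counted remotely on Databricks.
spark = DatabricksSession.builder.getOrCreate()
print(spark.range(10).count())

# dbutils is NOT defined in this local process, which is why notebook
# code that calls dbutils fails when copied into a .py file and debugged.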

Hence the overall question: How can we develop cleanly with Databricks Connect locally in the IDE without the problems described?

1 ACCEPTED SOLUTION

ChrisChieu
Databricks Employee
  1. You can set breakpoints and debug within notebook cells. There's an example in this DAIS talk at 15:27; I recommend the entire talk as a demo. For further reading, see the documentation on debugging notebook cells with Databricks Connect: https://docs.databricks.com/en/dev-tools/vscode-ext/notebooks.html
  2. For dbutils, you can use the Python SDK (see the sketch below): https://docs.databricks.com/en/dev-tools/sdk-python.html#use-databricks-utilities
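As an illustration of point 2, here is a minimal sketch of retrieving the same secrets locally through the Databricks SDK. It assumes the databricks-sdk package is installed and authentication is already configured (e.g. via ~/.databrickscfg); the scope and key names are taken from the question.

# A minimal sketch, assuming databricks-sdk is installed and auth is
# configured via environment variables or ~/.databrickscfg.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# The SDK exposes a dbutils-compatible object, so notebook code that
# calls dbutils.secrets can run locally with a one-line change:
dbutils = w.dbutils
sftp_user = dbutils.secrets.get(scope="KeyVault", key="transfer-markant-user")

On a Databricks cluster the built-in dbutils is already defined, so a small guard (for example, only assigning dbutils = WorkspaceClient().dbutils when the name is undefined) keeps a single code path for both environments.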


