Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DBUtils from databricks-connect and runtime are quite different libraries....

stevenayers-bge
Contributor

If you find yourself using dbutils in any of your code, and you're testing locally versus running on a cluster, there are a few gotchas to be very careful of when it comes to listing files in Volumes or files on DBFS.

The DBUtils you'll use locally is the one installed by databricks-connect:

 

from databricks.connect import DatabricksSession
from pyspark.dbutils import DBUtils

spark = DatabricksSession.builder.profile('dev').getOrCreate()
dbutils = DBUtils(spark)

 

...compared to the one already instantiated in your notebooks. These are entirely different libraries with different interfaces, which led me down a very frustrating rabbit hole yesterday afternoon.

If you run `my_files = dbutils.fs.ls(<your path here>)` and you want to find out if the `FileInfo` objects you got back are a directory or a file, the behaviour differs.

Locally:

 

my_files = dbutils.fs.ls('/some/path/to/files')
first_file = my_files[0]

first_file.isDir()
# ERRORS - function does not exist

first_file.size
# This will be None if a directory

 

In a Databricks notebook:

 

my_files = dbutils.fs.ls('/some/path/to/files')
first_file = my_files[0]

first_file.isDir()
# Returns boolean true/false

first_file.size
# This will be zero as an integer if a directory 

 

If you do need to reliably check if a `FileInfo` object is a directory across both environments, you can emulate the `.isDir()` function by using `first_file.name.endswith('/')`. Below is the FileInfo definition from runtime:

 

# ************* DBC Public API ***************


# This class definition should be kept in sync with FileInfo definition in runtime dbutils.py
# See https://livegrep.dev.databricks.com/view/databricks/runtime/python/pyspark/dbutils.py#L60
class FileInfo(namedtuple('FileInfo', ['path', 'name', 'size', "modificationTime"])):
    def isDir(self):
        return self.name.endswith('/')

    def isFile(self):
        return not self.isDir()

    @staticmethod
    def create_from_jschema(j_file_info):
        return FileInfo(
            path=j_file_info.path(),
            name=j_file_info.name(),
            size=j_file_info.size(),
            modificationTime=j_file_info.modificationTime())
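To make the emulation concrete, here is a minimal sketch of a portable check. The `FileInfo` namedtuple below is a local stand-in mirroring the runtime definition above, purely for demonstration:

```python
from collections import namedtuple

# Local stand-in mirroring the runtime FileInfo shape (demonstration only).
FileInfo = namedtuple('FileInfo', ['path', 'name', 'size', 'modificationTime'])

def is_directory(file_info) -> bool:
    """Directory check that works with both dbutils flavours.

    Uses isDir() when the object provides it (runtime dbutils); otherwise
    falls back to the trailing-slash convention, which holds in both
    environments.
    """
    is_dir = getattr(file_info, 'isDir', None)
    if callable(is_dir):
        return is_dir()
    return file_info.name.endswith('/')

print(is_directory(FileInfo('/Volumes/a/b/', 'b/', None, 0)))        # True
print(is_directory(FileInfo('/Volumes/a/b/x.csv', 'x.csv', 42, 0)))  # False
```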

 

2 REPLIES

stevenayers-bge
Contributor

To make things more confusing, the Databricks SDK definition of `FileInfo` changes again:

from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FileInfo:
    file_size: Optional[int] = None
    """The length of the file in bytes. This field is omitted for directories."""

    is_dir: Optional[bool] = None
    """True if the path is a directory."""

    modification_time: Optional[int] = None
    """Last modification time of given file in milliseconds since epoch."""

    path: Optional[str] = None
    """The absolute path of the file or directory."""

    def as_dict(self) -> dict:
        """Serializes the FileInfo into a dictionary suitable for use as a JSON request body."""
        body = {}
        if self.file_size is not None: body['file_size'] = self.file_size
        if self.is_dir is not None: body['is_dir'] = self.is_dir
        if self.modification_time is not None: body['modification_time'] = self.modification_time
        if self.path is not None: body['path'] = self.path
        return body

    @classmethod
    def from_dict(cls, d: Dict[str, any]) -> FileInfo:
        """Deserializes the FileInfo from a dictionary."""
        return cls(file_size=d.get('file_size', None),
                   is_dir=d.get('is_dir', None),
                   modification_time=d.get('modification_time', None),
                   path=d.get('path', None))
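
If your code touches all three APIs, a duck-typed normaliser can hide the differences. This is only a sketch based on the definitions quoted in this thread; the `SdkFileInfo` class below is a hypothetical stand-in for the SDK dataclass above, so the example runs without a workspace:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SdkFileInfo:
    """Hypothetical stand-in for the SDK FileInfo dataclass (demo only)."""
    path: str
    is_dir: Optional[bool] = None
    file_size: Optional[int] = None

def file_info_is_dir(obj) -> bool:
    """Normalise the 'is this a directory?' check across FileInfo flavours:
    - SDK dataclass: boolean `is_dir` field
    - runtime dbutils: `isDir()` method
    - databricks-connect: no isDir(); directory names end with '/'
    """
    if getattr(obj, 'is_dir', None) is not None:
        return bool(obj.is_dir)
    is_dir_method = getattr(obj, 'isDir', None)
    if callable(is_dir_method):
        return is_dir_method()
    return obj.name.endswith('/')

print(file_info_is_dir(SdkFileInfo('/Volumes/a/b', is_dir=True)))  # True
```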

szymon_dybczak
Contributor

Hi @stevenayers-bge ,

Thanks for sharing. I didn't know that these interfaces aren't aligned with each other.
