DBUtils from databricks-connect and runtime are quite different libraries

stevenayers-bge
Contributor

If you use dbutils anywhere in your code and you test locally as well as running on a cluster, there are a few gotchas to be very careful of when listing files in Volumes or on DBFS.

The DBUtils you'll use locally is the one installed by databricks-connect:

 

from databricks.connect import DatabricksSession
from pyspark.dbutils import DBUtils

spark = DatabricksSession.builder.profile('dev').getOrCreate()
dbutils = DBUtils(spark)

 

Compared to the one already instantiated in your notebooks, this is an entirely different library with a different interface, which led me down a very frustrating rabbit hole yesterday afternoon.

If you run `my_files = dbutils.fs.ls(<your path here>)` and you want to find out whether each `FileInfo` object you got back is a directory or a file, the behaviour differs.

Locally:

 

my_files = dbutils.fs.ls("/some/path/to/files")
first_file = my_files[0]

first_file.isDir()
# Raises an error: the method does not exist

first_file.size
# None if the entry is a directory

 

In a Databricks notebook:

 

my_files = dbutils.fs.ls("/some/path/to/files")
first_file = my_files[0]

first_file.isDir()
# Returns a boolean (True/False)

first_file.size
# 0 (an integer) if the entry is a directory
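
Given the difference in `size` above (`None` locally, `0` in a notebook), one way to get the same value in both environments is to coerce a missing size to zero. This is only a minimal sketch: `size_or_zero` and the `FakeFileInfo` stand-in are hypothetical names used for illustration, not part of either library.

```python
from collections import namedtuple

# Stand-in for the objects returned by dbutils.fs.ls (illustration only).
FakeFileInfo = namedtuple("FakeFileInfo", ["path", "name", "size", "modificationTime"])


def size_or_zero(file_info):
    """Return the entry's size, treating a missing (None) size as 0."""
    size = getattr(file_info, "size", None)
    return 0 if size is None else size


# A directory as described for the local client (size is None)...
local_dir = FakeFileInfo("/some/path/to/files/sub/", "sub/", None, 0)
# ...and as described for a notebook (size is 0).
notebook_dir = FakeFileInfo("/some/path/to/files/sub/", "sub/", 0, 0)

local_size = size_or_zero(local_dir)
notebook_size = size_or_zero(notebook_dir)
```

With this, code that sums or compares sizes does not need to know which environment produced the listing.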

 

If you need to reliably check whether a `FileInfo` object is a directory in both environments, you can emulate the `.isDir()` method with `first_file.name.endswith('/')`. Below is the `FileInfo` definition from runtime:

 

# ************* DBC Public API ***************


# This class definition should be kept in sync with FileInfo definition in runtime dbutils.py
# See https://livegrep.dev.databricks.com/view/databricks/runtime/python/pyspark/dbutils.py#L60
class FileInfo(namedtuple('FileInfo', ['path', 'name', 'size', "modificationTime"])):
    def isDir(self):
        return self.name.endswith('/')

    def isFile(self):
        return not self.isDir()

    @staticmethod
    def create_from_jschema(j_file_info):
        return FileInfo(
            path=j_file_info.path(),
            name=j_file_info.name(),
            size=j_file_info.size(),
            modificationTime=j_file_info.modificationTime())
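
The `name.endswith('/')` check can be wrapped in a small helper so the same code runs locally and on a cluster. This is a sketch under the assumption stated above, namely that directory entries have a `name` ending in `/` in both environments. `is_dir`, `split_entries` and `ListingEntry` are hypothetical names, and the namedtuple is a stand-in for a real listing.

```python
from collections import namedtuple

# Stand-in with the same fields as FileInfo but deliberately no isDir() method,
# mimicking what the post describes for the local client.
ListingEntry = namedtuple("ListingEntry", ["path", "name", "size", "modificationTime"])


def is_dir(file_info):
    """Emulate FileInfo.isDir() using only the `name` attribute."""
    return file_info.name.endswith("/")


def split_entries(entries):
    """Split a listing into (directories, files)."""
    dirs = [e for e in entries if is_dir(e)]
    files = [e for e in entries if not is_dir(e)]
    return dirs, files


entries = [
    ListingEntry("/some/path/to/files/raw/", "raw/", None, 0),
    ListingEntry("/some/path/to/files/a.csv", "a.csv", 12, 0),
]
dirs, files = split_entries(entries)
```

In real code `entries` would be the result of `dbutils.fs.ls(...)`, and the helper avoids calling `.isDir()` on the object at all.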

 

stevenayers-bge
Contributor

To make things more confusing, the Databricks SDK definition of `FileInfo` differs yet again:

@dataclass
class FileInfo:
    file_size: Optional[int] = None
    """The length of the file in bytes. This field is omitted for directories."""

    is_dir: Optional[bool] = None
    """True if the path is a directory."""

    modification_time: Optional[int] = None
    """Last modification time of given file in milliseconds since epoch."""

    path: Optional[str] = None
    """The absolute path of the file or directory."""

    def as_dict(self) -> dict:
        """Serializes the FileInfo into a dictionary suitable for use as a JSON request body."""
        body = {}
        if self.file_size is not None: body['file_size'] = self.file_size
        if self.is_dir is not None: body['is_dir'] = self.is_dir
        if self.modification_time is not None: body['modification_time'] = self.modification_time
        if self.path is not None: body['path'] = self.path
        return body

    @classmethod
    def from_dict(cls, d: Dict[str, any]) -> FileInfo:
        """Deserializes the FileInfo from a dictionary."""
        return cls(file_size=d.get('file_size', None),
                   is_dir=d.get('is_dir', None),
                   modification_time=d.get('modification_time', None),
                   path=d.get('path', None))
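
If one code path has to cope with both shapes, a duck-typed check is one option. The sketch below is an assumption-laden illustration, not an official API: `entry_is_dir` is a hypothetical helper, the fallback to a trailing `/` on `path` when nothing else is available is my own guess, and `SimpleNamespace` objects stand in for real listings.

```python
from types import SimpleNamespace


def entry_is_dir(entry):
    """Best-effort directory check across the FileInfo shapes shown above."""
    # SDK-style objects expose an optional boolean `is_dir` field.
    flag = getattr(entry, "is_dir", None)
    if isinstance(flag, bool):
        return flag
    # dbutils-style objects expose `name`, which ends with '/' for directories.
    name = getattr(entry, "name", None)
    if isinstance(name, str):
        return name.endswith("/")
    # Assumption: fall back to the path when neither field is usable.
    path = getattr(entry, "path", None) or ""
    return path.endswith("/")


# Stand-ins for illustration only.
sdk_dir = SimpleNamespace(path="/Volumes/cat/sch/vol/raw", is_dir=True, file_size=None)
sdk_file = SimpleNamespace(path="/Volumes/cat/sch/vol/a.csv", is_dir=False, file_size=12)
dbutils_dir = SimpleNamespace(path="dbfs:/tmp/raw/", name="raw/", size=0)
dbutils_file = SimpleNamespace(path="dbfs:/tmp/a.csv", name="a.csv", size=12)

results = [entry_is_dir(e) for e in (sdk_dir, sdk_file, dbutils_dir, dbutils_file)]
```

The explicit `isinstance(flag, bool)` check matters because `is_dir` is `Optional[bool]` in the SDK dataclass, so `None` has to mean "unknown" rather than "not a directory".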

szymon_dybczak
Esteemed Contributor III

Hi @stevenayers-bge ,

Thanks for sharing. I didn't know that these interfaces aren't aligned with each other.