DBUtils from databricks-connect and runtime are quite different libraries....
09-17-2024 12:27 AM
If you find yourself using dbutils in any of your code, and you're testing locally vs running on a cluster, there are a few gotchas to be very careful of when it comes to listing files in Volumes or on DBFS.
The DBUtils you'll use locally is the one installed by databricks-connect:
from databricks.connect import DatabricksSession
from pyspark.dbutils import DBUtils
spark = DatabricksSession.builder.profile('dev').getOrCreate()
dbutils = DBUtils(spark)
...and it is not the same as the one already instantiated in your notebooks. These are entirely different libraries with different interfaces, which led me down a very frustrating rabbit hole yesterday afternoon.
If you run `my_files = dbutils.fs.ls(<your path here>)` and you want to find out whether each `FileInfo` object you got back is a directory or a file, the behaviour differs.
Locally:
my_files = dbutils.fs.ls('/some/path/to/files')
first_file = my_files[0]
first_file.isDir()
# ERRORS - function does not exist
first_file.size
# This will be None if a directory
In a Databricks notebook:
my_files = dbutils.fs.ls('/some/path/to/files')
first_file = my_files[0]
first_file.isDir()
# Returns boolean true/false
first_file.size
# This will be zero as an integer if a directory
If you do need to reliably check if a `FileInfo` object is a directory across both environments, you can emulate the `.isDir()` function by using `first_file.name.endswith('/')`. Below is the FileInfo definition from runtime:
# ************* DBC Public API ***************
# This class definition should be kept in sync with FileInfo definition in runtime dbutils.py
# See https://livegrep.dev.databricks.com/view/databricks/runtime/python/pyspark/dbutils.py#L60
class FileInfo(namedtuple('FileInfo', ['path', 'name', 'size', "modificationTime"])):
    def isDir(self):
        return self.name.endswith('/')

    def isFile(self):
        return not self.isDir()

    @staticmethod
    def create_from_jschema(j_file_info):
        return FileInfo(
            path=j_file_info.path(),
            name=j_file_info.name(),
            size=j_file_info.size(),
            modificationTime=j_file_info.modificationTime())
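As a minimal sketch of the workaround above, the two helpers below only touch the `name` and `size` fields, so they should behave the same whichever `FileInfo` you get back. `FakeFileInfo`, `is_dir` and `size_in_bytes` are hypothetical names used for illustration; the stand-in tuple just mirrors the field names so the snippet runs without a cluster.

```python
from collections import namedtuple

# Hypothetical stand-in for the FileInfo objects returned by dbutils.fs.ls.
# It only mirrors the field names so this sketch runs anywhere.
FakeFileInfo = namedtuple('FakeFileInfo', ['path', 'name', 'size', 'modificationTime'])


def is_dir(file_info):
    """Portable directory check: relies only on the trailing slash in .name."""
    return file_info.name.endswith('/')


def size_in_bytes(file_info):
    """Return the size as an int, mapping a missing size (None) to 0."""
    return file_info.size or 0


# Directory as described for the local client (size is None)
local_style = FakeFileInfo('dbfs:/some/path/raw/', 'raw/', None, 0)
# Directory as described for a notebook (size is 0)
notebook_style = FakeFileInfo('dbfs:/some/path/raw/', 'raw/', 0, 0)
# Ordinary file
plain_file = FakeFileInfo('dbfs:/some/path/a.csv', 'a.csv', 1024, 0)

for info in (local_style, notebook_style, plain_file):
    print(info.name, is_dir(info), size_in_bytes(info))
```

In real code you would pass the items returned by `dbutils.fs.ls(...)` straight into `is_dir` instead of calling `.isDir()` on them.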
09-17-2024 12:35 AM
To make things more confusing, the Databricks SDK definition of `FileInfo` is different again:
@dataclass
class FileInfo:
    file_size: Optional[int] = None
    """The length of the file in bytes. This field is omitted for directories."""

    is_dir: Optional[bool] = None
    """True if the path is a directory."""

    modification_time: Optional[int] = None
    """Last modification time of given file in milliseconds since epoch."""

    path: Optional[str] = None
    """The absolute path of the file or directory."""

    def as_dict(self) -> dict:
        """Serializes the FileInfo into a dictionary suitable for use as a JSON request body."""
        body = {}
        if self.file_size is not None: body['file_size'] = self.file_size
        if self.is_dir is not None: body['is_dir'] = self.is_dir
        if self.modification_time is not None: body['modification_time'] = self.modification_time
        if self.path is not None: body['path'] = self.path
        return body

    @classmethod
    def from_dict(cls, d: Dict[str, any]) -> FileInfo:
        """Deserializes the FileInfo from a dictionary."""
        return cls(file_size=d.get('file_size', None),
                   is_dir=d.get('is_dir', None),
                   modification_time=d.get('modification_time', None),
                   path=d.get('path', None))
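If you end up handling both shapes in the same codebase, one option is a small adapter that reduces either object to the same tuple. This is only a sketch under assumptions: `SdkStyleInfo`, `DbutilsStyleInfo` and `describe` are hypothetical names, the two dataclasses are stand-ins that mirror the field names quoted in this thread, and the presence of a `file_size` attribute is used as the discriminator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SdkStyleInfo:
    """Hypothetical stand-in mirroring the SDK FileInfo fields quoted above."""
    file_size: Optional[int] = None
    is_dir: Optional[bool] = None
    modification_time: Optional[int] = None
    path: Optional[str] = None


@dataclass
class DbutilsStyleInfo:
    """Hypothetical stand-in mirroring the dbutils FileInfo fields."""
    path: str
    name: str
    size: Optional[int]
    modificationTime: int


def describe(info):
    """Return (path, is_directory, size_in_bytes) for either FileInfo shape."""
    if hasattr(info, 'file_size'):
        # SDK shape: is_dir and file_size may both be None, so coerce them.
        return info.path, bool(info.is_dir), info.file_size or 0
    # dbutils shape: directories are marked by a trailing slash in .name.
    return info.path, info.name.endswith('/'), info.size or 0


print(describe(SdkStyleInfo(path='/Volumes/x/y', is_dir=True)))
print(describe(DbutilsStyleInfo('dbfs:/a/', 'a/', None, 0)))
print(describe(DbutilsStyleInfo('dbfs:/a.csv', 'a.csv', 12, 0)))
```

Coercing `None` to `False` and `0` is a design choice that keeps downstream code free of `None` checks; if you need to tell "unknown" apart from "not a directory", return the raw values instead.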
09-17-2024 01:58 AM
Hi @stevenayers-bge ,
Thanks for sharing. I didn't know that these interfaces aren't aligned with each other.

