If you find yourself using dbutils in any of your code, and you're testing locally as well as running on a cluster, there are a few gotchas to be careful of when listing files in Volumes or on DBFS.
The DBUtils you'll use locally is installed by databricks-connect:
from databricks.connect import DatabricksSession
from pyspark.dbutils import DBUtils
spark = DatabricksSession.builder.profile('dev').getOrCreate()
dbutils = DBUtils(spark)
...compared to the one already instantiated in your notebooks, these are entirely different libraries with different interfaces, and that difference led me down a very frustrating rabbit hole yesterday afternoon.
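For example, with that local dbutils you can list either a Unity Catalog volume or a DBFS path exactly as you would in a notebook (the catalog, schema, volume and DBFS paths below are just placeholders):

dbutils.fs.ls('/Volumes/my_catalog/my_schema/my_volume/')
dbutils.fs.ls('dbfs:/tmp/')

It's the FileInfo objects these calls return that behave differently between the two environments.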
If you run `my_files = dbutils.fs.ls(<your path here>)` and want to find out whether the `FileInfo` objects you get back are directories or files, the behaviour differs between the two environments.
Locally:
my_files = dbutils.fs.ls('/some/path/to/files')
first_file = my_files[0]
first_file.isDir()
# ERRORS - function does not exist
first_file.size
# This will be None if a directory
In a Databricks notebook:
my_files = dbutils.fs.ls('/some/path/to/files')
first_file = my_files[0]
first_file.isDir()
# Returns boolean true/false
first_file.size
# This will be zero as an integer if a directory
If you do need to reliably check if a `FileInfo` object is a directory across both environments, you can emulate the `.isDir()` function by using `first_file.name.endswith('/')`. Below is the FileInfo definition from runtime:
# ************* DBC Public API ***************
# This class definition should be kept in sync with FileInfo definition in runtime dbutils.py
# See https://livegrep.dev.databricks.com/view/databricks/runtime/python/pyspark/dbutils.py#L60
class FileInfo(namedtuple('FileInfo', ['path', 'name', 'size', "modificationTime"])):
    def isDir(self):
        return self.name.endswith('/')

    def isFile(self):
        return not self.isDir()

    @staticmethod
    def create_from_jschema(j_file_info):
        return FileInfo(
            path=j_file_info.path(),
            name=j_file_info.name(),
            size=j_file_info.size(),
            modificationTime=j_file_info.modificationTime())
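If you'd rather not repeat that string check at every call site, you can wrap it in a tiny helper. This is just a sketch (the `is_dir` name is mine); it relies on the same trailing-slash convention the runtime class uses above:

def is_dir(file_info):
    # Directory entries always have a trailing slash on their name,
    # in both the databricks-connect and notebook FileInfo objects.
    return file_info.name.endswith('/')

my_files = dbutils.fs.ls('/some/path/to/files')
directories = [f for f in my_files if is_dir(f)]
files = [f for f in my_files if not is_dir(f)]

This works unchanged whether you're running through databricks-connect or inside a notebook.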