Is it possible to load data only using Databricks SDK?
12-30-2024 06:26 AM - edited 12-30-2024 06:29 AM
I have a custom library that has to load data into a table. I know about other options such as Auto Loader, COPY INTO, or a notebook with a Spark DataFrame, but I wonder whether it is possible to load data directly into a table using only the Databricks SDK, with files from local disk.
Thanks
12-30-2024 06:29 AM
It is not possible to load data directly into a table using the Databricks SDK by reading files from the local disk. The Databricks SDK primarily focuses on managing and interacting with Databricks resources such as clusters, jobs, and libraries; it does not provide direct functionality for loading data from local disk files into tables.
However, you can use other methods to achieve this. One approach is to use the Databricks file system utilities (dbutils.fs) to move files from the local disk to DBFS (Databricks File System) and then use Spark to load the data into a table. Here is a general outline of the steps:
- Move files to DBFS: use dbutils.fs.cp or %fs cp to copy files from the local disk to DBFS.
- Load data with Spark: use Spark to read the files from DBFS and load them into a table (see the sketch below).
12-30-2024 06:44 AM
12-30-2024 06:53 AM
Can you provide the specific error message you got when you tried to load the data from the workspace file system?
Regarding your SDK code, it looks correct for loading data to DBFS.
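For reference, a minimal sketch of what uploading a local file to DBFS with the Python SDK can look like; the paths are illustrative and WorkspaceClient is assumed to pick up credentials from the environment or a config profile:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # auth resolved from env vars or ~/.databrickscfg

# Stream a file from the local disk into DBFS (paths are illustrative).
with open("/local/data/events.csv", "rb") as f:
    w.dbfs.upload("/tmp/staging/events.csv", f, overwrite=True)
```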
12-30-2024 06:58 AM
12-30-2024 07:03 AM
Are you using a shared access mode cluster to run this? If yes, can you try it with single user mode?
12-30-2024 07:11 AM
Currently using serverless compute, but ideally my custom library shouldn't limit the cluster choice.
It feels like I only have a few options:
- upload data to a volume and run COPY INTO
- upload data to DBFS and run COPY INTO
- or leverage the pre-configured Spark client session and use Spark in my custom library
I am not a Databricks expert - please correct me if I am wrong.
12-30-2024 07:22 AM
Hi @alwaysmoredata ,
Uploading data to DBFS and then running COPY INTO won't work if you want to use a cluster with shared access mode, because in shared access mode the driver node's local file system is not accessible.
So, based on your requirements, the best approach is to upload the data to a volume and run the COPY INTO command.
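A minimal sketch of that flow using only the Python SDK, assuming a Unity Catalog volume at /Volumes/main/default/staging, a target table main.default.events, and an existing SQL warehouse; all names and the warehouse id are illustrative:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Upload the local file into a Unity Catalog volume (paths are illustrative).
with open("/local/data/events.csv", "rb") as f:
    w.files.upload("/Volumes/main/default/staging/events.csv", f, overwrite=True)

# Run COPY INTO against the staged files through a SQL warehouse.
w.statement_execution.execute_statement(
    warehouse_id="<warehouse-id>",  # replace with a real warehouse id
    statement="""
        COPY INTO main.default.events
        FROM '/Volumes/main/default/staging/'
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('header' = 'true')
    """,
)
```

This keeps the whole load inside the SDK (no notebook or cluster-attached code), which also fits the serverless constraint mentioned above.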
12-30-2024 07:20 AM
Got it. The reason for asking about the cluster is that with a shared access cluster, access to the local file system is more restricted than with a single user cluster due to security constraints. Since you are using serverless, which behaves like a shared cluster, your statements above are correct for serverless.

