3 weeks ago - last edited 3 weeks ago
Is it possible to load data only using Databricks SDK?
I have a custom library that has to load data into a table. I know about other features like Auto Loader, COPY INTO, and notebooks with Spark DataFrames, but I wonder if it is possible to load data directly into a table from files on local disk using only the Databricks SDK.
Thanks
3 weeks ago
It is not possible to load data directly into a table using the Databricks SDK by reading files from the local disk. The Databricks SDK primarily focuses on managing and interacting with Databricks resources such as clusters, jobs, and libraries; it does not provide direct functionality for loading local files into tables.
However, you can use other methods to achieve this. One approach is to use Databricks file system utilities (dbutils.fs) to move files from the local disk to DBFS (Databricks File System) and then use Spark to load the data into a table. Here is a general outline of the steps:
- Use dbutils.fs.cp or %fs cp to copy files from the local disk to DBFS.
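As a rough illustration of the copy step followed by a Spark load, here is a minimal sketch. It assumes it runs in a notebook where dbutils and spark are already defined and where the driver's local file system is accessible. The helper names, the CSV format, and all paths and table names are hypothetical and not taken from the thread.

```python
import posixpath
from typing import Tuple


def local_to_dbfs_uris(local_path: str, dbfs_dir: str) -> Tuple[str, str]:
    """Build the source and destination URIs passed to dbutils.fs.cp."""
    # Files on the driver's local disk are addressed with the file: scheme.
    src = "file:" + local_path
    # Keep the original file name under the chosen DBFS directory.
    dst = "dbfs:" + posixpath.join(dbfs_dir, posixpath.basename(local_path))
    return src, dst


def load_local_csv(dbutils, spark, local_path: str, dbfs_dir: str, table_name: str) -> None:
    """Copy a local CSV file to DBFS, then append its rows to a table.

    dbutils and spark are the objects a Databricks notebook provides;
    they are passed in explicitly so this helper can live in a library.
    """
    src, dst = local_to_dbfs_uris(local_path, dbfs_dir)
    dbutils.fs.cp(src, dst)
    df = spark.read.format("csv").option("header", "true").load(dst)
    df.write.mode("append").saveAsTable(table_name)
```

In a notebook this could be called as load_local_csv(dbutils, spark, "/tmp/data.csv", "/tmp/staging", "my_table"), where every argument value is a placeholder.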
3 weeks ago
Can you provide the specific error message when you tried to load the data from the workspace file system?
Regarding your SDK code, it appears to be correct for loading data to DBFS.
3 weeks ago
3 weeks ago
Are you using a shared access mode cluster to run this? If yes, can you try it with single user mode?
3 weeks ago
I am currently using serverless compute, but ideally my custom library shouldn't limit the cluster choice.
It feels like I only have a few options:
- upload data to volume and run COPY INTO
- upload data to DBFS and run COPY INTO
- or leverage the pre-configured spark client session, and use spark in my custom library
I am not a Databricks expert, so please correct me if I am wrong.
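For the third option, a minimal sketch of what using the Spark session inside a custom library might look like is shown below. It assumes a small CSV file with a header row that fits in memory, and that a working spark session is passed in by the caller. The function names and the choice of CSV are hypothetical.

```python
import csv
from typing import Iterable, List, Tuple


def parse_csv(lines: Iterable[str]) -> Tuple[List[str], List[tuple]]:
    """Split CSV text lines into a header and a list of row tuples."""
    reader = csv.reader(lines)
    header = next(reader)
    rows = [tuple(row) for row in reader]
    return header, rows


def append_local_csv(spark, local_path: str, table_name: str) -> None:
    """Read a small local CSV file in Python and append it to a table.

    All columns are loaded as strings; an empty file (header only)
    is not handled because Spark cannot infer a schema from no rows.
    """
    with open(local_path, newline="") as f:
        header, rows = parse_csv(f)
    df = spark.createDataFrame(rows, schema=header)
    df.write.mode("append").saveAsTable(table_name)
```

Because the file is read with plain Python and sent through the session, this sketch does not depend on Spark being able to see the local path, but it is only reasonable for small files.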
3 weeks ago
Hi @alwaysmoredata ,
Uploading data to DBFS and then running COPY INTO won't work if you want to use a cluster with shared access mode, because in shared access mode the driver node's local file system is not accessible.
So, based on your requirements, the best way is to upload the data to a volume and run the COPY INTO command.
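A minimal sketch of the volume plus COPY INTO route using the Databricks SDK for Python follows. It assumes the databricks-sdk package is installed, that authentication is already configured in the environment, and that a SQL warehouse is available to run the statement. The helper names, the CSV format, and the warehouse ID, volume path, and table name arguments are placeholders, not values from this thread.

```python
import os


def build_copy_into_csv(table: str, source_path: str) -> str:
    """Build a COPY INTO statement for a CSV file with a header row."""
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_path}'\n"
        "FILEFORMAT = CSV\n"
        "FORMAT_OPTIONS ('header' = 'true')"
    )


def upload_and_copy(local_path: str, volume_dir: str, table: str, warehouse_id: str):
    """Upload a local file to a volume, then run COPY INTO on a SQL warehouse."""
    # Imported here so the SQL helper above works without the SDK installed.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # reads credentials from the environment or a config profile
    target = f"{volume_dir.rstrip('/')}/{os.path.basename(local_path)}"
    with open(local_path, "rb") as f:
        w.files.upload(target, f, overwrite=True)
    # Returns the statement response; callers should inspect its status.
    return w.statement_execution.execute_statement(
        statement=build_copy_into_csv(table, target),
        warehouse_id=warehouse_id,
    )
```

In this sketch the SDK only uploads the file and submits the SQL. The load itself is still performed by COPY INTO on the warehouse, so the target table must already exist.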
3 weeks ago
Got it. The cluster mode matters because, due to security constraints, access to the local file system is more restricted on a shared access cluster than on a single user cluster. Since serverless compute acts as a shared cluster, your statements above are correct for serverless.