Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Is it possible to load data only using Databricks SDK?

alwaysmoredata
New Contributor

Is it possible to load data only using Databricks SDK?

I have custom library that has to load data to a table, and I know about other features like autoloader, COPY INTO, notebook with spark dataframe... but I wonder if it is possible to load data directly to a table just using the Databricks SDK using files from local disk.

Thanks

8 REPLIES

Walter_C
Databricks Employee

It is not possible to load data directly into a table using the Databricks SDK by reading files from the local disk. The Databricks SDK primarily focuses on managing and interacting with Databricks resources such as clusters, jobs, and libraries; it does not provide direct functionality for loading data from local disk files into tables.

However, you can use other methods to achieve this. One approach is to use the Databricks file system utilities (dbutils.fs) to move files from the local disk to DBFS (Databricks File System) and then use Spark to load the data into a table. Here is a general outline of the steps (a short sketch follows the list):

  1. Move Files to DBFS: Use dbutils.fs.cp or %fs cp to copy files from the local disk to DBFS.
  2. Load Data with Spark: Use Spark to read the files from DBFS and load them into a table.
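For example, a minimal sketch of those two steps, assuming it runs in a notebook where dbutils and spark are already defined; the paths and table name are placeholders:

# 1. Copy the file from the driver's local disk to DBFS (placeholder paths)
dbutils.fs.cp("file:/tmp/sample.json", "dbfs:/tmp/staging/sample.json")

# 2. Read the file with Spark and load it into a table (illustrative table name)
df = spark.read.json("dbfs:/tmp/staging/sample.json")
df.write.mode("append").saveAsTable("my_company.test.event_sample")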
     

Thanks for the incredibly quick reply - you're faster than any AI assistant on the market!

I had a feeling this was the case, but it's great to have it confirmed.

What options do I have for loading data without using Spark? I'm working with a custom library that I'd like data scientists to use within Databricks notebooks to upload data in a standardized way. My library doesn't use Spark; it currently runs "COPY INTO from s3", but I wonder if I can upload files from a notebook without having to configure something like an S3 stage.

Would it be a bad idea to upload the files to a Volume and then execute a COPY INTO command to load the data from the Volume? Is this the only option available? I tried to run COPY INTO from the workspace file system, but it failed with a forbidden error.

Additionally, can I use the following snippet to upload data to DBFS?

from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
dbutils = w.dbutils
# e.g. copy a local file to DBFS (placeholder paths)
dbutils.fs.cp("file:/tmp/sample.json", "dbfs:/tmp/sample.json")

Walter_C
Databricks Employee

Can you provide the specific error message when you tried to load the data from the workspace file system?

Regarding your SDK code, it looks correct for loading data to DBFS.

SparkConnectGrpcException: (java.lang.SecurityException) Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden

COPY INTO my_company.test.event_sample
FROM 'file:/Workspace/Users/alwaysmoredata@mycompany.com/sample.json'
FILEFORMAT = JSON;

Walter_C
Databricks Employee

Are you using a shared access mode cluster to run this? If yes, can you try it with single user mode?

Currently using serverless compute, but ideally my custom library shouldn't limit the cluster choice.

It feels like I only have a few options:

- upload data to a Volume and run COPY INTO

- upload data to DBFS and run COPY INTO

- or leverage the pre-configured Spark session and use Spark in my custom library (see the sketch after this list)

I am not a Databricks expert - please correct me if I am wrong.
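For illustration, a minimal sketch of that third option, assuming the library simply accepts the Spark session that is already available in the caller's notebook; the function name, path, and table name are hypothetical:

def load_json_to_table(spark, path: str, table: str) -> None:
    # 'spark' is the session object already defined in the Databricks notebook
    df = spark.read.json(path)
    df.write.mode("append").saveAsTable(table)

# In a notebook:
# load_json_to_table(spark, "/Volumes/my_company/test/staging/sample.json",
#                    "my_company.test.event_sample")

Passing the session in keeps the library agnostic to how the session was created, so it works the same on classic clusters and serverless compute.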

Hi @alwaysmoredata ,

Uploading data to DBFS and then running COPY INTO won't work if you want to use a cluster with shared access mode. This is because in shared access mode, the driver node's local file system is not accessible.

So, based on your requirements, the best way is to upload the data to a Volume and run the COPY INTO command.
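For reference, a minimal sketch of that approach driven entirely by the Databricks SDK, assuming a Unity Catalog Volume and a SQL warehouse are available; the Volume path, warehouse ID, and table name below are placeholders:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Upload a local file into a Unity Catalog Volume (placeholder path)
volume_path = "/Volumes/my_company/test/staging/sample.json"
with open("sample.json", "rb") as f:
    w.files.upload(volume_path, f, overwrite=True)

# Run COPY INTO through a SQL warehouse (placeholder warehouse ID)
w.statement_execution.execute_statement(
    warehouse_id="<sql-warehouse-id>",
    statement=f"""
        COPY INTO my_company.test.event_sample
        FROM '{volume_path}'
        FILEFORMAT = JSON
    """,
)

This keeps the whole flow inside the SDK and avoids configuring an external stage such as S3.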

 

Walter_C
Databricks Employee

Got it. The reason I asked about the cluster is that, with a shared access cluster, access to the local file system is more restricted than with a single user cluster due to security constraints. Since you are using serverless, it behaves like a shared cluster, so in that case your statements above are correct.
