Data Engineering

How to query Delta tables stored in S3 through a Databricks SQL endpoint?

athjain
New Contributor III

The Delta tables produced after ETL are stored in S3 in CSV or Parquet format, so the question is how to allow a Databricks SQL endpoint to run queries over the files saved in S3.

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

First mount the bucket:

%python
# AWS keys read from a Databricks secret scope (scope and key names are placeholders);
# the secret key must be URL-encoded before it is used in the mount URI
access_key = dbutils.secrets.get(scope="<scope>", key="<aws-access-key>")
secret_key = dbutils.secrets.get(scope="<scope>", key="<aws-secret-key>")
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"

dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % mount_name)

then register the Delta folder as a table:

%sql
-- point LOCATION at the Delta folder under the mount created above
CREATE TABLE `t`
USING delta
LOCATION '/mnt/<mount-name>/<path-to-delta-folder>'

View solution in original post

5 REPLIES

-werners-
Esteemed Contributor III

You provide the data by using a Hive metastore. This basically puts a semantic view on top of your files, so the data is served as a classic table (e.g. saveAsTable or CREATE TABLE).
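For example, a minimal sketch, assuming a DataFrame df produced by your ETL (the database and table names are placeholders):

%python
# saveAsTable writes the data in Delta format and registers the table in the
# Hive metastore, so a SQL endpoint can then query it like a classic table
df.write.format("delta").mode("overwrite").saveAsTable("my_db.my_table")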

A SQL endpoint can access this Hive metastore.

Do not forget to set up data access (the SQL endpoint needs access to the data, e.g. through a service principal or an instance profile).

athjain
New Contributor III

Thanks @Werner Stinckens for your response.

But the question is: if in the future I would like to move away from Databricks, and the Delta tables are stored in DBFS (backed by S3 using the Parquet file format), how will I be able to retrieve them from the files in S3?

Instead of using DBFS to store the Delta tables (saveAsTable stores the Delta table in DBFS), how can we use S3 directly and retrieve the tables via the SQL endpoint?

Or how can that SQL endpoint be pointed directly at S3 for the tables?

-werners-
Esteemed Contributor III

Ok, I get it.

In DBFS you have the option to use managed tables (the data is managed by the Databricks workspace) or unmanaged tables (the data resides in external storage like S3, ADLS, etc.).

So in your case you would be looking at unmanaged tables.

The SQL endpoint of course needs a table interface, so you still have to define tables on top of your data.

But those are only metadata.
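A minimal sketch of that, assuming a DataFrame df and placeholder bucket/table names: writing with an explicit path keeps the data in S3 and stores only a reference in the metastore.

%python
# Unmanaged (external) table: the Delta files stay in S3, the metastore only
# holds metadata pointing at them; bucket, database and table names are placeholders
(df.write.format("delta")
   .mode("overwrite")
   .option("path", "s3a://<aws-bucket-name>/tables/my_table")
   .saveAsTable("my_db.my_table"))

Dropping such a table removes only the metastore entry; the Delta files in S3 stay where they are.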

You could even opt for an external metastore.

For example check this or this.

Hubert-Dudek
Esteemed Contributor III

First mount the bucket:

%python
# AWS keys read from a Databricks secret scope (scope and key names are placeholders);
# the secret key must be URL-encoded before it is used in the mount URI
access_key = dbutils.secrets.get(scope="<scope>", key="<aws-access-key>")
secret_key = dbutils.secrets.get(scope="<scope>", key="<aws-secret-key>")
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"

dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % mount_name)

then register the Delta folder as a table:

%sql
-- point LOCATION at the Delta folder under the mount created above
CREATE TABLE `t`
USING delta
LOCATION '/mnt/<mount-name>/<path-to-delta-folder>'
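Once the table is registered in the metastore (and the endpoint's data access is configured), the SQL endpoint can query it like any other table, for example SELECT * FROM t.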

Anonymous
Not applicable

Hey @Athlestan Jain

How are you doing?

Thanks for posting your question. Do you think you were able to resolve the issue?

We'd love to hear from you.
