Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Can external tables be created backed by current cloud files without ingesting files in Databricks?

Jennifer
New Contributor III

Hi,

We have a huge number of Parquet files in S3 with the path pattern <bucket>/<customer>/yyyy/mm/dd/hh/.*.parquet.

The question is: can I create an external table in Unity Catalog from this external location without actually ingesting the files? Like what can be done in AWS Athena:

CREATE EXTERNAL TABLE tbl(
<schema>
)
PARTITIONED BY (...)
LOCATION
's3://<bucket>/'

I have tried a couple of ways to create such a table, and they all resulted in errors, either saying there are no Delta commit logs or "DELTA_CREATE_TABLE_WITH_NON_EMPTY_LOCATION".

In other words, I don't want to save the raw data in Databricks again since we already have it in S3. However, I want to be able to query the raw data from Databricks.
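For reference, a statement of roughly this shape reproduces the second error (table and column names below are just placeholders): without a USING clause the table defaults to Delta, and creating a Delta table over a non-empty directory of plain Parquet files is rejected.

-- Hypothetical example: no USING clause, so Databricks assumes Delta and
-- fails with DELTA_CREATE_TABLE_WITH_NON_EMPTY_LOCATION on this prefix.
CREATE TABLE main.raw.customer_events (
  account_number STRING,
  event_ts TIMESTAMP
)
LOCATION 's3://<bucket>/<customer>/';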

6 REPLIES

Nik_Vanderhoof
Contributor

Hi,

Yes, Databricks does support creating external tables whose storage is not managed by Unity Catalog. Have you seen the following page of documentation: https://docs.databricks.com/aws/en/tables/external? It describes how to create external tables, and has an example notebook.
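In case it helps, here is a minimal sketch of what that page describes (the credential, location, catalog, and schema names are placeholders to replace with your own):

-- Register the S3 prefix as a Unity Catalog external location
-- (requires a storage credential that can access the bucket).
CREATE EXTERNAL LOCATION IF NOT EXISTS raw_parquet_loc
URL 's3://<bucket>/'
WITH (STORAGE CREDENTIAL my_storage_credential);

-- Register an external table over the existing Parquet files; USING PARQUET
-- avoids the Delta default, which expects a _delta_log at the location.
CREATE TABLE IF NOT EXISTS main.raw.customer_events
USING PARQUET
LOCATION 's3://<bucket>/<customer>/';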

Jennifer
New Contributor III

I have read this doc, but I had no luck getting it to work.

Nik_Vanderhoof
Contributor

Can you share the commands you have tried, the errors they produced, and which Databricks Runtime versions you used? That will help with debugging.

Data_Mavericks
New Contributor II

I think the issue is that you are trying to create a Delta table in Unity Catalog from a Parquet source without converting it to Delta format first.

Unity Catalog will not allow a Delta table to be created in a non-empty location. Since you want to expose the data directly in UC without writing a Delta log, you can try creating an external table:
%sql
CREATE EXTERNAL TABLE catalog_name.schema_name.table_name
USING PARQUET
LOCATION '<s3-location>';
But you will lose versioning and time travel.

Hope this works. Otherwise, convert the Parquet into Delta in place first and then try to register the external table. But I am sure I have done this before, exposing data directly in the catalog by simply registering an external table.
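If you go the in-place conversion route, a minimal sketch (bucket and path are placeholders; note that CONVERT TO DELTA can only pick up partition columns from Hive-style directories like yyyy=2024/mm=01/, not from a plain yyyy/mm/dd/hh layout):

-- Writes a _delta_log next to the existing Parquet files; no data is copied.
-- For Hive-style partition directories you would add PARTITIONED BY (col TYPE, ...).
CONVERT TO DELTA parquet.`s3://<bucket>/<customer>/`;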

@Jennifer

harshit

Jennifer
New Contributor III

@Data_Mavericks, I have actually tried your suggested way, but the table is empty after creating it. Not sure why.

A recent Databricks feature that can federate AWS Glue may help me out, since we already have these files in S3 defined as tables in Glue.

Data_Mavericks
New Contributor II

@Jennifer I am not sure about this; I have done it through ADLS Gen2. For your approach, a managed role will be required, I think. But if you can share some more details, that would help. If you managed to expose the table in UC but are not seeing data, I have recently done this by creating a VIEW in UC that directly exposes my source files.

 

-- Replace with your actual catalog, schema, and view names
CREATE OR REPLACE VIEW catalog_name.schema_name.view_name
AS
SELECT
  Record_type,
  ACCOUNT_NUMBER,
  CUSTOMER_NUMBER,
  MEMBERSHIP_TYPE_CODE,
  POINTS_BALANCE,
  POINTS_EARNED,
  ACCOUNT_TERMINATED_DATE
FROM parquet.`s3://your-bucket-name/path/to/parquet/files/`
-- Optional filter to ensure data integrity, if needed
WHERE ACCOUNT_NUMBER IS NOT NULL;

I think this is the simplest way to expose the data if you don't want to hold it anywhere in between. As per my understanding, if you want to create a table, it is best practice to hold the data somewhere in an external location and then register it.
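If the external table came up empty, it may also be worth querying the files directly by path (the path below is a placeholder) to confirm Databricks can read them at all before registering a table or view over them:

-- Direct path query over the Parquet files; a non-zero count confirms
-- the files are readable from this location.
SELECT COUNT(*) FROM parquet.`s3://your-bucket-name/path/to/parquet/files/`;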

Let me know if I correctly understood your issue.

harshit