Creating external tables using gzipped CSV file - S3 URI without extensions

AdityaM
New Contributor II

Hi Databricks community,

Hope you are doing well.
I am trying to create an external table using a Gzipped CSV file uploaded to an S3 bucket.
The S3 URI of the resource doesn't have a file extension, but the content is a Gzipped, comma-separated file that I want to read into the external table.

The command I'm using is:

CREATE EXTERNAL TABLE `mycatalog`.`myExternalTable`(
    `ID` STRING,
    `value` STRING
)
USING CSV
OPTIONS (
    PATH 's3://mybucket/filename',
    HEADER 'false',
    encoding 'UTF-8',
    compression 'gzip',
    delimiter ','
);

If I try to create the table using that exact same file, in the same bucket, with the .gz extension, it works.
But without that extension, it gives me weird, jumbled output (when doing a SELECT * on the table), indicating that decompression is not happening properly.
Is there a way to create the table without adding any extensions to the S3 file path?

Thanks for your time,
Aditya

2 REPLIES

Kaniz
Community Manager

Hi @AdityaM, it seems you're encountering an issue creating an external table from a Gzipped CSV file in Databricks when the S3 URI has no file extension.

Let’s address this step by step.

  1. SerDe (Serializer/Deserializer):

    • When creating an external table, you can specify a SerDe to handle the data format. In your case, you’ve used the USING CSV clause, which implies that Databricks will use the default CSV SerDe.
    • However, for Gzipped CSV files, you might need to use a different SerDe that understands both CSV and Gzip compression.
  2. Alternative Approach:

    • Instead of relying on the default CSV SerDe, consider using the org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe. This SerDe is specifically designed for handling CSV files.
    • Here’s how you can modify your CREATE EXTERNAL TABLE statement:
    CREATE EXTERNAL TABLE `mycatalog`.`myExternalTable` (
        `ID` STRING,
        `value` STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    STORED AS TEXTFILE
    LOCATION 's3://mybucket/filename'
    -- Only keep this property if the file actually has a header row; your sample uses HEADER 'false'.
    TBLPROPERTIES ('skip.header.line.count'='1');
    
    • The key change here is specifying the org.apache.hadoop.hive.serde2.OpenCSVSerde as the SerDe.
  3. Handling Date Columns:

    • If your CSV file contains date columns, ensure that the date format in the file matches the expected format (e.g., YYYY-MM-DD); see the first sketch after this list.
    • You can adjust the column data types accordingly.
  4. File Extensions:

    • Regarding the file extension, it’s generally recommended to include the .gz extension in the S3 file path when dealing with Gzipped files. However, if you want to omit the extension, you can do so.
    • Make sure that the file content is indeed Gzipped, even if the extension is missing; a quick byte-level check is sketched at the end of this reply.
  5. Testing:

    • After modifying your table definition, try creating the external table again without the .gz extension.
    • Run a SELECT * FROM mycatalog.myExternalTable query to verify that the data is correctly decompressed and loaded.
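
To illustrate point 3, here is a minimal PySpark sketch of reading the file with an explicit schema and a matching dateFormat option. It assumes a Databricks notebook where spark is already defined, and it uses a hypothetical event_date column purely for illustration; adjust the schema and format to match your actual file.

# Minimal sketch, assuming a notebook with `spark` defined; `event_date` is a
# hypothetical column used only to illustrate date parsing.
df = (
    spark.read
        .schema("ID STRING, event_date DATE")    # declare the date column explicitly
        .option("header", "false")
        .option("delimiter", ",")
        .option("dateFormat", "yyyy-MM-dd")      # must match how dates are written in the file
        .csv("s3://mybucket/filename.gz")        # a .gz extension lets Spark pick the right codec
)
df.show(5)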

Remember to adjust the column data types and other settings as needed for your specific use case. If you encounter any further issues, feel free to ask for additional assistance! 😊
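
On point 4, a quick way to confirm that the extension-less object really is Gzip-compressed is to check its first two bytes for the gzip magic number (0x1f 0x8b). Below is a minimal Python sketch using boto3; the bucket and key are taken from the question, and it assumes boto3 and AWS credentials are already set up.

import boto3

# Minimal sketch, assuming boto3 and AWS credentials are configured;
# the bucket and key come from the question and may need adjusting.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="mybucket", Key="filename")
first_two = obj["Body"].read(2)

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
if first_two == b"\x1f\x8b":
    print("Object looks Gzip-compressed")
else:
    print("Object does not look Gzip-compressed:", first_two)

If the bytes do match but SELECT * still returns garbled rows, the most likely explanation is that Spark chooses the decompression codec from the file extension when reading, which would also explain why the same file works once it is named with .gz.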


AdityaM
New Contributor II

Hey @Kaniz, thanks for your response.

I tried using a SerDe (I think the OpenCSVSerde should work for me), but unfortunately I'm getting the error below from Unity Catalog:

[UC_DATASOURCE_NOT_SUPPORTED] Data source format hive is not supported in Unity Catalog. SQLSTATE: 0AKUC

Can you please suggest any other workarounds for the above?

Thanks

@Kaniz