
partition on a csv file

andrew0117
Contributor

When I use SQL like "create table myTable (column1 string, column2 string) using csv options('delimiter' = ',', 'header' = 'true') location 'pathToCsv'" to create a table from a single CSV file stored in a folder within an Azure Data Lake container, the table is created successfully with all the data from the CSV file. However, when I drop the table and recreate it with the same statement plus "partitioned by (column1)" before the "location" clause, the table is again created successfully but no data is loaded. Can someone explain why this happens? Thanks!
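For reference, here is a minimal sketch of the two statements being compared, wrapped in spark.sql so they can be run from a Scala notebook; 'pathToCsv' is the placeholder path from the post, not a real location:

    // Works: external CSV table, SELECT returns all rows from the file
    spark.sql("""
      CREATE TABLE myTable (column1 STRING, column2 STRING)
      USING CSV
      OPTIONS ('delimiter' = ',', 'header' = 'true')
      LOCATION 'pathToCsv'
    """)

    // Created successfully, but SELECT returns no rows
    spark.sql("""
      CREATE TABLE myTable (column1 STRING, column2 STRING)
      USING CSV
      OPTIONS ('delimiter' = ',', 'header' = 'true')
      PARTITIONED BY (column1)
      LOCATION 'pathToCsv'
    """)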


4 REPLIES

Ajay-Pandey
Esteemed Contributor III

Hi @andrew li,

If you are creating an external table, dropping the table will never delete the actual data.

Just try removing the directories and then creating the table again.
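A minimal sketch of that cleanup on Databricks, reusing the placeholder path from the post (dbutils is available in Databricks notebooks):

    // Drop the table first; for an external table this removes only metadata
    spark.sql("DROP TABLE IF EXISTS myTable")

    // Then remove the leftover directory before recreating the table
    // (the second argument enables recursive deletion)
    dbutils.fs.rm("pathToCsv", true)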

I'm recreating the table from the same CSV file in the same location, so I cannot delete the CSV file. Thanks!

pvignesh92
Honored Contributor · Accepted Solution

Hi @andrew li, when you specify a path with the LOCATION keyword, Spark treats the table as an EXTERNAL table, so when you drop the table, the underlying data, if any, is not cleared.

In your case, since this is an external table, your folder structure should already be partitioned, e.g. ../<table_dir>/month=jan. When you then create a table declaring month as the partition column, Spark can scan the storage and detect the partitions.

Please check on your end whether this is what was missed. If it still does not work, screenshots would help in understanding the issue.
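As a hedged sketch of that layout (the table name, columns, and path here are illustrative, not from the post): with the data already laid out in partition folders, the table is created over the directory and the partitions are then registered with the metastore, e.g. via MSCK REPAIR TABLE:

    // Illustrative layout on storage:
    //   <table_dir>/month=jan/data.csv
    //   <table_dir>/month=feb/data.csv
    spark.sql("""
      CREATE TABLE sales (amount STRING, month STRING)
      USING CSV
      OPTIONS ('header' = 'true')
      PARTITIONED BY (month)
      LOCATION '<table_dir>'
    """)

    // Ask Spark to scan the location and register the partitions it finds;
    // without this step a SELECT can come back empty
    spark.sql("MSCK REPAIR TABLE sales")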

So basically, you cannot create a partitioned table from a single CSV file simply by using "create table ... partitioned by (...) location 'pathToCsv'", because the single CSV file does not have the partitioned folder structure at that location?

I understand the LOCATION clause means it's an external table and the real data is stored there. It is confusing at times, because the location here is actually where the CSV file lives, and after you create a table from this CSV file, nothing changes at that location. If you use Scala Spark, read this CSV into a DataFrame, then write it back to the same location, keeping the CSV format and adding a partition, like spark.read.format("csv").load("pathToCsv").write.option("path", "pathToCsv").partitionBy("partitionColumn").mode("overwrite").format("csv").saveAsTable("myTable"), you will get some new files at that location.
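Spelled out, a sketch of that workaround (path, column, and table names are placeholders; a separate output directory is used here because overwriting the very folder being read can be risky with Spark's lazy evaluation):

    // Read the single CSV file into a DataFrame
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("pathToCsv")

    // Write it back as CSV, now laid out in partition folders,
    // and register the result as a table in one step
    df.write
      .format("csv")
      .option("header", "true")
      .option("path", "pathToPartitionedCsv")  // hypothetical output dir
      .partitionBy("partitionColumn")
      .mode("overwrite")
      .saveAsTable("myTable")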
