Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition on a CSV file

andrew0117
Contributor

When I use SQL like "create table myTable (column1 string, column2 string) using csv options('delimiter' = ',', 'header' = 'true') location 'pathToCsv'" to create a table from a single CSV file stored in a folder within an Azure Data Lake container, the table is created successfully with all the data from the CSV file. However, when I drop this table and recreate it with the same statement plus "partitioned by (column1)" before the "location" clause, the table is still created successfully but no data is loaded. Can someone explain why this occurs? Thanks!
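For reference, the two statements look roughly like this in a notebook (pathToCsv stands in for the real ADLS folder path):

```scala
// Placeholder path. The unpartitioned version reads the flat CSV in place
// and returns all rows:
spark.sql("""
  CREATE TABLE myTable (column1 STRING, column2 STRING)
  USING CSV
  OPTIONS ('delimiter' = ',', 'header' = 'true')
  LOCATION 'pathToCsv'
""")

spark.sql("DROP TABLE myTable")

// The partitioned version: the table is created, but no data shows up.
spark.sql("""
  CREATE TABLE myTable (column1 STRING, column2 STRING)
  USING CSV
  OPTIONS ('delimiter' = ',', 'header' = 'true')
  PARTITIONED BY (column1)
  LOCATION 'pathToCsv'
""")
```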


4 REPLIES

Ajay-Pandey
Esteemed Contributor III

Hi @andrew li,

If you create an external table, dropping the table will never delete the actual data.

Try removing the directories and then creating the table again.
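Roughly, in a Databricks notebook (the path below is a hypothetical example):

```scala
// Hypothetical path; dbutils is available in Databricks notebooks.
val tablePath = "/mnt/datalake/myTable"

// DROP on an external table removes only the metadata...
spark.sql("DROP TABLE IF EXISTS myTable")

// ...so the leftover files must be removed by hand (recurse = true),
// after which the table can be created again.
dbutils.fs.rm(tablePath, true)
```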

Ajay Kumar Pandey

I'm recreating the table from the same CSV file in the same location, so I cannot delete the CSV file. Thanks!

pvignesh92
Honored Contributor

Hi @andrew li, when you specify a path with the LOCATION keyword, Spark considers the table to be EXTERNAL. So when you drop the table, the underlying data, if any, will not be cleared.

So in your case, since this is an external table, your folder structure should already be partitioned, e.g. ../<table_dir>/month=jan. Then, when you create a table declaring month as the partition column, Spark can scan the storage and detect the partitions.

Please check on your end whether this is what is missing. If it still does not work, screenshots would help in understanding the issue.
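For example, a sketch assuming a hypothetical month partition column and a hypothetical location; MSCK REPAIR TABLE then registers the partitions Spark finds:

```scala
// Expected layout for an external partitioned table (hypothetical
// 'month' partition column and location):
//   /mnt/datalake/myTable/month=jan/part-...csv
//   /mnt/datalake/myTable/month=feb/part-...csv

spark.sql("""
  CREATE TABLE myTable (column2 STRING, month STRING)
  USING CSV
  OPTIONS ('header' = 'true')
  PARTITIONED BY (month)
  LOCATION '/mnt/datalake/myTable'
""")

// Ask Spark to scan the location and register the partitions it finds.
spark.sql("MSCK REPAIR TABLE myTable")
```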

So basically, you cannot create a partitioned table from a single CSV file simply by using SQL CREATE TABLE ... PARTITIONED BY (...) LOCATION 'pathToCsv', because the single CSV file does not have the partitioned folder structure at that location?

I understand that the LOCATION clause makes it an external table and that the real data is stored there. It is sometimes confusing because the location here is actually where the CSV file lives, and after you create a table from this CSV file, nothing happens at that location. If you use Scala Spark to read this CSV into a DataFrame, then write it back to the same location in CSV format with a partition column, like spark.read.format("csv").load("pathToCsv").write.option("path", "pathToCsv").partitionBy("partitionColumn").mode("overwrite").format("csv").saveAsTable("myTable"), you will have some new files at that location.
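For example, a fuller sketch of that approach (paths and the partitionColumn name are placeholders; as a variation, writing to a separate directory such as pathToPartitioned avoids overwriting the source while Spark is still lazily reading it):

```scala
// Placeholders: pathToCsv is the source file, pathToPartitioned a fresh
// directory for the partitioned copy.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("pathToCsv")

df.write
  .format("csv")
  .option("header", "true")
  .option("path", "pathToPartitioned")
  .partitionBy("partitionColumn")
  .mode("overwrite")
  .saveAsTable("myTable")

// pathToPartitioned now contains Hive-style folders:
//   pathToPartitioned/partitionColumn=<value>/part-...csv
```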
