04-16-2023 08:39 PM
When I use SQL code like "create table myTable (column1 string, column2 string) using csv options('delimiter' = ',', 'header' = 'true') location 'pathToCsv'" to create a table from a single CSV file stored in a folder within an Azure Data Lake container, the table is created successfully with all the data in the CSV file. However, when I drop this table and recreate it using the same code with the addition of "partitioned by (column1)" before the "location" statement, the table is still created successfully but no data is loaded. Can someone explain why this occurs? Thanks!
04-17-2023 04:40 AM
Hi @andrew li, When you specify a path with the LOCATION keyword, Spark treats the table as an EXTERNAL table. So when you dropped the table, the underlying data, if any, was not cleared.
In your case, since this is an external table, the folder structure should already be partitioned, e.g. ../<table_dir>/month=jan. When you then create a table declaring month as the partition column, Spark can scan the storage and detect the partitions.
Please check on your end whether this is the case. If it still does not work, screenshots would help in understanding the issue.
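To make the expected layout concrete, here is a hedged sketch (the table name, column names, and directory paths are illustrative, not from the original post): for an external table partitioned by month, the files under the LOCATION path must already sit in month=... subdirectories, and after creating the table you typically run MSCK REPAIR TABLE so the metastore registers the existing partition directories. This needs a Spark SQL environment, so it is untested here.

```sql
-- Expected directory layout for a table partitioned by `month`
-- (paths are placeholders):
--   .../my_table_dir/month=jan/part-00000.csv
--   .../my_table_dir/month=feb/part-00000.csv

CREATE TABLE myTable (column2 STRING, month STRING)
USING csv
OPTIONS ('delimiter' = ',', 'header' = 'true')
PARTITIONED BY (month)
LOCATION '.../my_table_dir';

-- Register the partition directories that already exist on storage:
MSCK REPAIR TABLE myTable;
```

Note that the partition column (month here) is declared in the column list and then named in PARTITIONED BY; its values come from the directory names rather than from the CSV contents.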
04-17-2023 12:11 AM
Hi @andrew li ,
If you are creating an external table, dropping the table will never delete the actual data.
Try removing the directories and then create the table again.
04-17-2023 05:22 PM
I'm recreating the table from the same CSV file in the same location, so I cannot delete the CSV file. Thanks!
04-17-2023 05:34 PM
So basically, you cannot create a partitioned table from a single CSV file simply by using SQL "create table ... partitioned by (...) location 'pathToCsv'", because the single CSV file does not have the partitioned directory structure at that location.
I understand the LOCATION clause means the table is external and the real data is stored there. It is sometimes confusing because the location here is actually where the CSV file sits, and after you create a table from this CSV file, nothing changes at that location. If you use Scala Spark to read the CSV into a DataFrame and then write it back to the same location, keeping the CSV format and adding the partitioning, e.g. spark.read.format("csv").option("header", "true").load("pathToCsv").write.option("path", "pathToCsv").partitionBy("partitionColumn").mode("overwrite").format("csv").saveAsTable("mytable"), you will get new partitioned files at that location.
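A fuller sketch of that read-then-rewrite workaround (the paths, header option, and column names are assumptions carried over from the question; this requires a running Spark session, so it is untested here):

```scala
// Read the single CSV into a DataFrame, then write it back partitioned.
// CAUTION: overwriting the same path you are reading from can clobber the
// input mid-job; writing to a new path (or caching first) is safer.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("pathToCsv")              // original single-file location (placeholder)

df.write
  .option("path", "pathToCsv")    // same location, now gaining partition dirs
  .partitionBy("column1")         // partition column from the question
  .mode("overwrite")
  .format("csv")
  .option("header", "true")
  .saveAsTable("myTable")         // registers the partitioned external table
```

After this, the location contains column1=<value> subdirectories, which is the structure the SQL CREATE TABLE ... PARTITIONED BY approach expects to find.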