11-27-2023 12:42 PM
Hello folks,
Is there a way, with a SQL query, to get the count from Delta table metadata without doing COUNT(*) on each table? Wondering if this information is stored in any of the INFORMATION_SCHEMA tables.
I have a use case to get counts from thousands of Delta tables and do some further processing based on the count.
It doesn't need to be an exact count; an estimate would be fine too.
11-27-2023 08:11 PM
Hi @JDL , Certainly! When dealing with Delta tables, you can leverage the metadata Delta Lake stores in its transaction log to estimate the row count without explicitly executing COUNT(*) on each table.
Here are a few approaches:
Delta Lake stats: every data file referenced in the Delta transaction log carries per-file statistics, including a numRecords field.
SQL query using Delta Lake stats: those per-file numRecords values can be aggregated to estimate a table's total row count without scanning the data.
Note on accuracy: stats may be missing or truncated for some files, so treat the result as an estimate rather than an exact count.
Remember that Delta Lake stores metadata within the _delta_log folder alongside the data files. The exact location depends on your storage configuration.
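To make that log layout concrete: each commit in _delta_log is a zero-padded, 20-digit JSON file (e.g. 00000000000000000003.json for version 3). A small Python helper, hypothetical and not part of any Databricks API, that maps log filenames back to commit versions:

```python
import re

def commit_version(filename: str):
    """Return the commit version encoded in a _delta_log filename,
    e.g. '00000000000000000003.json' -> 3, or None for checkpoints etc."""
    m = re.fullmatch(r"(\d{20})\.json", filename)
    return int(m.group(1)) if m else None

print(commit_version("0" * 19 + "3.json"))  # 3
print(commit_version("_last_checkpoint"))   # None
```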
Additionally, if you’re using external tables, you’ll need to manually delete the data files when dropping an external table.
Feel free to adapt the approach based on your specific use case! 🚀🔍
11-28-2023 02:15 PM
Thanks @Kaniz for your response. I tried the SQL query, but it gives an error about the path.
I tried different versions of the path.
If I specify the full path of the table (retrieved from Catalog Explorer, starting with abfss://), I get the error below:
[RequestId=28a3048d-1371-4908-96cd-090961fe9356 ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url 'abfss://<>.dfs.core.windows.net/__unitystorage/catalogs/<>/tables/<>' overlaps with managed storage within 'GenerateTemporaryPathCredential' call
If I specify the path without abfss://, I get the error "Path must be absolute".
Thoughts?
11-29-2023 12:19 AM
Hi @JDL ,
When working with paths in Databricks, it’s essential to understand the nuances of different storage locations. Let’s break down the issues you’re encountering:
Managed storage overlap error:
The error INVALID_PARAMETER_VALUE.LOCATION_OVERLAP indicates that the input path URL overlaps with managed storage. Direct path access (for example, via dbutils.fs.ls) is not supported for Unity Catalog managed storage locations.
Using the abfss:// prefix:
When you use abfss://, you're referring to Azure Data Lake Storage Gen2 (ABFS) locations. Make sure abfss:// is followed by the complete path to the desired resource.
Navigating to parent paths:
Try dbutils.fs.ls with a trailing slash at the end of the path:
dbutils.fs.ls("abfss://<storage-account-name>.dfs.core.windows.net/parent-folder/")
This should display the contents of the parent directory.
Remember that IAM permissions are crucial when accessing S3 buckets or ABFS storage. Verify the permissions and ensure that your paths are correctly formatted. If you encounter further issues, feel free to seek additional assistance! 🚀
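To tie the two failure modes together: a path without the abfss:// scheme fails as non-absolute, while a directory listing wants a trailing slash. A hypothetical Python helper (illustrative only, not a dbutils function) capturing both rules:

```python
def normalize_abfss_dir(path: str) -> str:
    """Validate an abfss:// URL and ensure the trailing slash a directory listing expects."""
    if not path.startswith("abfss://"):
        # mirrors the "Path must be absolute" error seen above
        raise ValueError("expected an absolute abfss:// path")
    return path if path.endswith("/") else path + "/"

# example values for container/account are placeholders
print(normalize_abfss_dir("abfss://container@account.dfs.core.windows.net/parent-folder"))
# abfss://container@account.dfs.core.windows.net/parent-folder/
```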
12-01-2023 05:51 AM
Thanks @Kaniz for the detailed information, but I am still not clear on the resolution.
2. I am using a path starting with abfss://, copying the exact path of this table from Catalog Explorer.
3. I am not familiar with whether dbutils.fs.ls can be used in a SQL query. I need this information using SQL only, due to some limitations.
All IAM permissions are in place.
11-30-2023 04:06 PM
12-01-2023 05:53 AM
Thanks @SSundaram for the link. I need this information via SQL query only.
12-03-2023 06:44 AM
Certainly! While the exact count can be obtained with COUNT(*), you can estimate the number of rows in a Delta table without scanning the entire table by leveraging its metadata. Here are a couple of approaches:
Using Delta Lake Metadata:
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions._

val deltaTablePath = "location_of_the_table"

def getRecordCount(deltaTablePath: String): Long = {
  // Load the latest snapshot of the Delta transaction log
  val snapshot = DeltaLog.forTable(spark, new Path(deltaTablePath)).update()
  // Parse each file's JSON stats string into a struct
  val statsSchema = snapshot.statsSchema
  val files = snapshot.allFiles.withColumn("stats", from_json($"stats", statsSchema))
  val dfWithNumRecords = files.select($"path", $"stats.numRecords".as("numRecords"))
  // Sum numRecords across all active files in the snapshot
  dfWithNumRecords.agg(sum($"numRecords")).first().getLong(0)
}

println(getRecordCount(deltaTablePath))
Replace "location_of_the_table" with the actual path to your Delta table.
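For readers without a Spark shell handy, the aggregation the Scala snippet performs can be sketched in plain Python over raw _delta_log JSON lines. This is a simplified model of the log format (add/remove actions carrying a stats string), not the actual Delta implementation:

```python
import json

def estimate_rows(log_lines):
    """Estimate row count by summing numRecords from 'add' actions,
    skipping files that a later 'remove' action tombstoned."""
    added, removed = {}, set()
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            stats = json.loads(action["add"].get("stats") or "{}")
            added[action["add"]["path"]] = stats.get("numRecords", 0)
        elif "remove" in action:
            removed.add(action["remove"]["path"])
    return sum(n for path, n in added.items() if path not in removed)

# Hypothetical commit entries, heavily trimmed for illustration
sample = [
    '{"add": {"path": "part-0.parquet", "stats": "{\\"numRecords\\": 120}"}}',
    '{"add": {"path": "part-1.parquet", "stats": "{\\"numRecords\\": 80}"}}',
    '{"remove": {"path": "part-0.parquet"}}',
]
print(estimate_rows(sample))  # 80
```

Note the caveat this makes visible: files without stats contribute 0, which is one reason the metadata-based figure is an estimate rather than an exact count.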
Using INFORMATION_SCHEMA (for other databases):
In MySQL, you can query the INFORMATION_SCHEMA.TABLES view to get an estimate of the row count (the TABLE_ROWS column is MySQL-specific, and the estimate might not be exact for InnoDB tables):
SELECT SUM(TABLE_ROWS)
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = '{your_db}';
Remember that these methods provide estimates, but they can significantly speed up your processing when dealing with large numbers of tables. Choose the approach that best fits your use case! 🚀