Get number of rows in delta lake table from metadata without count(*)

JDL
New Contributor III

Hello folks,

Is there a way, with a SQL query, to get the count from Delta table metadata without doing count(*) on each table? Wondering if this information is stored in any of the INFORMATION_SCHEMA tables.

I have a use case to get counts from 1000s of Delta tables and do some further processing based on the count.

It doesn't need to be an exact count; an estimate would be fine too.

7 REPLIES

Kaniz
Community Manager

Hi @JDL , Certainly! When dealing with Delta tables, you can leverage the metadata stored within the Delta Lake itself to estimate the row count without explicitly executing COUNT(*) on each table.

 

Here are a few approaches:

 

Delta Lake Stats:

  • Delta Lake maintains statistics (referred to as “stats”) about the data in each table version. These stats include the number of records (numRecords) in each file.
  • You can query the numRecords from the transaction log to get an estimate of the row count for a specific Delta table version.
  • For example, to estimate the total row count across all files in a given Delta table version, you can sum up the numRecords for all files.

SQL Query Using Delta Lake Stats:

  • Suppose you have a Delta table stored at /path/to/my_table.
  • The numRecords values live in the stats field of the add actions in the JSON commit files under _delta_log, so you can sum them with a query along these lines: SELECT SUM(CAST(get_json_object(add.stats, '$.numRecords') AS BIGINT)) AS estimated_row_count FROM json.`/path/to/my_table/_delta_log/*.json` WHERE add IS NOT NULL
  • Replace /path/to/my_table with the actual path to your Delta table. Note this naive sum reads only the JSON commits (not parquet checkpoints) and ignores remove actions, so it can overcount on tables with updates or deletes.

Note on Accuracy:

  • Keep in mind that the numRecords value is an estimate and may not always be exact. It provides a quick way to get an approximate row count without scanning the entire table.
  • If you need precise counts, you’ll still need to use COUNT(*), but leveraging the stats can significantly speed up the process (see the note just below).
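
A side note that may matter here, assuming a reasonably recent runtime: Databricks (and OSS Delta Lake 2.2+) can often answer a bare SELECT COUNT(*) on a Delta table directly from these same file-level stats, without scanning the data, so a plain count may be cheaper than it looks. A minimal sketch; the table name is a placeholder:

    -- Often served from Delta metadata (file-level numRecords) rather than
    -- a full scan on recent runtimes; my_catalog.my_schema.my_table is a
    -- hypothetical name.
    SELECT COUNT(*) FROM my_catalog.my_schema.my_table;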

Remember that Delta Lake stores metadata within the _delta_log folder alongside the data files. The exact location depends on your storage configuration.
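
If you want a quick look at this metadata without opening the log by hand, DESCRIBE DETAIL surfaces some of it (location, numFiles, sizeInBytes), though not a row count. A minimal sketch, with my_table as a placeholder name:

    -- By table name:
    DESCRIBE DETAIL my_table;
    -- Or by path, for path-based tables:
    DESCRIBE DETAIL delta.`/path/to/my_table`;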

 

Additionally, if you’re using external tables, you’ll need to manually delete the data files when dropping an external table.

 

Feel free to adapt the approach based on your specific use case! 🚀🔍

JDL
New Contributor III

Thanks @Kaniz for your response. I tried the SQL query but it gives an error about the path.

I tried different versions of the path.
If I specify the full path of the table, which I retrieve from Catalog Explorer starting with abfss://, I get the error below:

[RequestId=28a3048d-1371-4908-96cd-090961fe9356 ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url 'abfss://<>.dfs.core.windows.net/__unitystorage/catalogs/<>/tables/<>' overlaps with managed storage within 'GenerateTemporaryPathCredential' call

If I specify the path without abfss://, I get the error:
Path must be absolute

Thoughts?

Kaniz
Community Manager

Hi @JDL ,

When working with paths in Databricks, it’s essential to understand the nuances of different storage locations. Let’s break down the issues you’re encountering:

  1. Managed Storage Overlap Error:

    • The error message you received, INVALID_PARAMETER_VALUE.LOCATION_OVERLAP, indicates that the input path URL overlaps with managed storage.
    • In Unity Catalog, schema storage locations are reserved for managed storage. Attempting to list the contents of a managed directory using dbutils.fs.ls is not supported.
    • To avoid this error, ensure you’re not trying to access managed storage locations directly.
  2. Using abfss:// Prefix:

    • When specifying paths with abfss://, you’re referring to Azure Data Lake Storage Gen2 locations via the ABFS driver.
    • If you encounter the error “Path must be absolute,” it suggests that the path you provided is not fully qualified.
    • Ensure your path starts with abfss:// and includes the complete path to the desired resource.
  3. Navigating to Parent Paths:

    • To list the contents of a parent directory without causing conflicts, consider using dbutils.fs.ls with a trailing slash at the end of the path.
    • For example:
      dbutils.fs.ls("abfss://<storage-account-name>.dfs.core.windows.net/parent-folder/")
      
      This should display the contents of the parent directory.

Remember that IAM permissions are crucial for accessing S3 buckets or ABFS storage. Verify the permissions and ensure that your paths are correctly formatted. If you encounter further issues, feel free to seek additional assistance! 🚀
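
For Unity Catalog managed tables specifically, a way to sidestep the path issues entirely is to address the table by its three-level name instead of its storage location. As a hedged sketch (the table name is a placeholder), DESCRIBE HISTORY exposes per-commit operationMetrics, whose numOutputRows entry on write operations can serve as a rough row-count signal for append-only tables:

    -- my_catalog.my_schema.my_table is a hypothetical name
    DESCRIBE HISTORY my_catalog.my_schema.my_table LIMIT 10;
    -- operationMetrics is a map column; for write operations it includes
    -- a numOutputRows entry: the number of rows written by that commit.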

JDL
New Contributor III

Thanks @Kaniz for the detailed information, but I am still not clear on the resolution.

  1. Table type is shown as MANAGED. Does this mean I can't get the count of this table without count(*)?

[screenshot: Catalog Explorer table details showing Type as MANAGED]

2. I am using a path starting with abfss://, copying the exact path of this table from Catalog Explorer.

3. I am not sure dbutils.fs.ls can be used in a SQL query. I need this information using SQL only, due to some limitations.

All IAM permissions are in place.

JDL
New Contributor III

Thanks @SSundaram for the link. I need this information via SQL query only.

Kaniz
Community Manager

Certainly! While the exact count can be obtained with COUNT(*), you can estimate the number of rows in a Delta table without scanning the entire table by leveraging the metadata. Here are a couple of approaches:

  1. Using Delta Lake Metadata:

    • You can extract the total number of records from the Delta table metadata. Here’s an example in Scala using Databricks:
    import com.databricks.sql.transaction.tahoe.DeltaLog
    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.functions._
    import spark.implicits._ // for the $"..." column syntax (pre-imported in notebooks)
    
    val deltaTablePath = "location_of_the_table"
    
    def getRecordCount(deltaTablePath: String): Any = {
        // Load the latest snapshot of the table from its transaction log
        val snapshot = DeltaLog.forTable(spark, new Path(deltaTablePath)).update()
        val statsSchema = snapshot.statsSchema
        // Each AddFile action carries its stats as a JSON string; parse it
        val files = snapshot.allFiles.withColumn("stats", from_json($"stats", statsSchema))
        val dfWithNumRecords = files.select($"path", $"stats.numRecords".as("numRecords"))
        // Sum numRecords across all live files in the snapshot
        dfWithNumRecords.select(sum($"numRecords")).first().get(0)
    }
    
    println(getRecordCount(deltaTablePath))
    

    Replace "location_of_the_table" with the actual path to your Delta table.

  2. Using INFORMATION_SCHEMA (for other databases):

    • In databases like MySQL, you can query the INFORMATION_SCHEMA.TABLES view to get an estimate of the row count. Keep in mind that this estimate might not be exact for InnoDB tables:
    SELECT SUM(TABLE_ROWS)
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = '{your_db}';
    

    For precise counts, you’ll still need to use COUNT(*).

Remember that these methods provide estimates, but they can significantly speed up your processing when dealing with large numbers of tables. Choose the approach that best fits your use case! 🚀
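
If the goal is to drive this across thousands of tables, note that on Databricks the Unity Catalog INFORMATION_SCHEMA does not expose a row-count column, but it can enumerate the tables to process. A sketch assuming Unity Catalog is enabled (catalog name and filters are placeholders):

    -- List candidate tables; UC's INFORMATION_SCHEMA has no row counts.
    SELECT table_catalog, table_schema, table_name
    FROM system.information_schema.tables
    WHERE table_catalog = 'my_catalog'
      AND table_type = 'MANAGED';

Each name can then be fed into DESCRIBE HISTORY or a plain COUNT(*), which, as noted earlier in the thread, Delta can often answer from metadata alone.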
