cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks UUID

David_Dabbs
Databricks Partner

Is there a plan for Databricks to support the UUIDv7 variant that is friendlier to use in databases for keys and partitioning?
https://python.plainenglish.io/python-3-14-brings-uuidv6-v7-v8-stop-using-uuid4-like-its-2015-90518f...

Also, related, Python 3.1.4 is the first release that natively supports the v7 UUID variant. 
Would it be possible to say when one could expect Python 3.1.4 to land in a LTS release?

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions

Ah yes apologies - that was confusing. To implement uuidv7 in DATABRICKS (using Databricks SQL) (without relying on Neon/Postgres), you can leverage Databricks' native support for the uuid() function (v4) and standard SQL to construct a v7-compliant identifier... AKA you can create a SQL UDF to generate them. 

CREATE OR REPLACE FUNCTION generate_uuidv7()
RETURNS STRING
LANGUAGE SQL
AS
  SELECT 
    printf('%012x-%s-%s-%s-%s',
      -- 48-bit timestamp in milliseconds
      CAST(unix_millis(current_timestamp()) AS LONG),
      -- Version 7 and first 12 random bits (hex starts with '7')
      substring(hex(random()), 1, 4),
      -- Variant 1 and next 12 random bits (hex starts with 8, 9, A, or B)
      substring(hex(random()), 5, 4),
      substring(hex(random()), 9, 4),
      substring(hex(random()), 13, 12)
    );

For partitioning - I will mention that using a high-cardinality ID could create the "small file problem". Instead, parition by the date of the time component of the uuidv7. Alternatively, if you are using Delta Lake, use Liquid Clustering as it handles high-cardinality keys much better. 

CREATE TABLE events (
  id STRING,
  event_date DATE GENERATED ALWAYS AS (
    CAST(from_unixtime(conv(substring(id, 1, 12), 16, 10) / 1000) AS DATE)
  )
)
PARTITIONED BY (event_date);

 

View solution in original post

3 REPLIES 3

sarahbhord
Databricks Employee
Databricks Employee

Hey @David_Dabbs - We do support uuidv7 database keys and partitioning in Lakebase PostgreSQL via pg_uuidv7  and Databricks SQL functions. There are some limitations when it comes to clustering and column generation. 

As for Python 3.1.4 - this is an outdated minor release, and modern Databricks runtimes target versions 3.10-3.12 range. Are you inquiring about "3.11.4"? If so, Python 3.11.4 will not land in a Databricks Runtime LTS. Instead, 15.4 LTS already standardizes on Python 3.11.11, and newer LTS lines use Python 3.12.3.

I hope this helps! 

Sarah

 

 

Thank you @sarahbhord.


We do support uuidv7 database keys and partitioning in Lakebase PostgreSQL via pg_uuidv7  and Databricks SQL functions. There are some limitations when it comes to clustering and column generation. 

I'm not following the Neon Databricks SQL connection. We would want to use uuidv7 in Databricks, not Neon.

As for Python 3.1.4 - this is an outdated minor release, and modern Databricks runtimes target versions 3.10-3.12 range. Are you inquiring about "3.11.4"? If so, Python 3.11.4 will not land in a Databricks Runtime LTS. Instead, 15.4 LTS already standardizes on Python 3.11.11, and newer LTS lines use Python 3.12.3.


Yes, 3.1.4 was a typo on my part. UUIDv7 support landed in Python with version 3.14

Ah yes apologies - that was confusing. To implement uuidv7 in DATABRICKS (using Databricks SQL) (without relying on Neon/Postgres), you can leverage Databricks' native support for the uuid() function (v4) and standard SQL to construct a v7-compliant identifier... AKA you can create a SQL UDF to generate them. 

CREATE OR REPLACE FUNCTION generate_uuidv7()
RETURNS STRING
LANGUAGE SQL
AS
  SELECT 
    printf('%012x-%s-%s-%s-%s',
      -- 48-bit timestamp in milliseconds
      CAST(unix_millis(current_timestamp()) AS LONG),
      -- Version 7 and first 12 random bits (hex starts with '7')
      substring(hex(random()), 1, 4),
      -- Variant 1 and next 12 random bits (hex starts with 8, 9, A, or B)
      substring(hex(random()), 5, 4),
      substring(hex(random()), 9, 4),
      substring(hex(random()), 13, 12)
    );

For partitioning - I will mention that using a high-cardinality ID could create the "small file problem". Instead, parition by the date of the time component of the uuidv7. Alternatively, if you are using Delta Lake, use Liquid Clustering as it handles high-cardinality keys much better. 

CREATE TABLE events (
  id STRING,
  event_date DATE GENERATED ALWAYS AS (
    CAST(from_unixtime(conv(substring(id, 1, 12), 16, 10) / 1000) AS DATE)
  )
)
PARTITIONED BY (event_date);