Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data in Unity Catalog that can't be previewed

DavidKxx
Contributor

This is a small deficiency, but a fix would be nice to have.

For a long time now, the Sample Data previewer in the Unity Catalog explorer has been unable to show tables that contain a certain kind of column.  Instead of showing sample rows of the table, it shows:


Error getting sample data
Unexpected token '(', "(8000,[0,2"... is not valid JSON

That little data snippet mentioning "8000" is enough to point me to the specific column causing the trouble, whose type can be described in several different ways:

1)  The Overview tab shows it as a struct that breaks out as:
{"type": "tinyint", "size": "int", "indices": {"items": "int"}, "values": {"items": "double"}}

2)  A SQL DESCRIBE shows the column as having datatype "vector".

3)  The column was created via a UDF whose key operation is: 

from pyspark.ml.linalg import Vectors
 ...
[output] = Vectors.sparse(inputs)

Any chance of getting this fixed so that the table containing this data type can be previewed?

1 ACCEPTED SOLUTION

Ashwin_DSA
Databricks Employee

Hi @DavidKxx,

Thanks for flagging this. You're right: the Sample Data previewer in Catalog Explorer is choking because your column is a Spark ML vector (the column type is pyspark.ml.linalg.VectorUDT, which is how the SparseVector returned by Vectors.sparse(...) is stored). The previewer is trying to JSON-parse the stringified vector, (8000,[0,2,...],[...]), which isn't JSON, and that's why the whole tab fails rather than just the one column. The Overview tab and DESCRIBE surface the same column differently (as a struct and as vector), which is consistent with this being a rendering issue rather than a problem with the data itself.
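To illustrate the failure mode, here is a minimal sketch in plain Python. It is not the previewer's actual code (the error text in the original post looks like it comes from a JavaScript JSON parser), and the sample string and the `try_parse` helper are made up for illustration. It only shows that a string shaped like (size,[indices],[values]) is rejected by a JSON parser, while the same information written as a JSON object is accepted.

```python
import json

# Hypothetical stringified sparse vector, shaped like the snippet in the
# error message: (size,[indices],[values]).
rendered = "(8000,[0,2],[1.0,3.0])"


def try_parse(text):
    """Attempt a JSON parse; return (ok, parsed_value_or_error_message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, str(exc)


# The leading "(" is not a legal start of any JSON value, so this fails.
ok, detail = try_parse(rendered)
print(ok, detail)

# The same three fields expressed as a JSON object parse without trouble.
as_json = '{"size": 8000, "indices": [0, 2], "values": [1.0, 3.0]}'
ok_json, parsed = try_parse(as_json)
print(ok_json, parsed["size"])
```

The field names in the JSON version mirror the struct shown on the Overview tab (size, indices, values).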

While we get this fixed on the product side, here are a couple of workarounds for previewing. Run a quick SELECT in SQL Editor or a notebook with the vector cast to a string, or break it out into a regular array column via vector_to_array(col) (from pyspark.ml.functions); either should render fine. Creating a view that exposes the vector as an array column is a nice way to keep the "quick preview" experience without changing your base table.

I've gone ahead and logged this internally so the Catalog Explorer team can pick it up and prioritise a fix. One quick thing that would help when they triage: could you confirm roughly how large the vectors in your table are (the 8000 in the error suggests that is the dimension), and whether you're seeing this only on sparse vectors or also on dense ones? No need for the actual data, just a rough shape.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

2 REPLIES

DavidKxx
Contributor

Yes, my vector space is commonly of dimension 4000 or 8000.

I don't write any dense vectors to tables, so I can't speak to what happens when previewing that type.

Thanks for taking up the issue!