Has anyone successfully used Petastorm + Databricks Connect + Delta Lake?
The use case is using Delta Lake as a data store for my training tasks, regardless of whether I run them in the Databricks workspace or not.
I'm working in a cloud-hosted JupyterLab environment (on Paperspace) and trying to use Petastorm together with Databricks Connect.
What I'm trying to do:
- Connect to a cluster via databricks-connect
- Read data from Delta Lake using the Databricks Spark cluster
- Use Petastorm to convert the DataFrame into a PyTorch-ready object (rough sketch below)
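For reference, this is roughly what I'm running; the table path and cache directory are placeholders for my actual values:

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# With databricks-connect configured, this session is routed to the remote cluster
spark = SparkSession.builder.getOrCreate()

# Petastorm needs a parent cache dir where it materializes the DataFrame as Parquet
spark.conf.set(
    SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
    "file:///dbfs/tmp/petastorm/cache",  # placeholder path
)

# Read the Delta table through the remote cluster
df = spark.read.format("delta").load("dbfs:/path/to/my_table")  # placeholder path

# Convert to a Petastorm converter, then into a PyTorch DataLoader
converter = make_spark_converter(df)
with converter.make_torch_dataloader(batch_size=64) as loader:
    for batch in loader:
        ...  # training step here
```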
The exact same code works on the same cluster when run in the Databricks notebook environment. But when I run `make_spark_converter()` from my hosted JupyterLab environment, it throws an "Unable to infer schema" error, even though the `.schema` attribute of the DataFrame I pass in shows a valid Spark schema.
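To illustrate exactly where it fails:

```python
df.schema  # prints the expected StructType with all fields, so the schema looks fine

converter = make_spark_converter(df)  # raises "Unable to infer schema" here, but only outside Databricks
```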