Issues with using Databricks-Connect and Petastorm

YSF — Sat, 25 Dec 2021 22:31:03 GMT

Has anyone successfully used Petastorm + Databricks-Connect + Delta Lake?

The use case is being able to use Delta Lake as a data store for my training tasks, regardless of whether I run them in the Databricks workspace.

I'm using a cloud-hosted JupyterLab environment (in Paperspace) and trying to use Petastorm + Databricks Connect.

What I'm trying to do:

1. Connect to the cluster via databricks-connect.
2. Read in data from Delta Lake using a Databricks Spark cluster.
3. Use Petastorm to convert the DataFrame into a PyTorch-ready object.
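
For readers following along, the three steps above might look roughly like the sketch below. It is an illustration rather than the poster's actual code: the helper names `to_file_url` and `build_converter`, the Delta table path, and the cache directory are all hypothetical. It also assumes databricks-connect is already configured so that `SparkSession.builder.getOrCreate()` targets the remote cluster. Petastorm's converter materializes the DataFrame as Parquet under a parent cache directory given as a `file://` URL, so the sketch sets that option explicitly.

```python
def to_file_url(path: str) -> str:
    """Turn an absolute path into a file:// URL (hypothetical helper)."""
    if path.startswith("file://"):
        return path
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got: " + path)
    return "file://" + path


def build_converter(delta_path: str, cache_dir: str):
    """Hypothetical helper covering steps 1-3 from the post."""
    # Imported lazily so this file can be loaded without Spark or Petastorm installed.
    from pyspark.sql import SparkSession
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Step 1: with databricks-connect configured (assumption), this session
    # runs against the remote Databricks cluster.
    spark = SparkSession.builder.getOrCreate()

    # Petastorm writes an intermediate Parquet copy of the DataFrame here.
    spark.conf.set(
        SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
        to_file_url(cache_dir),
    )

    # Step 2: read the Delta table through the cluster.
    df = spark.read.format("delta").load(delta_path)

    # Step 3: convert the DataFrame for use with PyTorch.
    return make_spark_converter(df)


# Example usage (placeholder paths, requires a reachable cluster):
#   converter = build_converter("/mnt/delta/my_table", "/dbfs/tmp/petastorm/cache")
#   with converter.make_torch_dataloader(batch_size=64) as loader:
#       for batch in loader:
#           ...
```

One thing that may be worth checking in a setup like this is whether the cache directory resolves to the same storage on the cluster and on the client machine, since the converter both writes to and reads from it.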

The exact same code, on the same cluster, works in the Databricks notebook environment. But when I run the `make_spark_converter()` function in my hosted JupyterLab environment, it throws an "Unable to infer schema" error, even though the `.schema` attribute of the DataFrame I'm passing in shows a Spark-compatible schema.

Re: Issues with using Databricks-Connect and Petastorm

Hubert-Dudek — Sun, 26 Dec 2021 15:07:25 GMT

I would definitely not use Databricks-Connect in production.

Re: Issues with using Databricks-Connect and Petastorm

YSF — Thu, 30 Dec 2021 04:22:33 GMT

Because it's janky, or why? I don't need it for customer-facing production. It's more for when I'm using my own HPC or local workstation but want to access data from Delta Lake. I figured it was easier than, and preferable to, setting up my own Spark environment locally. I'm paying for Databricks, so I might as well get the benefits of the runtime.

Can you elaborate on your answer?