12-25-2021 02:31 PM
Has anyone successfully used Petastorm + Databricks-Connect + Delta Lake?
The use case is being able to use Delta Lake as a data store regardless of whether or not I want to use the Databricks workspace for my training tasks.
I'm using a cloud-hosted JupyterLab environment (in Paperspace) and trying to use Petastorm + Databricks Connect.
What I'm trying to do:
- Connect to the cluster via databricks-connect
- Read in data from Delta Lake using a Databricks Spark cluster
- Use Petastorm to convert the dataframe into a PyTorch-ready object
The exact same code works on the same cluster when run from the Databricks notebook environment. But when I run `make_spark_converter()` in my hosted JupyterLab environment it throws an "Unable to infer schema" error, even though the `.schema` attribute of the dataframe I pass in shows a valid Spark schema. Roughly what I'm running is sketched below.
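For reference, a minimal sketch of the workflow (the cache directory URL, table path, and batch size are placeholders, not my actual values):

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# With databricks-connect installed and configured, this SparkSession is backed
# by the remote Databricks cluster rather than a local Spark instance.
spark = SparkSession.builder.getOrCreate()

# Petastorm needs a parent cache directory to materialize the dataframe into.
spark.conf.set(
    SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
    "file:///dbfs/tmp/petastorm/cache",  # placeholder cache location
)

# Read the training data from a Delta table on the cluster.
df = spark.read.format("delta").load("/path/to/delta/table")  # placeholder path

# Convert the dataframe to a Petastorm converter; this is the call that fails
# with "Unable to infer schema" when going through databricks-connect.
converter = make_spark_converter(df)

# Wrap the converter as a PyTorch DataLoader for training.
with converter.make_torch_dataloader(batch_size=64) as dataloader:
    for batch in dataloader:
        pass  # training step would go here
```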
Labels:
- Databricks connect
- Petastorm
Accepted Solutions
12-26-2021 07:07 AM
I would definitely not use Databricks-Connect in production.
12-29-2021 08:22 PM
Because it's janky, or for another reason? I don't need it for customer-facing production; it's more for when I'm using my own HPC or local workstation but want to access data from Delta Lake. I figured it was easier than, and preferable to, setting up my own Spark environment locally. I'm paying for Databricks, so I might as well get the benefits of the runtime.
Can you elaborate on your answer?
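For context, this is roughly what the "set up my own Spark environment locally" alternative would look like instead of databricks-connect (a sketch, assuming the open-source delta-spark package; the table path is a placeholder):

```python
# Standalone local Spark session that can read Delta tables directly,
# using delta-spark (pip install delta-spark) instead of databricks-connect.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder
    .appName("local-delta")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Read a Delta table from storage the local machine can reach.
df = spark.read.format("delta").load("/path/to/delta/table")  # placeholder path
```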