<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Issues with using Databricks-Connect and Petastorm in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issues-with-using-databricks-connect-and-petastorm/m-p/32483#M23677</link>
    <description>&lt;P&gt;Because it's janky, or why? I don't need it for customer-facing production. It's more for when I'm using my own HPC or local workstation but still want to access data from Delta Lake. I figured it was easier than, and preferable to, setting up my own Spark environment locally. I'm paying for Databricks, so I might as well get the benefits of the runtime.&lt;/P&gt;&lt;P&gt;Can you elaborate on your answer?&lt;/P&gt;</description>
    <pubDate>Thu, 30 Dec 2021 04:22:33 GMT</pubDate>
    <dc:creator>YSF</dc:creator>
    <dc:date>2021-12-30T04:22:33Z</dc:date>
    <item>
      <title>Issues with using Databricks-Connect and Petastorm</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-using-databricks-connect-and-petastorm/m-p/32481#M23675</link>
      <description>&lt;P&gt;Has anyone successfully used Petastorm + Databricks-Connect + Delta Lake?&lt;/P&gt;&lt;P&gt;The use case is being able to use Delta Lake as a data store for my training tasks, regardless of whether I run them in the Databricks workspace.&lt;/P&gt;&lt;P&gt;I'm using a cloud-hosted JupyterLab environment (in Paperspace) and trying to use Petastorm + Databricks-Connect.&lt;/P&gt;&lt;P&gt;What I'm trying to do:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Connect to the cluster via databricks-connect&lt;/LI&gt;&lt;LI&gt;Read in data from Delta Lake using a Databricks Spark cluster&lt;/LI&gt;&lt;LI&gt;Use Petastorm to convert the DataFrame into a PyTorch-ready object&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The exact same code, on the same cluster, works in the Databricks notebook environment. But when I run the `make_spark_converter()` function in my hosted JupyterLab environment, it throws an "Unable to infer schema" error. This happens even though the `.schema` attribute of the DataFrame I'm giving it shows a Spark-compatible schema.&lt;/P&gt;</description>
      <pubDate>Sat, 25 Dec 2021 22:31:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-using-databricks-connect-and-petastorm/m-p/32481#M23675</guid>
      <dc:creator>YSF</dc:creator>
      <dc:date>2021-12-25T22:31:03Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with using Databricks-Connect and Petastorm</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-using-databricks-connect-and-petastorm/m-p/32482#M23676</link>
      <description>&lt;P&gt;I would definitely not use Databricks-Connect in production.&lt;/P&gt;</description>
      <pubDate>Sun, 26 Dec 2021 15:07:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-using-databricks-connect-and-petastorm/m-p/32482#M23676</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-12-26T15:07:25Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with using Databricks-Connect and Petastorm</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-using-databricks-connect-and-petastorm/m-p/32483#M23677</link>
      <description>&lt;P&gt;Because it's janky, or why? I don't need it for customer-facing production. It's more for when I'm using my own HPC or local workstation but still want to access data from Delta Lake. I figured it was easier than, and preferable to, setting up my own Spark environment locally. I'm paying for Databricks, so I might as well get the benefits of the runtime.&lt;/P&gt;&lt;P&gt;Can you elaborate on your answer?&lt;/P&gt;</description>
      <pubDate>Thu, 30 Dec 2021 04:22:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-using-databricks-connect-and-petastorm/m-p/32483#M23677</guid>
      <dc:creator>YSF</dc:creator>
      <dc:date>2021-12-30T04:22:33Z</dc:date>
    </item>
  </channel>
</rss>

