topic Re: Reading a table from a catalog that is in a different/external workspace in Data Engineering

Reading a table from a catalog that is in a different/external workspace

addy — Mon, 04 Mar 2024 21:33:27 GMT

I am trying to read a table that is hosted on a different workspace. We have been told to establish a connection to said workspace using a table and consume the table.

Code I am using is

from databricks import sql

connection = sql.connect(
server_hostname="adb-123.azuredatabricks.net",
http_path="/sql/1.0/warehouses/hahaha",
access_token="pass"
)

cursor = connection.cursor()

cursor.execute("SELECT * FROM table")

# Fetch all rows into a list
rows = cursor.fetchall()

# Create a PySpark DataFrame from the list of rows
df = spark.createDataFrame(rows, schema=["A", "B", "C"])

# Close the cursor and connection when done
cursor.close()
connection.close()

Is there a better way of doing this? Is there a way to directly run a spark read and get the data into a pyspark dataframe? This method doesn't appear to be that efficient. Please note that we have to go with the token here and directly consume the table.

Re: Reading a table from a catalog that is in a different/external workspace

Allia — Tue, 05 Mar 2024 08:33:41 GMT

Hi Addy

Greetings!

You can also use Delta sharing to share the data across multiple workspaces. Since you want to read tables from another workspace you can use databricks to databricks delta sharing.

https://docs.databricks.com/en/data-sharing/read-data-databricks.html

https://docs.databricks.com/en/data-sharing/index.html

Re: Reading a table from a catalog that is in a different/external workspace

addy — Thu, 07 Mar 2024 16:48:19 GMT

Hi Allia,

Thanks for the helpful links. Unfortunately the client I am working with has a hard requirement for us to use the token. Even my first instinct was to set up a delta share. But unfortunately that is not the case. I am looking for a more efficient way to read the data but by using the token. If you see my code I am getting all the rows first and then creating a dataframe on it. I am looking for a more efficient way to do that, maybe ingest directly into a dataframe.