02-13-2022 08:43 PM
Accepted Solutions
02-13-2022 10:24 PM
You can write your ETL logic in notebooks, run the notebook on a cluster, and write the data to a location where your S3 bucket is mounted.
Next, register that data as a table in the Hive metastore and access the same table from Databricks SQL.
To see the table, go to the Data tab and select your schema/database to view the registered tables.
There are two ways to do this:
Option 1:
df.write.format("delta").option("path", "<s3-path-of-table>").saveAsTable("<table-name>")
Option 2:
%python
df.write.format("delta").save("<s3-path-of-table>")
%sql
CREATE TABLE <table-name>
USING DELTA
LOCATION '<s3-path-of-table>'
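Once the table is registered, you can verify it from a SQL cell (or the Data tab); a minimal check, where the schema and table names are placeholders:
%sql
SHOW TABLES IN <schema-name>;
DESCRIBE TABLE EXTENDED <schema-name>.<table-name>;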
02-14-2022 03:05 AM
@Aman Sehgal so basically you are saying to write the transformed data from Databricks PySpark into ADLS Gen2, and then use Databricks SQL analytics to do what you described below...
%sql
CREATE TABLE <table-name>
USING DELTA
LOCATION '<s3-path-of-table>'
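If the data lands in ADLS Gen2 rather than S3, the same pattern should apply with an abfss:// path; a rough sketch, where the container, storage account, path, and table name are all placeholders:
%python
df.write.format("delta").save("abfss://<container>@<storage-account>.dfs.core.windows.net/<path>")
%sql
CREATE TABLE <table-name>
USING DELTA
LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>'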
02-14-2022 05:11 AM
Right. Databricks is a platform to perform transformations. Ideally you should mount either your S3 bucket or your ADLS Gen2 location in DBFS.
You can then read/write/update/delete your data there. To run SQL analytics from the SQL tab, you'll have to register a table and start a SQL endpoint.
You can also query the data via notebooks by using SQL in a cell. The only difference is that you'll have to spin up a cluster instead of an endpoint.
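For reference, mounting the storage under DBFS is usually a one-time step with dbutils.fs.mount; a minimal sketch assuming an S3 bucket reachable via the cluster's instance profile (bucket, mount, schema, and table names are placeholders):
%python
# Mount the bucket once; afterwards it is visible as a DBFS path
dbutils.fs.mount(source="s3a://<bucket-name>", mount_point="/mnt/<mount-name>")
# Write the transformed data to the mounted location and register it as a table
df.write.format("delta").option("path", "/mnt/<mount-name>/<table-dir>").saveAsTable("<schema-name>.<table-name>")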
02-15-2022 01:09 AM
@Aman Sehgal you are making me confused... we need to spin up a cluster even if we use a SQL endpoint, right?
And can we not use the magic command "%sql" within the same notebook to write the PySpark data to the SQL endpoint as a table?
02-15-2022 05:58 AM
When you're in the Data Engineering tab of the workspace, you need to spin up a cluster. After spinning up the cluster, you can create a notebook and use %sql to write SQL commands and query your table.
When you're in the SQL tab of the workspace, you need to spin up a SQL endpoint. After spinning up the endpoint, go to the Queries tab, where you can write a SQL query against your tables.
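Either way, it's the same registered table; for example, a hypothetical query that works both in a notebook cell (on a cluster) and in the Queries tab (on a SQL endpoint):
%sql
SELECT * FROM <schema-name>.<table-name> LIMIT 10;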
02-18-2022 08:03 AM
@Aman Sehgal Can we write data from the Data Engineering workspace to a SQL endpoint in Databricks?
02-19-2022 03:39 PM
You can write data to a table (e.g. default.my_table) and consume data from the same table using a SQL endpoint.
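As a rough end-to-end sketch (the DataFrame and write mode are illustrative): write from a Data Engineering notebook, then read the same table from a query running on the SQL endpoint.
%python
# in a Data Engineering notebook, on a cluster
df.write.format("delta").mode("overwrite").saveAsTable("default.my_table")
Then, from the Queries tab on a running SQL endpoint:
SELECT * FROM default.my_table;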

