Hi @dZegpi, To access your Google Cloud Platform (GCP) datasets in Databricks Notebooks using R, you can use the sparklyr package. Let’s break down the steps:
- Load the required packages: In a Databricks workspace you don’t need to install them; they are already included in the Databricks Runtime. Run the following code to load SparkR, sparklyr, and dplyr:
library(SparkR)
library(sparklyr)
library(dplyr)
- Connect to the Databricks cluster: Use spark_connect to establish a connection, specifying the connection method as “databricks”:
sc <- spark_connect(method = "databricks")
Note that if you’re working within a Databricks Notebook, a SparkSession is already established, so you don’t need to call SparkR::sparkR.session.
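To confirm the connection works, you can run a trivial query through it. This is a minimal sketch and assumes nothing beyond the sc connection created above:

# List the tables visible to this connection
src_tbls(sc)

# Run a trivial SQL query through sparklyr to verify the session
sparklyr::sdf_sql(sc, "SELECT 1 AS ok")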
- Query GCP tables: To access your GCP datasets, you can use SQL queries to bridge SparkR and sparklyr (a short sketch follows this list). For example:
- Use SparkR::sql to query tables created with sparklyr.
- Use sparklyr::sdf_sql to query tables created with SparkR.
- Remember that dplyr code is translated to SQL in memory before it runs.
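Here is a minimal sketch of the bridge in both directions. The table name trips is a hypothetical placeholder for a table that already exists in your metastore:

# sparklyr -> SparkR: register a sparklyr result as a temp view,
# then query it with SparkR::sql
trips <- dplyr::tbl(sc, "trips")                       # hypothetical table
sparklyr::sdf_register(trips, "trips_view")
SparkR::showDF(SparkR::sql("SELECT COUNT(*) AS n FROM trips_view"))

# SparkR -> sparklyr: create a temp view with SparkR,
# then query it with sparklyr::sdf_sql
SparkR::createOrReplaceTempView(SparkR::sql("SELECT 1 AS one"), "one_view")
sparklyr::sdf_sql(sc, "SELECT * FROM one_view")

# dplyr verbs are translated to SQL before execution; inspect the
# generated SQL with dplyr::show_query
trips %>% dplyr::count() %>% dplyr::show_query()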
- Example: Suppose your BigQuery data has been registered in the workspace metastore as a table named “my-table-name”. Since spark_read_table reads tables from the metastore, you can load it into a Spark DataFrame like this:
# Read the metastore table into a Spark DataFrame
df <- spark_read_table(sc, "my-table-name")
Now, you can work with the df DataFrame in your R notebook.
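If the table lives in BigQuery itself and is not registered in the metastore, you can read it through the Databricks BigQuery connector instead. The sketch below is only an illustration: it assumes your cluster is already configured with GCP credentials for BigQuery access, and the project, dataset, and table names are placeholders:

# Read directly from BigQuery (requires the cluster to be configured
# with GCP credentials); all names below are placeholders
bq_df <- SparkR::read.df(source = "bigquery", table = "my-project.my_dataset.my_table")

# Expose it to sparklyr/dplyr through a temp view, using the same
# SQL bridge described above
SparkR::createOrReplaceTempView(bq_df, "bq_view")
df_bq <- sparklyr::sdf_sql(sc, "SELECT * FROM bq_view")

From here, df_bq behaves like any other sparklyr table reference, so ordinary dplyr verbs apply to it.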