
Load GCP data to Databricks using R

dZegpi
New Contributor II

I'm working with Databricks and Google Cloud in the same project. I want to load specific datasets stored in GCP into an R notebook in Databricks. I can currently see the datasets in BigQuery. The problem is that, using the sparklyr package, I'm not able to see the datasets I have in GCP; instead, I see the names of other datasets I was not aware of.

This is my R code:
library(sparklyr)

# Connect to the Spark session managed by the Databricks cluster
sc <- spark_connect(method = "databricks")

# List the table names visible through this Spark connection
dplyr::src_tbls(sc)
# The tables stored in GCP do not show up here

# Try (and fail) to load one of the tables
spark_read_table(sc, "my-table-name")

How can I access my tables stored in GCP from Databricks notebooks using R? Is sparklyr the right approach?

2 REPLIES

Kaniz
Community Manager

Hi @dZegpi, to access your Google Cloud Platform (GCP) datasets in Databricks notebooks using R, you can use the sparklyr package. Let’s break down the steps:

  1. Load Required Packages: Load the necessary packages in your R notebook. In a Databricks workspace, you don’t need to install them; they are already included in the Databricks Runtime. Run the following code to load SparkR, sparklyr, and dplyr:
    library(SparkR) 
    library(sparklyr) 
    library(dplyr)

  2. Connect to the Databricks Cluster: Use spark_connect to establish a connection to the Databricks cluster, specifying the connection method as “databricks”:
    sc <- spark_connect(method = "databricks")

Note that if you’re working within a Databricks Notebook, a SparkSession is already established, so you don’t need to call SparkR::sparkR.session.

  3. Query GCP Tables: To access your GCP datasets, you can use SQL queries to bridge SparkR and sparklyr. For example (a short sketch follows this list):

  • Use SparkR::sql to query tables created with sparklyr.
  • Use sparklyr::sdf_sql to query tables created with SparkR.
  • Remember that dplyr code gets translated to SQL in memory before execution.
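For instance, here is a minimal sketch of that bridging (the view names events_tbl and events_sdf and the toy data are placeholders for illustration, not names from this thread):

# Register a temp view from the sparklyr side, then query it with SparkR
events_tbl <- sparklyr::sdf_copy_to(sc, data.frame(id = 1:3), name = "events_tbl")
SparkR::sql("SELECT COUNT(*) AS n FROM events_tbl")

# Conversely, register a temp view with SparkR, then query it with sparklyr
SparkR::createOrReplaceTempView(SparkR::createDataFrame(data.frame(id = 1:3)), "events_sdf")
sparklyr::sdf_sql(sc, "SELECT COUNT(*) AS n FROM events_sdf")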

Example: Suppose you want to read a BigQuery table named “my-table-name”. You can do so using the following code:

# Read the BigQuery table into a Spark DataFrame
df <- spark_read_table(sc, "my-table-name")

Now, you can work with the df DataFrame in your R notebook.
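If the read succeeds, df behaves like any other sparklyr table reference, so standard dplyr verbs apply. A small usage sketch (the column names id and value are hypothetical, for illustration only):

library(dplyr)

df %>%
  filter(value > 0) %>%  # executed by Spark, not in local R
  count(id) %>%
  collect()              # bring the (small) result back as a local R data frame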

dZegpi
New Contributor II

Other than library(SparkR), this is the same code I posted. This does not solve my problem.
