Use .R file in data pipeline
a week ago
In a typical R pipeline, we can load shared code with source("abc.R"). However, this does not work in Databricks. When I run
source("./abc.R"), I get the error: Error in file(filename, "r", encoding = encoding) : cannot open the connection
What is the best way to build a pipeline with R and .R files?
Thank you.
Wednesday
Greetings @NW1000 , I did some digging and have some helpful hints to share.
The behavior you are seeing is a bit unintuitive if you’re coming from a local R setup.
Here’s what’s going on.
In Databricks, source("./abc.R") fails because the R process on the cluster isn’t running in the same local directory as your project files. From the cluster’s point of view, ./abc.R simply doesn’t exist. That’s why you see:
Error in file(filename, "r", encoding = encoding) : cannot open the connection
It naturally follows that anything relying on relative paths (like ./abc.R) will break unless that file is explicitly available in the cluster’s filesystem.
So how do you structure an R pipeline in Databricks? There are two patterns that work well.
First option: keep your .R files and call them with source() (bootstrap notebook pattern)
This is the closest match to a traditional setup.
Start by putting your .R files somewhere the cluster can actually see them. That typically means storing them in the Databricks workspace or in a Unity Catalog Volume.
From there, instead of running the script directly, you create a small R notebook that acts as a launcher. Databricks explicitly recommends this pattern, especially for teams migrating from Spark Submit.
At its core, the notebook just parameterizes and calls source():
dbutils.widgets.text("script_path", "", "Path to script")
script_path <- dbutils.widgets.get("script_path")
source(script_path)
When you run this notebook—either interactively or as part of a Job—you pass in a full path to the script (for example, a workspace path or a Volume path).
That’s the key shift: don’t rely on ./abc.R. Always pass an explicit path that the cluster can resolve.
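For instance (these paths are hypothetical; substitute your own workspace folder or Volume):

```r
# Script stored in the workspace (hypothetical path):
source("/Workspace/Users/you@example.com/pipeline/abc.R")

# Script stored in a Unity Catalog Volume (hypothetical path):
source("/Volumes/main/default/scripts/abc.R")
```

Either way, the path is absolute and resolvable from the cluster, so source() can open the file.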
To orchestrate this, use Jobs (Workflows). Each stage in your pipeline can be a notebook task that calls a different .R file via source(). This lets you keep your existing scripts while aligning with how Databricks expects workloads to run.
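As a sketch of what that orchestration can look like, here is a Jobs definition in Databricks Asset Bundle-style YAML. All names and paths are placeholders; both stages reuse the same launcher notebook and only the script_path parameter changes:

```yaml
resources:
  jobs:
    r_pipeline:
      name: r-pipeline
      tasks:
        - task_key: stage_one
          notebook_task:
            notebook_path: /Workspace/pipeline/run_script   # the launcher notebook above
            base_parameters:
              script_path: /Volumes/main/default/scripts/abc.R
        - task_key: stage_two
          depends_on:
            - task_key: stage_one
          notebook_task:
            notebook_path: /Workspace/pipeline/run_script
            base_parameters:
              script_path: /Volumes/main/default/scripts/def.R
```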
Second option (recommended long term): convert .R files into notebooks
If you’re thinking beyond a lift-and-shift, this is the more maintainable direction.
Databricks’ guidance for R is to treat notebooks as the primary unit of execution and orchestration.
You can import existing .R scripts directly into the workspace and Databricks will convert them into notebooks. Alternatively, add this comment as the very first line of your file:
# Databricks notebook source
and import it—Databricks will recognize it as a notebook.
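For example, a minimal helper file prepared for import might look like this (the first-line marker is what Databricks keys on; the function is a hypothetical example):

```r
# Databricks notebook source
# Shared helpers for the pipeline (imported as a notebook).

clean_names <- function(df) {
  names(df) <- tolower(names(df))
  df
}

# COMMAND ----------

# Each "# COMMAND ----------" line becomes a new notebook cell on import.
message("helpers loaded")
```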
Once you’re in that model, you can modularize your code using %run.
For example, put shared functions in a helper notebook, then in your main pipeline notebook:
%run /path/to/helper_notebook
This executes the helper notebook inline and makes its functions available in the caller.
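As a sketch, suppose a hypothetical helper notebook at /pipeline/helpers defines a clean_names() function. Note that %run must be the only content of its cell; the helper's functions are then available in subsequent cells:

```r
# Cell 1 (must contain only the %run magic):
# %run /pipeline/helpers

# Cell 2 -- functions defined by the helper are now in scope:
df <- data.frame(ColA = 1:3, ColB = 4:6)
df <- clean_names(df)   # hypothetical helper that lowercases column names
```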
From there, use Jobs (Lakeflow Workflows) to orchestrate the pipeline, with each stage defined as a notebook task. There isn’t a dedicated “R script” task type—under the hood, notebooks are the abstraction Databricks optimizes for.
Summary
The failure of source("./abc.R") comes down to execution context: the file isn’t present in the cluster’s working directory.
In practice, you have two solid paths:
- Use a lightweight R notebook that calls source(script_path) on scripts stored in the workspace or Volumes, and orchestrate via Jobs
- Or convert your .R files into notebooks and orchestrate those directly, using %run for reuse
Most teams start with the first and evolve toward the second as the pipeline matures.
Cheers, Louis