Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Use .R file in data pipeline

NW1000
New Contributor III

In a general R pipeline, we can load a file with source("abc.R"). However, this does not work in Databricks. When I run:

source("./abc.R"), I get the error: Error in file(filename, "r", encoding = encoding) : cannot open the connection

How best to build a pipeline with R and .R files? 

Thank you.

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @NW1000 , I did some digging and have some helpful hints to share.

The behavior you are seeing is a bit unintuitive if you're coming from a local R setup.

Here's what's going on.

In Databricks, source("./abc.R") fails because the R process on the cluster isn't running in the same local directory as your project files. From the cluster's point of view, ./abc.R simply doesn't exist. That's why you see:

Error in file(filename, "r", encoding = encoding) : cannot open the connection

It naturally follows that anything relying on relative paths (like ./abc.R) will break unless that file is explicitly available in the cluster's filesystem.
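You can confirm this for yourself from an R notebook cell: relative paths resolve against the cluster-side working directory, not your project folder. A quick diagnostic sketch:

```r
# Show the working directory the cluster-side R process resolves
# relative paths against -- it is not your local project folder.
getwd()

# List what is actually visible there; abc.R typically will not appear.
list.files()
```

If abc.R is not in that listing, no relative source() call can find it.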

So how do you structure an R pipeline in Databricks? There are two patterns that work well.

First option: keep your .R files and call them with source() (bootstrap notebook pattern)

This is the closest match to a traditional setup.

Start by putting your .R files somewhere the cluster can actually see them. That typically means storing them in the Databricks workspace or in a Unity Catalog Volume.
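For example (the catalog, schema, volume, and user names below are placeholders, not a prescribed layout), a script stored in a Unity Catalog Volume or in the workspace can be sourced with an absolute path:

```r
# Absolute path into a Unity Catalog Volume -- substitute your own
# catalog/schema/volume names.
source("/Volumes/my_catalog/my_schema/scripts/abc.R")

# Workspace files can also be sourced by their absolute workspace path.
source("/Workspace/Users/your.name@example.com/pipeline/abc.R")
```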

From there, instead of running the script directly, you create a small R notebook that acts as a launcher. Databricks explicitly recommends this pattern, especially for teams migrating from Spark Submit.

At its core, the notebook just parameterizes and calls source():

dbutils.widgets.text("script_path", "", "Path to script")
script_path <- dbutils.widgets.get("script_path")
source(script_path)

When you run this notebook, either interactively or as part of a Job, you pass in a full path to the script (for example, a workspace path or a Volume path).

That's the key shift: don't rely on ./abc.R. Always pass an explicit path that the cluster can resolve.

To orchestrate this, use Jobs (Workflows). Each stage in your pipeline can be a notebook task that calls a different .R file via source(). This lets you keep your existing scripts while aligning with how Databricks expects workloads to run.
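As a sketch of that orchestration (the job, task, notebook, and script names here are hypothetical), a two-stage job definition could reuse one launcher notebook and vary the script_path parameter per task:

```json
{
  "name": "r_pipeline",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {
        "notebook_path": "/Workspace/pipelines/run_r_script",
        "base_parameters": { "script_path": "/Volumes/main/etl/scripts/ingest.R" }
      }
    },
    {
      "task_key": "transform",
      "depends_on": [ { "task_key": "ingest" } ],
      "notebook_task": {
        "notebook_path": "/Workspace/pipelines/run_r_script",
        "base_parameters": { "script_path": "/Volumes/main/etl/scripts/transform.R" }
      }
    }
  ]
}
```

Each base_parameters entry fills the script_path widget in the launcher notebook for that stage.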

Second option (recommended long term): convert .R files into notebooks

If you're thinking beyond a lift-and-shift, this is the more maintainable direction.

Databricks' guidance for R is to treat notebooks as the primary unit of execution and orchestration.

You can import existing .R scripts directly into the workspace and Databricks will convert them into notebooks. Alternatively, add this comment to the top of your file:

# Databricks notebook source

and import it; Databricks will recognize it as a notebook.

Once you're in that model, you can modularize your code using %run.

For example, put shared functions in a helper notebook, then in your main pipeline notebook:

%run /path/to/helper_notebook

This executes the helper notebook inline and makes its functions available in the caller.
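A minimal sketch of that modularization (the notebook path, function name, and raw_df are illustrative, not from a fixed API): the helper notebook defines shared functions, and the pipeline notebook pulls them in with %run.

```r
# --- Helper notebook (/pipeline/helpers -- path is illustrative) ---
# Normalize data frame column names: lowercase, non-alphanumerics to "_".
clean_names <- function(df) {
  names(df) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(df)))
  df
}

# --- Main pipeline notebook ---
# A cell containing "%run /pipeline/helpers" executes the helper inline,
# after which clean_names() is available in this notebook's session.
# (raw_df is assumed to be a data frame produced earlier in the pipeline.)
df <- clean_names(raw_df)
```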

From there, use Jobs (Lakeflow Workflows) to orchestrate the pipeline, with each stage defined as a notebook task. There isn't a dedicated "R script" task type; under the hood, notebooks are the abstraction Databricks optimizes for.

Summary

The failure of source("./abc.R") comes down to execution context: the file isn't present in the cluster's working directory.

In practice, you have two solid paths:

  • Use a lightweight R notebook that calls source(script_path) on scripts stored in workspace/Volumes, and orchestrate via Jobs

  • Or convert your .R files into notebooks and orchestrate those directly, using %run for reuse

Most teams start with the first and evolve toward the second as the pipeline matures.

Cheers, Louis