Re: How does Databricks read data from an offline ...

Khaja_Zaffer · ‎09-02-2025

Hello @yinan

Good day!!

Databricks is a cloud-based platform, so it has no built-in way to read data from a truly air-gapped (completely offline, no network connectivity) Cloudera Distribution for Hadoop (CDH) environment.

In such cases, the data must be exported manually from the CDH cluster (e.g., using Hadoop tools such as hdfs dfs -get to copy files from HDFS to local storage), physically transferred on portable media (e.g., external drives), and then uploaded to cloud storage that Databricks can access, such as AWS S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). For the upload step, a tool such as Azure Data Factory (ADF), an ETL service for moving data from on-premises sources to cloud storage, can help once the data is on a machine with network access.
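The export step above can be sketched in Python as follows. This is only an illustration, not an official procedure: the function names and the example paths are made up, and the checksum manifest is my own suggestion (not a Databricks or Cloudera requirement) for confirming that files survived the physical transfer intact.

```python
import hashlib
import subprocess
from pathlib import Path


def build_export_command(hdfs_path: str, local_dir: str) -> list[str]:
    """Return the argv for copying an HDFS path to local storage."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_dir]


def export_from_hdfs(hdfs_path: str, local_dir: str) -> None:
    """Run the copy. Needs a host with the Hadoop client on its PATH."""
    subprocess.run(build_export_command(hdfs_path, local_dir), check=True)


def sha256_manifest(root: str) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root_path = Path(root)
    manifest = {}
    for file_path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        digest = hashlib.sha256()
        with file_path.open("rb") as handle:
            # Read in 1 MiB chunks so large data files do not fill memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        key = file_path.relative_to(root_path).as_posix()
        manifest[key] = digest.hexdigest()
    return manifest
```

A possible workflow: call export_from_hdfs("/data/events", "/mnt/export") on a CDH edge node (both paths are hypothetical), build a manifest of /mnt/export before the drive leaves the site, build it again after copying the files off the drive, and compare the two dictionaries before uploading.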

Once the data is in cloud storage, Databricks can read it using Spark APIs like spark.read.parquet("s3://path/to/data").

However, if "offline" refers to an on-premises CDH environment that has no public internet access but can have private network connectivity to the cloud (a common enterprise setup), Databricks can read the data directly via network integration.

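This involves configuring secure connectivity between your cloud provider (AWS, Azure, or GCP) and the on-prem network, for example a site-to-site VPN or a dedicated link such as AWS Direct Connect or Azure ExpressRoute, and then using Spark to access HDFS paths on the cluster.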
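To make the two read paths concrete, here is a minimal PySpark-style sketch. Treat it as an illustration under assumptions: the NameNode host name and the HDFS path are hypothetical, 8020 is only a commonly used NameNode RPC port (check your cluster), the helper names are my own, and authentication such as Kerberos is not covered.

```python
def hdfs_uri(namenode_host: str, path: str, port: int = 8020) -> str:
    """Build an hdfs:// URI for an absolute path on the on-prem cluster."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, e.g. /warehouse/sales")
    return f"hdfs://{namenode_host}:{port}{path}"


def load_parquet(spark, uri: str):
    """Read Parquet data from any URI that Spark can reach.

    `spark` is the SparkSession; in a Databricks notebook it is already
    defined, so it is passed in rather than created here.
    """
    return spark.read.parquet(uri)


# Path 1: data already uploaded to cloud storage (path from the post above).
CLOUD_URI = "s3://path/to/data"

# Path 2: direct read over private connectivity (hypothetical host and path).
ONPREM_URI = hdfs_uri("namenode.example.internal", "/warehouse/sales")
```

In a notebook, df = load_parquet(spark, CLOUD_URI) covers the air-gapped case after upload, and df = load_parquet(spark, ONPREM_URI) covers the private-network case, assuming the Databricks cluster can resolve and reach the NameNode and DataNodes.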

I hope this helps answer your question. (I am open to other solutions.)