Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How does Databricks read data from an offline CDH environment?

yinan
New Contributor II
 
5 REPLIES

-werners-
Esteemed Contributor III

What do you mean by offline?
Actually down, or disconnected from the public internet?
Databricks can only access systems it has network access to, which can be arranged using private endpoints or a VPN.

If none of these are possible, Databricks cannot reach it; there is no local Databricks agent/gateway.
But you might have an ETL tool available that has access to the system and can write to cloud storage?

yinan
New Contributor II

1. The network is connected and accessible.
2. I am currently using the free version for debugging and have found that I cannot connect to the offline HDFS. I have not been able to locate where to change the configuration. Is this because the free version does not support this feature?

Khaja_Zaffer
Contributor

Hello @yinan 

Good day!!

Databricks, being a cloud-based platform, does not have direct built-in support for reading data from a truly air-gapped (completely offline, no network connectivity) Cloudera Distribution for Hadoop (CDH) environment.
 
In such cases, data must be manually exported from the CDH cluster (e.g., using Hadoop tools like hdfs dfs -get to copy files from HDFS to local storage), physically transferred via portable media (e.g., external drives), and then uploaded to cloud storage accessible by Databricks, such as AWS S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). ETL tools such as Azure Data Factory (ADF) can also help extract data from on-prem systems into cloud storage when some connectivity exists.
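A minimal sketch of that manual export route, assuming an edge node with the Hadoop and AWS CLIs installed. All paths, the bucket name, and the choice of the AWS CLI are illustrative assumptions, not values from this thread; the function only builds the command lists, so you can review them before running each with `subprocess.run(cmd, check=True)` on the CDH edge node.

```python
def export_commands(hdfs_path, local_dir, s3_uri):
    """Build the two CLI steps of the manual export: HDFS -> local disk -> S3.

    hdfs_path: source directory inside HDFS (placeholder)
    local_dir: staging directory on the edge node or portable media (placeholder)
    s3_uri:    destination bucket/prefix readable by Databricks (placeholder)
    """
    return [
        # Step 1: copy files out of HDFS onto local storage
        ["hdfs", "dfs", "-get", hdfs_path, local_dir],
        # Step 2: upload the staged files to cloud storage
        ["aws", "s3", "cp", local_dir, s3_uri, "--recursive"],
    ]
```

For a truly air-gapped cluster, step 2 would instead happen from a connected machine after the physical transfer.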
 
Once in cloud storage, Databricks can read the data using Spark APIs like spark.read.parquet("s3://path/to/data"). However, if "offline" refers to an on-premises CDH environment without public internet access but with potential private network connectivity to the cloud (a common enterprise setup), Databricks can read data directly via network integration.
 
This involves configuring secure connectivity between your cloud provider (AWS, Azure, or GCP) and the on-prem network, then using Spark to access HDFS paths.
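A hedged sketch of both read paths on a Databricks cluster (where `spark` is the session Databricks provides). The bucket name, NameNode host/port, and directory paths are placeholder assumptions; the HDFS read additionally assumes private connectivity to the on-prem cluster is already in place.

```python
# Placeholder locations -- assumptions, not values from this thread.
CLOUD_PATH = "s3://my-bucket/events/"                # data landed in cloud storage
HDFS_PATH = "hdfs://cdh-namenode:8020/data/events"   # direct read over the private link


def read_events(spark, path):
    """Read Parquet data with the standard DataFrame API.

    Works the same whether `path` points at cloud storage or, with
    network integration configured, directly at the on-prem HDFS.
    """
    return spark.read.parquet(path)
```

On a notebook attached to a cluster you would call `read_events(spark, CLOUD_PATH)` and inspect the result with `.show()`.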
 
I hope this helps answer your question. (I am open to other solutions.)

yinan
New Contributor II

1. Is it that the free version of the SQL editor cannot connect to HDFS?
2. Is it also that the free version of the notebook cannot be used? I created a free version, and it keeps spinning;
3. Does the current free version of Databricks not support linking to an offline cluster's HDFS (assuming the network is all set up)?
4. I saw in the free version that it says custom workspace storage locations are not supported. Does this mean I cannot choose other storage spaces and can only use ADLS?

 

Hello @yinan 
Good day!!
Thank you for your response. 

Here are the answers to your questions:

1. Is it that the free version of the SQL editor cannot connect to HDFS?

The Databricks Free Edition (which replaced the Community Edition) does support using the SQL editor for querying and analyzing data, but it has limitations. The SQL editor relies on a single, small-sized SQL warehouse (limited to a 2X-Small cluster size). While it can access data registered in Unity Catalog or default storage, it cannot directly connect to external on-premises HDFS due to the absence of private networking configurations in the Free Edition. 

2. Is it also that the free version of the notebook cannot be used? I created a free version, and it keeps spinning;

No, the Free Edition does support notebooks, and they can be created and used via the limited all-purpose serverless compute (restricted to small cluster sizes). I have also been seeing the cluster just spin since this morning; it may start working normally after a few hours.

3. Does the current free version of Databricks not support linking to an offline cluster's HDFS (assuming the network is all set up)?

Correct: the current Free Edition does not support direct linking to an on-premises (offline) cluster's HDFS, even if the network is set up on your end. This is because the Free Edition operates in a managed, shared workspace without private networking options (e.g., no VNet peering, ExpressRoute, or VPC configurations). On-prem HDFS access would require secure private connectivity (such as VPN or ExpressRoute) and custom compute setups, which aren't available in the serverless-only Free Edition. However, you can upload/import the data manually into the Free Edition workspace.

 

4. I saw in the free version that it says custom workspace storage locations are not supported. Does this mean I cannot choose other storage spaces and can only use ADLS?

Yes, the message about custom workspace storage locations not being supported means you cannot configure or choose alternative storage for the workspace root (e.g., a custom ADLS container or other cloud storage).

But if you are a student or have a debit/credit card, you can get one month of free access (one year for students) via an Azure subscription. Register on the Azure portal; Azure also provides a 14-day free Databricks trial and roughly $200 of credit to use for one month, which means you can mount Azure storage on Databricks and practice there.

If you find this answer useful, please mark it as the accepted solution for the question.

Thank you. 
