Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks connection to on-prem cloudera

ak5har
New Contributor II

Hello, 

    we are trying to evaluate a Databricks solution to extract data from an existing Cloudera schema hosted on a physical server. We are using the serverless compute provided by the Databricks express setup, and we assume we will not need an AWS storage bucket for this (please correct me if I am wrong), as the data will be hosted internally on Databricks.

Can you please provide the steps or documentation for connecting Databricks to on-prem Cloudera so we can proceed with ETL in our scenario?

 

Also, to read the data from Databricks, we have a dashboard application, already built, which also runs on a physical server. Our assumption is that we just need to use the UC REST APIs to read the data from Databricks into this application. Please evaluate whether our assumption is correct and provide any pointers to achieve this.
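As a rough sketch of how an external dashboard application might pull data out of Databricks over REST: the Databricks SQL Statement Execution API (`/api/2.0/sql/statements`) lets you run SQL against a SQL warehouse and receive rows as JSON. The host, warehouse ID, token, and table name below are placeholders, not values from this thread.

```python
import json
import urllib.request


def build_statement_request(host, warehouse_id, token, query):
    """Build an HTTP request for the Databricks SQL Statement Execution API.

    host, warehouse_id, and token are placeholders; in a real deployment
    they would come from your workspace and a personal access token.
    """
    payload = {
        "warehouse_id": warehouse_id,
        "statement": query,
        "wait_timeout": "30s",  # wait synchronously up to 30s for the result
    }
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (placeholder workspace and table):
req = build_statement_request(
    "dbc-example.cloud.databricks.com",
    "abc123",
    "dapi-XXXX",
    "SELECT * FROM main.sales.orders LIMIT 100",
)
# urllib.request.urlopen(req) would return a JSON document whose
# result.data_array field contains the rows.
```

Whether this is "just" an include depends on how the dashboard was built; a BI tool may have a simpler native connector, as discussed further down the thread.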

Highly appreciate your response.

thanks

7 REPLIES

MariuszK
Contributor III

Hello,

Are you planning to move data from Cloudera to Databricks?
Databricks provides the compute; for the data, you need to store it on S3. I haven't tried it, but I assume it's possible to connect to Cloudera; in that case, you would be using compute from both Cloudera and Databricks.

ak5har
New Contributor II

Thanks for your response. Yes, we are evaluating, and we will make a decision depending on the result. We just need to understand how to configure a connection from Databricks to on-prem Cloudera data so that we can perform ETL into Databricks.
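One common pattern for this (assuming the networking between the Databricks cluster and the on-prem cluster has been sorted out) is to read Cloudera tables over JDBC with Spark. A minimal sketch, where the hostname, database, credentials, and the assumption that the Hive JDBC driver jar is installed on the cluster are all placeholders:

```python
def hive_jdbc_options(host, database, table, user, password, port=10000):
    """Assemble Spark JDBC reader options for HiveServer2 on an on-prem
    Cloudera cluster. 10000 is HiveServer2's default port; all other
    values here are placeholders for your own environment."""
    return {
        "url": f"jdbc:hive2://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.apache.hive.jdbc.HiveDriver",
    }


# On a Databricks cluster that can reach your network (hypothetical names):
# df = (spark.read.format("jdbc")
#         .options(**hive_jdbc_options("cdh-edge.internal", "sales",
#                                      "orders", "etl_user", "***"))
#         .load())
# df.write.saveAsTable("bronze.cloudera_orders")
```

This is only one option; as later replies note, pushing data out of the on-prem cluster is usually easier than pulling it from the cloud side.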

Rjdudley
Honored Contributor

Is the Cloudera to Databricks extraction a one-time migration, or would this be something you do regularly?  That would change the strategy.

> we assume we will not need the aws storage bucket for this (Please correct me if I am wrong)

You still need S3 buckets. With Databricks, you are responsible for all your data storage, which is different from other SaaS platforms. This means, however, that your data stays in your accounts, which InfoSec people tend to like.

> our assumption is that we just need to include the UC REST APIs to read the data from databricks into this application

Maybe, but probably not, depending on how you built it. You may need to recreate it. If you used Power BI or Tableau, you can connect directly using Delta Sharing.
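For a custom application rather than a BI tool, the open-source Delta Sharing client (`pip install delta-sharing`) can read a shared table into pandas. The client addresses tables as `<profile-file>#<share>.<schema>.<table>`; the profile file and share/schema/table names below are placeholders:

```python
def sharing_table_url(profile_path, share, schema, table):
    """Compose the table URL format used by the delta-sharing client:
    <profile-file>#<share>.<schema>.<table>. All arguments are
    placeholders for a real share configuration."""
    return f"{profile_path}#{share}.{schema}.{table}"


# With the open-source client (hypothetical share names):
# import delta_sharing
# url = sharing_table_url("config.share", "my_share", "gold", "daily_kpis")
# df = delta_sharing.load_as_pandas(url)
```

The `config.share` profile file is downloaded from the share provider and carries the endpoint and bearer token, so the application itself needs no Databricks-specific credentials.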

 

You might want to complete the short, free Lakehouse Fundamentals course, which discusses some of what you asked about here: Lakehouse Fundamentals | Databricks

 

 

ak5har
New Contributor II

Thanks for your response. 

> Is the Cloudera to Databricks extraction a one-time migration, or would this be something you do regularly? That would change the strategy.

So we are evaluating the option; it would be one-time to begin with, but it would be interesting to know how the strategy would differ in the case of incremental loads.

In the case of the Databricks express setup, do we still need an AWS S3 bucket for storage?

Link: https://docs.databricks.com/aws/en/getting-started/express-setup

Rjdudley
Honored Contributor

> interesting to know how the strategy would differ in the case of incremental loads.

There are three data "sites" you need to take into account: your on-prem environment, your AWS account, and Databricks' AWS account. Under normal circumstances it is significantly easier to push data from on-prem into the cloud.

If you used HDFS, you don't have a lot of options, and on-prem adds a layer of difficulty. Presumably your InfoSec and networking teams have a lot of protections in place to prevent something from outside your network from connecting and grabbing all your data. It's not impossible; it's just a larger discussion you need to have internally.

For the actual data migration, there are a number of Apache tools with Hadoop integration you can use to build something, and AWS has its versions of some of those, but you'll still need to get the networking figured out. If you want a product recommendation, CData has one that can sync from HDFS to Databricks: HDFS Integrations: Drivers & Connectors for HDFS. I have no experience with this one, but overall their stuff is really good.
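One of those Apache tools is DistCp, Hadoop's distributed copy utility, which can push HDFS paths into an S3 landing bucket over the `s3a://` connector. A sketch of assembling such an invocation (the HDFS path, bucket name, and prefix are placeholders):

```python
def distcp_command(hdfs_path, s3_bucket, prefix, num_mappers=20):
    """Build a hadoop distcp invocation that pushes HDFS data into a
    landing S3 bucket. Paths and bucket names are placeholders. -update
    copies only files that changed, which helps with repeated runs."""
    return [
        "hadoop", "distcp",
        "-update",
        "-m", str(num_mappers),  # number of parallel copy tasks
        hdfs_path,
        f"s3a://{s3_bucket}/{prefix}",
    ]


# Run from an edge node of the on-prem cluster (placeholder paths):
# import subprocess
# subprocess.run(distcp_command("hdfs://namenode:8020/warehouse/sales",
#                               "acme-landing", "cloudera/sales"))
```

Running it from the on-prem side fits the push-to-cloud direction recommended above, though the S3 credentials and any proxy rules still have to be configured on the Hadoop side.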

Adding further complexity is Databricks' serverless compute.  That runs in Databricks' AWS account, not your AWS account.  That's another set of networking permissions.  Databricks classic compute runs in your AWS environment and might be a simpler option depending on what you decide to do.

One question I would ask is, how much of the Cloudera data lake do we want to replicate in the Databricks lakehouse?  If the answer is not much, I would consider loading only the raw layer and processing everything in Databricks.

> So in case of Databricks express set up , do we still need to have AWS S3 bucket for storage. 

Oh yes, and in fact you'll have many S3 buckets.  You may have buckets for landing, and you'll definitely have buckets for data in bronze, silver and gold.  I recommend using one bucket in bronze for each source.  How you organize silver and gold is up to your data governance.

Express setup doesn't build your AWS cloud infrastructure; it only sets up your Databricks account and a metastore. Databricks is a large SaaS platform with a lot of moving pieces; it's not a single-instance database like anything in RDS. Take the time to go through the Lakehouse Fundamentals course.

Rjdudley
Honored Contributor

Also, stay tuned to what happens with the BladeBridge acquisition.  That has connections to Cloudera Impala, and might help your situation.

lorenzo1889
New Contributor II

We are in the same situation. We have a CDH cluster with an IaaS architecture. The data are on HDFS on EC2 disks in AWS, and we want to migrate the data from CDH to Databricks on Azure.

If we federate CDH's Hive metastore with Databricks, we can migrate the data very quickly with incremental queries in Spark SQL on Databricks. What do you think? Is it possible?
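Assuming the federated Hive metastore tables are queryable from Databricks, the incremental part might look like a watermark-filtered `INSERT`. A sketch, where the catalog, table, and column names are all placeholders and the destination Delta table is assumed to already exist:

```python
def incremental_copy_sql(src_table, dst_table, watermark_col, last_value):
    """Sketch of an incremental load from a federated Hive metastore
    table into a Delta table, filtering on a watermark column. All
    names are placeholders; assumes dst_table already exists."""
    return (
        f"INSERT INTO {dst_table} "
        f"SELECT * FROM {src_table} "
        f"WHERE {watermark_col} > '{last_value}'"
    )


# On Databricks (hypothetical catalog/table names):
# spark.sql(incremental_copy_sql("hive_federated.sales.orders",
#                                "main.bronze.orders",
#                                "updated_at", "2024-01-01"))
```

This only works if the source tables carry a reliable modification timestamp or partition key to filter on; without one, you are back to full copies or file-level sync.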