Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

How does the data science workflow change in Databricks if you start with a NoSQL database (specifically a document store) instead of a more traditional RDBMS-type source?

jonathan-dufaul
Valued Contributor

I'm sorry if this is a bad question. The tl;dr is:

  1. Are there any concrete examples of NoSQL data science workflows, specifically in Databricks? If so, what do they look like?
  2. Is it always the case that our end goal is a DataFrame?

For us, we start with a bunch of Parquet files in Azure Blob Storage, construct a Hive metastore on top of those, and from there, in either PySpark or Spark SQL, everything behaves like a traditional RDBMS. I think this counts as SQL, right? Or, if the source is NoSQL, is our goal to turn the data into a SQL format as quickly as possible?
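
To be concrete, our current flow is roughly the sketch below (the storage path and table names are made up, it's just the shape of it):

    # `spark` is the SparkSession a Databricks notebook provides
    df = spark.read.parquet("wasbs://container@account.blob.core.windows.net/raw/events/")
    df.write.mode("overwrite").saveAsTable("analytics.events")  # registered in the Hive metastore
    clean = spark.sql(
        "SELECT user_id, AVG(amount) AS avg_amount FROM analytics.events GROUP BY user_id"
    )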

If I started with data in a document-store format, either sitting in Azure Blob Storage or connected to MongoDB, does anything downstream of reading the raw data change? I'm visualizing the current process as:

[raw data] -> [transform data] -> [clean/standardized data] -> [training/selection/deployment/anything after]

If this picture still holds with a document-store database, is the [clean/standardized] step always a DataFrame, or is a DataFrame just one of the possible inputs to the machine learning process? If it's the latter, how common is a DataFrame as an input compared to other formats? Any concrete example of a workflow with NoSQL would be extremely helpful.
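
My best guess at what the document-store version of [raw data] -> [transform data] looks like, assuming the documents land as JSON in blob storage (the path and field names here are invented; I'd assume a connector straight to MongoDB reads into the same kind of nested DataFrame):

    from pyspark.sql import functions as F

    # documents arrive as nested JSON; Spark infers structs and arrays
    raw = spark.read.json("wasbs://container@account.blob.core.windows.net/raw/orders/")

    # the "transform" step is mostly flattening: dot notation for nested structs,
    # explode() to turn arrays of sub-documents into rows
    flat = raw.select(
        F.col("customer.id").alias("customer_id"),
        F.explode("items").alias("item"),
    )
    clean = flat.select("customer_id", "item.sku", "item.amount")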

For scoring, I'd imagine a document store would be pretty much the ideal input format.
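
Something like this is what I'm picturing on the scoring side (the fields and the model are made up, it's just the shape of the idea):

    import pandas as pd

    # one incoming document, e.g. fetched from the document store by id
    doc = {"customer_id": 123, "avg_amount": 42.0}

    # even here it seems to collapse into a one-row table before scoring
    row = pd.DataFrame([doc])
    # prediction = model.predict(row)  # whatever fitted model is in play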

My background is in statistics, so I've always been handed a clean table as an input, and in my current job my conception has always been "get to a clean table and then do data science on that." I'm just wondering if that's too narrow a view of how the data can flow.

I've searched so many permutations of the keywords and am getting nowhere.

1 REPLY

Nhan_Nguyen
Valued Contributor

Nice sharing, thanks!
