How does the data science workflow change in Databricks if you start with a NoSQL database (specifically a document store) instead of a more traditional RDBMS-type source?

jonathan-dufaul
Valued Contributor

I'm sorry if this is a bad question. The tl;dr is:

  1. Are there any concrete examples of NoSQL data science workflows specifically in Databricks, and if so, what are they?
  2. Is the end goal always a DataFrame?

For us, the data starts as a bunch of Parquet files in Azure Blob Storage. We construct a Hive metastore on top of that, and from there, in either PySpark or Spark SQL, it behaves like a traditional RDBMS. I think this counts as SQL, right? Or, if the source is NoSQL, is the goal to turn the data into a SQL format as quickly as possible?

If I started with data in a document store format, either sitting in Azure Blob Storage or read from MongoDB, does the downstream process after reading the raw data change? I'm visualizing the current process as:

[raw data] -> [transform data] -> [clean/standardized data] -> [training/selection/deployment/anything after]

If this is still relevant with a document store database, does the [clean/standardized data] step always have to be a DataFrame, or is a DataFrame just one of the possible inputs to the machine learning process? If it is only one option, how common is a DataFrame as an input compared with other formats? Any concrete example of a workflow with NoSQL would be extremely helpful.
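To make the [transform data] step concrete for document-shaped data, here is a minimal sketch in plain Python of what I imagine "turning documents into a table" looks like. It is only an illustration under my own assumptions: the sample documents, the field names, and the helper names `flatten` and `to_rows` are all made up, and nested objects are flattened into dotted column names while arrays are left as-is.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-documents, carrying the parent key as a prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            # Scalars and lists are kept unchanged.
            flat[name] = value
    return flat


def to_rows(docs):
    """Turn a list of documents into (columns, rows); missing fields become None."""
    flat_docs = [flatten(d) for d in docs]
    columns = sorted({key for d in flat_docs for key in d})
    rows = [[d.get(col) for col in columns] for d in flat_docs]
    return columns, rows


# Hypothetical documents with different shapes, as a document store allows.
docs = [
    {"id": 1, "customer": {"name": "Ada", "address": {"city": "Oslo"}}, "tags": ["a", "b"]},
    {"id": 2, "customer": {"name": "Bo"}},
]

columns, rows = to_rows(docs)
print(columns)
for row in rows:
    print(row)
```

As far as I understand, the rough PySpark counterpart would be reading the files with `spark.read.json` and selecting nested fields with dot notation, but I don't know whether that is the usual pattern, which is part of what I'm asking.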

For scoring, I'd imagine a document store would be close to the ideal input format.

My background is in statistics, so I've always been handed a clean table as an input. In my current job, my mental model has always been "get to a clean table and then do data science on that." I'm just wondering if that's too narrow a view of how the data can flow.

I've searched so many permutations of the keywords and am getting nowhere.

1 REPLY

Nhan_Nguyen
Valued Contributor

Nice sharing, thanks!
