I'm sorry if this is a bad question. The tl;dr is:
- are there any concrete examples of nosql data science workflows, specifically in databricks, and if so, what are they?
- is it always the case that our end goal is a dataframe?
For us, the data starts as a bunch of parquet files in azure blob storage, then we construct a hive metastore on top of that, and from there, in either pyspark or spark sql, it behaves like a traditional rdbms. I think this counts as sql, right? Or, if there is nosql data, is our goal to turn it into a sql format as quickly as possible?
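To make the current setup concrete, here's roughly what it looks like in pyspark (the storage path, table name, and columns are made up for illustration, not our real schema):

```python
from pyspark.sql import SparkSession

# On Databricks a `spark` session already exists; getOrCreate() just keeps this self-contained.
spark = SparkSession.builder.getOrCreate()

# Made-up blob storage path standing in for our real parquet location.
raw_path = "abfss://data@somestorageaccount.dfs.core.windows.net/events/"

# Read the parquet files and expose them to spark sql -- once the table is
# registered, everything downstream feels like querying a traditional rdbms.
events = spark.read.parquet(raw_path)
events.createOrReplaceTempView("events")

clean = spark.sql("""
    SELECT user_id,
           event_type,
           CAST(event_ts AS DATE) AS event_date
    FROM events
    WHERE event_type IS NOT NULL
""")
```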
If I started with data in a document store format in azure blob storage, or connected to mongo, does the downstream process after reading the raw data change? I'm visualizing the current process as:
[raw data] -> [transform data] -> [clean/standardized data] -> [training/selection/deployment/anything after]
If this is still relevant with a document store database, is the [clean/standardized data] step always a dataframe, or is a dataframe just one of the possible inputs to the machine learning process? If it's just one option, how common is a dataframe as an input compared to other formats? Any concrete example of a workflow with nosql would be extremely helpful (rough sketch of what I imagine below).
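To make the document store version of the question concrete, here's the kind of flattening step I imagine we'd end up writing if the raw data were nested json exports (e.g. dumped from mongo) instead of parquet; the path and field names are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Made-up path to document-style json exports sitting in blob storage.
docs = spark.read.json("abfss://data@somestorageaccount.dfs.core.windows.net/mongo_export/")

# Flatten the nested documents into the same kind of tabular dataframe the
# parquet pipeline produces: one row per embedded order, with scalar columns.
flat = (
    docs
    .select("customer.id", "customer.segment", explode("orders").alias("order"))
    .select(
        col("id").alias("customer_id"),
        col("segment"),
        col("order.total").alias("order_total"),
        col("order.placed_at").alias("placed_at"),
    )
)
```

The reading step obviously changes, but the end state is still a flat dataframe, which is exactly what I'm wondering about: is that flattening basically always where it ends up?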
For scoring, I'd imagine a document store would be close to the ideal input format.
My background is in statistics, so I've always been handed a clean table as an input, and in my current job my conception has always been "get to a clean table and then do data science on that." I'm just wondering if that's too narrow a view of how the data can flow.
I've searched so many permutations of the keywords and am getting nowhere.