03-10-2022 12:58 AM
I am still lost on combining Spark with a deep learning model.
If I have a (2D) time series that I want to use for e.g. an LSTM model, I first convert it to a 3D array and then pass it to the model. This is normally done in memory with NumPy. But what happens when I manage my BIG file with Spark? The solutions I've seen so far all do the work in Spark and then convert the data to a 3D NumPy array at the end, and that puts everything in memory... or am I thinking about this wrong?
A common Spark LSTM solution looks like this:
# create a fake dataset: 100 nodes, 100 days of sensor readings
import random
import numpy as np
from keras import models
from keras import layers

data = []
for node in range(0, 100):
    for day in range(0, 100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])
df = spark.createDataFrame(data, ['Node', 'day', 'Temp', 'hum', 'press'])

# transform the data: one row per day, one pivoted column per node and numeric column
df_trans = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)

# make train/test data
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]

################## we lost the SPARK #############################
# create train/test arrays; collect() pulls everything onto the driver
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(testDF.columns).collect())

# all columns except the last are features
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]

# the last column is the target
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]

# reshape 2D to 3D: (samples, timesteps=1, features)
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))

# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(1, 400)))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# train the model
history = model.fit(xtrain, ytrain, batch_size=10, epochs=100)
My problem with this: if my Spark data has millions of rows and thousands of columns, then the collect() call in the "create train/test arrays" step pulls the whole dataset onto the driver and causes a memory overflow. Am I right?
My question is: can Spark be used to train LSTM models on big data, or is it not possible?
Is there any generator function that can solve this, like the Keras generator functions?
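For example, something along the lines of Petastorm's Spark dataset converter, which materializes the DataFrame as Parquet once and then streams batches into TensorFlow instead of collecting to the driver. A minimal sketch of what I mean, assuming petastorm is installed (the cache path and batch size are placeholders):

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# tell petastorm where to cache the materialized Parquet files (path is a placeholder)
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///dbfs/tmp/petastorm/cache')

# materialize the (pivoted) DataFrame once; no collect() onto the driver
converter = make_spark_converter(df_trans)

# stream batches into TensorFlow/Keras
with converter.make_tf_dataset(batch_size=32) as dataset:
    # dataset yields namedtuples of column batches; map them to
    # (features, label) tensors before calling model.fit(dataset, ...)
    pass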
03-10-2022 05:25 AM
As the above is not a deep learning model, maybe you can save your dataframe (after the pivot) as a table and then experiment with AutoML - it generates nice code (sadly there is no deep learning library there), so afterwards you can experiment with more advanced solutions. A minimal sketch of saving the table is below.
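For example (the table name is a placeholder):

# persist the pivoted DataFrame so AutoML can pick it up as a table
df_trans.write.mode('overwrite').saveAsTable('pivoted_timeseries')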
If you need an easy implementation of some deep learning models, a nice solution to start with is to use the Azure Synapse models inside Databricks: https://docs.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-build-application...
03-10-2022 05:44 AM
Ok. I have created a small and simple example... But I want to use DeepAR and Temporal Fusion Transformer models on a 6,000,000 row x 155 column dataset.
It is still not clear to me: when I want to create data chunks, e.g. to predict the next day's data based on the previous week's data (7 days), how do I do this with Spark?
With a normal pipeline, it looks like I turn the 2D data into 3D using NumPy, or e.g. tensorflow.keras.preprocessing.timeseries_dataset_from_array.
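For the in-memory case, that Keras utility does exactly this windowing; a minimal sketch (the array shape and target column are made up for illustration):

import numpy as np
import tensorflow as tf

# hypothetical pivoted data: 1000 days x 155 features, already a 2D array
values = np.random.rand(1000, 155).astype('float32')

# targets[i] must be the value for the window starting at index i,
# i.e. the day right after the 7-day window values[i:i+7]
targets = values[7:, 0]  # predict the first column one day ahead

ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=values,
    targets=targets,
    sequence_length=7,   # one week of history per sample
    batch_size=32)
# each batch yields inputs of shape (batch, 7, 155) and targets of shape (batch,)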
03-10-2022 05:45 AM
After the pivot, I have a table with 30,000 rows and 16,000 columns. PySpark does not "love" that number of columns...
03-10-2022 05:47 AM
But that number of columns will be too much for any ML model (in my opinion). Maybe you can somehow merge them with some logic (like adding all the numbers, or something else that makes sense).
03-10-2022 06:08 AM
A simple DL model like an image classifier uses more than 65,000 pixel values as input... so it is not too much data for a neural network...
03-10-2022 06:26 AM
Not the amount of data, but the number of columns. There is also an Array type in Spark; maybe it would work better with that, e.g. as sketched below.
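A minimal sketch of packing the pivoted columns into a single array column (column names assumed from the example above):

from pyspark.sql import functions as F

# collapse all pivoted feature columns into one array column
feature_cols = [c for c in df_trans.columns if c != 'day']
df_packed = df_trans.select('day', F.array(*feature_cols).alias('features'))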
03-10-2022 06:12 AM
My problem is how to feed this dataset from the Spark dataframe into the model without reading all the data into memory. The sample code above shows that problem...
08-18-2023 01:41 AM
Hi!
I guess you've already solved this issue (your question was posted more than a year ago), but maybe you would be interested in reading
https://learn.microsoft.com/en-gb/azure/databricks/machine-learning/train-model/dl-best-practices
There are some directions/best practices as well as example notebooks.
Paolo