
Spark with LSTM

imgaboy
New Contributor III

I am still lost on using Spark with deep learning models.

If I have a (2D) time series that I want to use for, e.g., an LSTM model, I first convert it to a 3D array and then pass it to the model. This is normally done in memory with numpy. But what happens when I manage my BIG file with Spark? The solutions I've seen so far all work with Spark and then convert the data to a 3D numpy array at the end, and that puts everything in memory... or am I thinking about this wrong?

A common Spark LSTM solution looks like this:

# create a fake dataset
import random
import numpy as np
from keras import models
from keras import layers
 
 
 
data = []
for node in range(0,100):
    for day in range(0,100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])
        
df = spark.createDataFrame(data,['Node', 'day','Temp','hum','press'])
 
# transform the data
df_trans  = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)
 
# make train/test data
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]
 
 
################## we lost the SPARK #############################
# create train/test array
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(testDF.columns).collect())
 
# drop the target columns
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]
# take the target column
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]
 
# reshape 2D to 3D
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))
 
# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(xtrain.shape[1], xtrain.shape[2])))  # shape must match the reshaped input
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
 
 
# train the model
loss = model.fit(xtrain, ytrain, batch_size=10, epochs=100)

My problem with this is: if my Spark data has millions of rows and thousands of columns, then when the "# create train/test array" step tries to collect and transform the data, it causes a memory overflow. Am I right?

My question is: can SPARK be used to train LSTM models on big data, or is it not possible? 

Is there any generator function that can solve this, like the Keras generator functions?
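
For what it's worth, one workaround in that spirit (a minimal sketch, not from the thread) is to write the pivoted DataFrame to Parquet and feed Keras from the part files one at a time, so the full table never has to be collected on the driver. The path and the feature/target layout below are assumptions:

import glob
import numpy as np
import pandas as pd

# assumption: the pivoted training data was written out first, e.g.
#   trainDF.write.mode("overwrite").parquet("/dbfs/tmp/train_parquet")

def parquet_batches(path, batch_size=32):
    """Yield (x, y) batches, reading one Parquet part file at a time."""
    while True:  # Keras expects a generator that loops indefinitely
        for part in sorted(glob.glob(f"{path}/part-*.parquet")):
            arr = pd.read_parquet(part).to_numpy(dtype="float32")  # only one part in memory
            x, y = arr[:, :-1], arr[:, -1:]
            x = x.reshape((x.shape[0], 1, x.shape[1]))  # 2D -> 3D for the LSTM
            for i in range(0, len(x), batch_size):
                yield x[i:i + batch_size], y[i:i + batch_size]

# usage (steps_per_epoch must be set when fitting on an endless generator):
# model.fit(parquet_batches("/dbfs/tmp/train_parquet"), steps_per_epoch=200, epochs=10)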

8 REPLIES

Hubert-Dudek
Esteemed Contributor III

As the above is not a deep learning model, maybe you can save your dataframe (after the pivot) as a table and then experiment with AutoML - it generates nice code (sadly there is no deep learning library there), so you can then experiment with more advanced solutions.

If you need an easy implementation of some deep learning models, a nice solution to start with is also to use Azure Synapse models inside Databricks https://docs.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-build-application...

imgaboy
New Contributor III

OK, I have created a small and simple example... But I want to use DeepAR and Temporal Fusion Transformer models on a 6,000,000 row x 155 column dataset.

It is still not clear to me: when I want to create data chunks, e.g. predict the next day's data based on the previous week's data (7 days), how do I do this with Spark?

With a normal pipeline, it looks like I'm turning the 2D data into 3D using numpy, or e.g. tensorflow.keras.preprocessing.timeseries_dataset_from_array.
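
For reference, here is a minimal in-memory sketch of that windowing step with timeseries_dataset_from_array, assuming a 2D array of shape (days, features) and a one-day-ahead target on the first feature (the shapes and target choice are assumptions, and this only shows the 2D-to-3D step, not the Spark part):

import numpy as np
import tensorflow as tf

# assumed shapes: 100 days x 155 features
data = np.random.rand(100, 155).astype("float32")
inputs = data[:-7]            # windows are drawn from here
targets = data[7:, 0]         # targets[i] is the day right after the window starting at i

ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=inputs,
    targets=targets,
    sequence_length=7,        # 7-day history per sample
    batch_size=32,
)

for x, y in ds.take(1):
    print(x.shape, y.shape)   # (32, 7, 155) (32,)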

imgaboy
New Contributor III

After the pivot, I have a table with 30,000 rows and 16,000 columns. PySpark does not "love" that many columns...

Hubert-Dudek
Esteemed Contributor III

But that number of columns will be too much for any ML model (in my opinion). Maybe you can merge them somehow with some logic (like summing all the numbers, or something else that makes sense).

imgaboy
New Contributor III

A simple DL model like an image classifier uses more than 65,000 pixel values as input... So it is not that much data for a neural network...

Hubert-Dudek
Esteemed Contributor III

Not the amount of data, but the number of columns. There is also an Array type in Spark; maybe it would work better with that.​
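
A minimal sketch of that idea, packing the pivoted feature columns into a single array column with pyspark.sql.functions.array (the column names and the assumption that 'day' is the only non-feature column are mine):

from pyspark.sql import functions as F

# assumption: df_trans is the pivoted DataFrame and 'day' is the only non-feature column
feature_cols = [c for c in df_trans.columns if c != "day"]

# pack the thousands of pivoted columns into one array column
df_packed = df_trans.select(
    "day",
    F.array(*[F.col(c).cast("float") for c in feature_cols]).alias("features"),
)
df_packed.printSchema()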

imgaboy
New Contributor III

My problem is how to feed this dataset to the model from a Spark data frame without reading all the data into memory. The sample code above shows that problem...
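
One approach often used for exactly this on Databricks is Petastorm's Spark dataset converter, which materializes the DataFrame to Parquet once and then streams batches into TensorFlow. A minimal sketch, assuming Petastorm is installed and the cache path below is writable (both are assumptions on my part):

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# cache location for the materialized Parquet files (path is an assumption)
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

# the converter writes the DataFrame to Parquet; nothing is collected on the driver
converter = make_spark_converter(df_trans)

with converter.make_tf_dataset(batch_size=32) as ds:
    # ds is a tf.data.Dataset of named tuples (one field per column);
    # map it to (features, target) tensors before passing it to model.fit
    print(ds.element_spec)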

__paolo__
New Contributor II

Hi!

I guess you've already solved this issue (your question was posted more than a year ago), but maybe you would be interested in reading

https://learn.microsoft.com/en-gb/azure/databricks/machine-learning/train-model/dl-best-practices

There are some directions/best practices as well as example notebooks.

Paolo
