03-10-2022 12:58 AM
I am still lost on combining Spark with a deep learning model.
If I have a (2D) time series that I want to use for e.g. an LSTM model, I first convert it to a 3D array and then pass it to the model. This is normally done in memory with NumPy. But what happens when I manage my BIG file with Spark? The solutions I've seen so far all do the work in Spark and then convert the data to a 3D NumPy array at the end, and that puts everything in memory... or am I thinking about this wrong?
A common Spark LSTM solution looks like this:
# create a fake dataset: 100 nodes, 100 days of sensor readings
import random
import numpy as np
from keras import models
from keras import layers

data = []
for node in range(0, 100):
    for day in range(0, 100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])
df = spark.createDataFrame(data, ['Node', 'day', 'Temp', 'hum', 'press'])

# transform the data: one row per day, one pivoted column per node and numeric column
df_trans = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)

# make train/test data
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]

################## we lost the SPARK #############################
# create train/test arrays; collect() pulls everything onto the driver
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(testDF.columns).collect())

# all columns except the last are features
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]

# the last column is the target
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]

# reshape 2D to 3D: (samples, timesteps=1, features)
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))

# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(1, 400)))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# train the model
history = model.fit(xtrain, ytrain, batch_size=10, epochs=100)
My problem with this: if my Spark data has millions of rows and thousands of columns, then the collect() call in the "create train/test arrays" step pulls the whole dataset onto the driver and causes a memory overflow. Am I right?
My question is: can Spark be used to train LSTM models on big data, or is it not possible?
Is there any generator function that can solve this, like the Keras generator functions?
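For example, something along the lines of Petastorm's Spark dataset converter, which materializes the DataFrame as Parquet once and then streams batches into TensorFlow instead of collecting to the driver. A minimal sketch of what I mean, assuming petastorm is installed (the cache path and batch size are placeholders):

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# tell petastorm where to cache the materialized Parquet files (path is a placeholder)
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///dbfs/tmp/petastorm/cache')

# materialize the (pivoted) DataFrame once; no collect() onto the driver
converter = make_spark_converter(df_trans)

# stream batches into TensorFlow/Keras
with converter.make_tf_dataset(batch_size=32) as dataset:
    # dataset yields namedtuples of column batches; map them to
    # (features, label) tensors before calling model.fit(dataset, ...)
    pass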
03-10-2022 05:25 AM
As the above is not a deep learning model, maybe you can save your dataframe (after the pivot) as a table and then experiment with AutoML - it generates nice code (sadly there is no deep learning library there), so afterwards you can experiment with more advanced solutions. A minimal sketch of saving the table is below.
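For example (the table name is a placeholder):

# persist the pivoted DataFrame so AutoML can pick it up as a table
df_trans.write.mode('overwrite').saveAsTable('pivoted_timeseries')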
If you need an easy implementation of some deep learning models, a nice solution to start with is to use the Azure Synapse models inside Databricks: https://docs.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-build-application...
03-10-2022 05:44 AM
Ok. I have created a small and simple example... But I want to use DeepAR and Temporal Fusion Transformer models on a 6,000,000 row x 155 column dataset.
It is still not clear to me: when I want to create data chunks, e.g. to predict the next day's data based on the previous week's data (7 days), how do I do this with Spark?
With a normal pipeline, it looks like I turn the 2D data into 3D using NumPy, or e.g. tensorflow.keras.preprocessing.timeseries_dataset_from_array.
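For the in-memory case, that Keras utility does exactly this windowing; a minimal sketch (the array shape and target column are made up for illustration):

import numpy as np
import tensorflow as tf

# hypothetical pivoted data: 1000 days x 155 features, already a 2D array
values = np.random.rand(1000, 155).astype('float32')

# targets[i] must be the value for the window starting at index i,
# i.e. the day right after the 7-day window values[i:i+7]
targets = values[7:, 0]  # predict the first column one day ahead

ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=values,
    targets=targets,
    sequence_length=7,   # one week of history per sample
    batch_size=32)
# each batch yields inputs of shape (batch, 7, 155) and targets of shape (batch,)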
03-10-2022 05:45 AM
After the pivot, I have a table with 30,000 rows and 16,000 columns. PySpark does not "love" that number of columns...
03-10-2022 05:47 AM
But that number of columns will be too much for any ML model (in my opinion). Maybe you can somehow merge them with some logic (like adding all the numbers, or something else that makes sense).
03-10-2022 06:08 AM
A simple DL model like an image classifier uses more than 65,000 pixel values as input... so it is not too much data for a neural network...
03-10-2022 06:26 AM
Not the amount of data, but the number of columns. There is also an Array type in Spark; maybe it would work better with that, e.g. as sketched below.
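A minimal sketch of packing the pivoted columns into a single array column (column names assumed from the example above):

from pyspark.sql import functions as F

# collapse all pivoted feature columns into one array column
feature_cols = [c for c in df_trans.columns if c != 'day']
df_packed = df_trans.select('day', F.array(*feature_cols).alias('features'))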
03-10-2022 06:12 AM
My problem is how to feed this dataset from the Spark dataframe into the model without reading all the data into memory. The sample code above shows that problem...
08-18-2023 01:41 AM
Hi!
I guess you've already solved this issue (your question was posted more than a year ago), but maybe you would be interested in reading
https://learn.microsoft.com/en-gb/azure/databricks/machine-learning/train-model/dl-best-practices
There are some directions/best practices as well as example notebooks.
Paolo