Hey guys,
I'm trying to train a deep learning model on Databricks ML with NumPy arrays as input.
For now I've organized all the data in a Spark DataFrame with 4 columns: col1, col2, col3, col4.
col1 and col2 hold arrays with shape (1,3,3,3,3), col3 holds arrays with shape (1,3,3,3), and col4 is a float.
As you know, a PySpark DataFrame can't store NumPy arrays as values, so I tried three approaches: (1) saving the arrays as binary data; (2) saving them as lists and, on load, converting each list back to a NumPy array and reshaping it; (3) converting each batch of the Spark DataFrame to a pandas DataFrame and using np.stack on each of its columns. The third approach gave the fastest results.
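For reference, this is roughly the binary round trip from the first approach (just a sketch; float32 is what I use, swap in whatever dtype you store):

```python
import numpy as np

# One cell's worth of data (dtype here is my choice of float32)
arr = np.random.rand(1, 3, 3, 3, 3).astype(np.float32)

# Save side: raw bytes, which Spark can hold in a BinaryType column
blob = arr.tobytes()

# Load side: rebuild the array and restore the original shape
restored = np.frombuffer(blob, dtype=np.float32).reshape(1, 3, 3, 3, 3)
assert np.array_equal(arr, restored)
```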
Once I have the DataFrame where each row represents one set of arrays, I want to build batches of size 24, which means I end up with 4 NumPy arrays per batch: shape (24,3,3,3,3) for col1 and col2, shape (24,3,3,3) for col3, and a 1D array of 24 floats for col4 (each array stacks 24 rows). A sketch of the stacking is below.
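This is roughly what I do per batch with the third approach (a sketch; `stack_batch` and `pdf` are names I made up for illustration):

```python
import numpy as np

def stack_batch(pdf):
    # pdf is a 24-row pandas DataFrame converted from the Spark batch;
    # each cell of col1..col3 holds one example's values (nested lists
    # or a small array), and col4 holds a float per row
    x1 = np.stack(pdf["col1"].to_numpy()).reshape(24, 3, 3, 3, 3)
    x2 = np.stack(pdf["col2"].to_numpy()).reshape(24, 3, 3, 3, 3)
    x3 = np.stack(pdf["col3"].to_numpy()).reshape(24, 3, 3, 3)
    y = pdf["col4"].to_numpy(dtype=np.float32)  # shape (24,)
    return x1, x2, x3, y
```

The reshape collapses the leading 1 from each stored (1,...) array, so the same helper works whether the cells hold flat lists or the original nested shapes.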
When I tried to collect a batch of 24 rows, col2 took a lot of time, roughly 10x longer than col1, while collecting the list representation of the arrays was faster than collecting the arrays themselves.
So I have a few questions.
First, does anyone have a good idea on how to store all this data without paying a large time cost when the model consumes it (e.g. the collect of the arrays, and reshaping each list to the desired size)?
And second, does anyone have a better way to do what I'm trying to achieve?
I don't mind paying a lot at preprocessing time, but I want the training to be quick and to spend minimal time on data preparation. (I've seen examples with a single image as input, but not with 4D and 5D NumPy arrays.)
Hope you can help me.
Thanks a lot!