Data Engineering

Train deep learning model with numpy arrays.

Orianh
Valued Contributor II

Hey guys,

I'm trying to train a deep learning model on Databricks ML with NumPy arrays as input.

For now I have organized all the data in a DataFrame (df) with 4 columns: col1, col2, col3, col4.

col1 and col2 hold arrays with shape (1,3,3,3,3), col3 holds arrays with shape (1,3,3,3), and col4 is a float.

As you know, a PySpark DataFrame can't store NumPy arrays as values, so I tried three approaches. The first is to save the arrays as binary data. The second is to save them as lists and, when loading the data, convert them back to NumPy arrays and reshape them. The third is to convert the batch from a Spark DataFrame to a pandas DataFrame and use np.stack on each column, which gave the fastest results.
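For reference, the binary approach can be sketched in plain NumPy as below. This is only an illustration, not the code used here: the helper names `array_to_bytes` and `bytes_to_array` are hypothetical, and it assumes every array is float32 and that the shape is known when reading the bytes back.

```python
import numpy as np

# Assumption: all arrays are float32; the original post does not state the dtype.
DTYPE = np.float32

def array_to_bytes(arr: np.ndarray) -> bytes:
    # Force a fixed dtype and C-contiguous layout so the raw bytes
    # can be decoded later without storing extra metadata.
    return np.ascontiguousarray(arr, dtype=DTYPE).tobytes()

def bytes_to_array(buf: bytes, shape: tuple) -> np.ndarray:
    # np.frombuffer builds a read-only view over the bytes (no copy),
    # then reshape restores the original dimensions.
    return np.frombuffer(buf, dtype=DTYPE).reshape(shape)

original = np.random.rand(1, 3, 3, 3, 3).astype(DTYPE)
encoded = array_to_bytes(original)
restored = bytes_to_array(encoded, original.shape)
```

The bytes value could be stored in a binary column; decoding needs only the dtype and the shape, both of which are fixed per column in this setup.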

Once I have the DataFrame, where each row represents one set of arrays, I want to build batches of size 24. That gives 4 NumPy arrays per batch: shape (24,3,3,3,3) for col1 and col2, shape (24,3,3,3) for col3, and a 1D array of 24 floats for col4 (each array combines 24 rows).
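To make the target shapes concrete, here is a minimal NumPy sketch of building one batch from 24 rows. The function name `make_batch` and the random demo data are made up for illustration, and float32 is an assumption. Because each row already has a leading dimension of 1, the sketch joins rows with `np.concatenate` along axis 0; `np.stack` on the same inputs would add an extra axis and give (24,1,3,3,3,3).

```python
import numpy as np

BATCH_SIZE = 24

def make_batch(col1_rows, col2_rows, col3_rows, col4_rows):
    # Each row array has a leading axis of length 1, so concatenating
    # along axis 0 yields (24, ...) directly.
    x1 = np.concatenate(col1_rows, axis=0)
    x2 = np.concatenate(col2_rows, axis=0)
    x3 = np.concatenate(col3_rows, axis=0)
    # col4 holds one float per row, so it becomes a 1D array.
    y = np.asarray(col4_rows, dtype=np.float32)
    return x1, x2, x3, y

# Hypothetical demo data standing in for 24 DataFrame rows.
rng = np.random.default_rng(0)
col1_rows = [rng.random((1, 3, 3, 3, 3), dtype=np.float32) for _ in range(BATCH_SIZE)]
col2_rows = [rng.random((1, 3, 3, 3, 3), dtype=np.float32) for _ in range(BATCH_SIZE)]
col3_rows = [rng.random((1, 3, 3, 3), dtype=np.float32) for _ in range(BATCH_SIZE)]
col4_rows = [float(v) for v in rng.random(BATCH_SIZE)]

x1, x2, x3, y = make_batch(col1_rows, col2_rows, col3_rows, col4_rows)
```

The same function should accept pandas columns converted with `.to_numpy()`, since it only needs a sequence of arrays per column.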

When I tried to collect a batch of 24 arrays for col2, it took a lot of time, about 10x longer than for col1, and with the lists that represent the arrays the collect was faster.

So I have a few questions.

First, does anyone have a good idea for how to store all this data so that I don't pay a lot of time when the model consumes it (e.g. collecting the arrays and reshaping each list to the wanted shape)?

Second, does anyone have a better way to do what I'm trying to achieve?

I don't mind paying a lot in preprocessing, but I want training to be quick and to spend minimal time on data preparation. (I saw examples with a single image as input, but not with 4D and 5D NumPy arrays.)

Hope you can help me.

Thanks a lot!

5 REPLIES

Kaniz
Community Manager

Hi @Orianh! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.

Orianh
Valued Contributor II

Hey, I haven't made any progress. Hope you can help me 😀

Hubert-Dudek
Esteemed Contributor III

Maybe you could share some of your code. It would be easier to answer, and we could also learn about deep learning on Databricks from your code.

Orianh
Valued Contributor II

At the moment I'm just trying to preprocess my data quickly and efficiently, so there isn't any deep learning code yet.

I couldn't find out whether or how I can train my model with a DataFrame as input, since DataFrames don't accept NumPy arrays as a data type. (There are Databricks examples for image DataFrames.)

I read the npz files from an S3 bucket as binary, then use a UDF that calls np.load on the binary content and splits the data into rows.
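The np.load step can be sketched without Spark as shown below. This is a rough illustration under assumptions, not the actual UDF: the function name `npz_bytes_to_rows` is hypothetical, the archive keys `col1` to `col4` are assumed names, and each archive is assumed to hold arrays whose first axis indexes samples.

```python
import io
import numpy as np

def npz_bytes_to_rows(content: bytes):
    # np.load accepts a file-like object, so the raw bytes are wrapped in BytesIO.
    with np.load(io.BytesIO(content)) as archive:
        # Assumed key names; the real archive keys are not given in the post.
        col1, col2, col3, col4 = (archive[k] for k in ("col1", "col2", "col3", "col4"))
    n = len(col4)
    # Slicing with i:i+1 keeps the leading axis, giving rows shaped (1, ...).
    return [
        (col1[i:i + 1], col2[i:i + 1], col3[i:i + 1], float(col4[i]))
        for i in range(n)
    ]

# Hypothetical in-memory npz standing in for a file read from S3.
buf = io.BytesIO()
np.savez(
    buf,
    col1=np.zeros((5, 3, 3, 3, 3), np.float32),
    col2=np.ones((5, 3, 3, 3, 3), np.float32),
    col3=np.zeros((5, 3, 3, 3), np.float32),
    col4=np.arange(5, dtype=np.float32),
)
rows = npz_bytes_to_rows(buf.getvalue())
```

Inside a UDF, each tuple would still have to be converted to a Spark-compatible type (bytes or lists) before being returned.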

When I try to get the NumPy arrays back from the df (where they are now saved as lists), I need to use np.stack and pd.tolist, so it takes some time.

I'm trying to get the data in less than 1 second, for quick training and minimal I/O waste.

Kaniz
Community Manager