Hey guys,
I'm trying to train a deep learning model on Databricks ML with NumPy arrays as input.
For now I've organized all the data in a Spark DataFrame with 4 columns: col1, col2, col3, col4.
col1 and col2 hold arrays with shape (1,3,3,3,3), col3 holds arrays with shape (1,3,3,3), and col4 is a float.
As you know, a PySpark DataFrame can't store NumPy arrays as values, so I tried three approaches: (1) saving the arrays as binary data; (2) saving them as lists and, on load, converting each list back to a NumPy array and reshaping it; (3) converting each batch of the Spark DataFrame to a pandas DataFrame and using np.stack on each of its columns. The third approach gave the fastest results.
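For reference, this is roughly the binary round trip from the first approach (just a sketch; float32 is what I use, swap in whatever dtype you store):

```python
import numpy as np

# One cell's worth of data (dtype here is my choice of float32)
arr = np.random.rand(1, 3, 3, 3, 3).astype(np.float32)

# Save side: raw bytes, which Spark can hold in a BinaryType column
blob = arr.tobytes()

# Load side: rebuild the array and restore the original shape
restored = np.frombuffer(blob, dtype=np.float32).reshape(1, 3, 3, 3, 3)
assert np.array_equal(arr, restored)
```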
Once I have the DataFrame where each row represents one set of arrays, I want to build batches of size 24, which means I end up with 4 NumPy arrays per batch: shape (24,3,3,3,3) for col1 and col2, shape (24,3,3,3) for col3, and a 1D array of 24 floats for col4 (each array stacks 24 rows). A sketch of the stacking is below.
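This is roughly what I do per batch with the third approach (a sketch; `stack_batch` and `pdf` are names I made up for illustration):

```python
import numpy as np

def stack_batch(pdf):
    # pdf is a 24-row pandas DataFrame converted from the Spark batch;
    # each cell of col1..col3 holds one example's values (nested lists
    # or a small array), and col4 holds a float per row
    x1 = np.stack(pdf["col1"].to_numpy()).reshape(24, 3, 3, 3, 3)
    x2 = np.stack(pdf["col2"].to_numpy()).reshape(24, 3, 3, 3, 3)
    x3 = np.stack(pdf["col3"].to_numpy()).reshape(24, 3, 3, 3)
    y = pdf["col4"].to_numpy(dtype=np.float32)  # shape (24,)
    return x1, x2, x3, y
```

The reshape collapses the leading 1 from each stored (1,...) array, so the same helper works whether the cells hold flat lists or the original nested shapes.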
When I tried to collect a batch of 24 rows, col2 took a lot of time, roughly 10x longer than col1, while collecting the list representation of the arrays was faster than collecting the arrays themselves.
So I have a few questions.
First, does anyone have a good idea on how to store all this data without paying a large time cost when the model consumes it (e.g. the collect of the arrays, and reshaping each list to the desired size)?
And second, does anyone have a better way to do what I'm trying to achieve?
I don't mind paying a lot at preprocessing time, but I want the training to be quick and to spend minimal time on data preparation. (I've seen examples with a single image as input, but not with 4D and 5D NumPy arrays.)
Hope you can help me.
Thanks a lot!