TF SummaryWriter flush() doesn't send any buffered data to storage.

Orianh
Valued Contributor II

Hey guys,

I'm training a TF model in Databricks and logging to TensorBoard using SummaryWriter.

At the end of each epoch, SummaryWriter.flush() is called, which should send any buffered data to storage.

But I can't see the TensorBoard files while the model is still training. Does anyone know why I can't view the logs while the model is training?

I created the SummaryWriter using tf.summary.create_file_writer('/dbfs/FileStore/..')

Is it possible to view the data while the model is training? Or is there a problem with logging the data straight to a DBFS path, and should I log it to the cluster's local disk and then copy it to DBFS?

Hope someone can make this clearer for me,

Thanks!

2 REPLIES

Anonymous
Not applicable

@orian hindi: Here is a framework to test; please see if it works out for you!

  1. Make sure the log directory passed to SummaryWriter is valid and accessible. In Databricks, use a Databricks File System (DBFS) path rather than a plain local file system path; a path like /dbfs/<your-directory>/ logs your TensorBoard data to DBFS.
  2. Check the permissions of the directory you are logging to. Make sure the user running the training job has write permission to it.
  3. Ensure that you call SummaryWriter.flush() after each epoch to write the buffered data to disk (see the sketch after this list).
  4. Make sure you use the correct port number when launching TensorBoard. In Databricks, the default TensorBoard port is 6006.
  5. Verify that you point TensorBoard at the correct log directory when launching it. You can use a command like tensorboard --logdir=/dbfs/<your-directory>/ to launch TensorBoard and view your logs.
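
A minimal sketch of items 1 and 3, assuming a TF2 runtime; the directory name, epoch count, and loss values are placeholders, not from the original post:

import tensorflow as tf

# Hypothetical DBFS log directory (item 1); adjust to your workspace.
log_dir = "/dbfs/FileStore/tb_logs/run1"
writer = tf.summary.create_file_writer(log_dir)

for epoch in range(5):
    loss = 1.0 / (epoch + 1)  # stand-in for your real epoch loss
    with writer.as_default():
        tf.summary.scalar("loss", loss, step=epoch)
    writer.flush()  # item 3: push buffered events to storage after each epoch

Once events land in that directory, point TensorBoard at the same path as in item 5.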

If you are still having trouble, you may want to try logging your TensorBoard data to the local file system of the cluster, and then copying it to DBFS after the training job has completed. This can help ensure that the logs are being written correctly, and can also make it easier to organize and analyze the data.
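
A minimal sketch of that local-then-copy route, assuming you are in a Databricks notebook where dbutils is predefined; all paths here are illustrative:

import tensorflow as tf

# Write event files to the driver's local disk first.
local_dir = "/tmp/tb_logs/run1"
writer = tf.summary.create_file_writer(local_dir)

for epoch in range(5):
    loss = 1.0 / (epoch + 1)  # stand-in for your real epoch loss
    with writer.as_default():
        tf.summary.scalar("loss", loss, step=epoch)
    writer.flush()

writer.close()
# Copy the finished logs to DBFS (dbutils is available in Databricks notebooks).
dbutils.fs.cp("file:/tmp/tb_logs/run1", "dbfs:/FileStore/tb_logs/run1", recurse=True)

After the copy completes, you can launch TensorBoard against /dbfs/FileStore/tb_logs/run1.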

Anonymous
Not applicable

Hi @orian hindi

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!
