spaCy retraining failure

AndersenHuang
New Contributor

Hello,

I'm having problems trying to run my retraining notebook for a spaCy model. The notebook writes a shell script with the following lines of code:

# f is the handle to the shell script the notebook creates; the path is
# assumed from the %sh call below.
with open(f"/dbfs/FileStore/{dbfs_folder}/temp/train.sh", "w") as f:
    # Point the config's "source = " entry at the current production model
    cmd = f'''
    awk '{{sub("source = ","source = /dbfs/FileStore/{dbfs_folder}/textcat/categories/model_{model_id}/model-best")}}1' config_.cfg > /dbfs/FileStore/{dbfs_folder}/temp/config.cfg
    '''
    f.write(cmd)

    # First training pass on the imbalanced set
    cmd = f'''
    python -m spacy train /dbfs/FileStore/{dbfs_folder}/temp/config.cfg --output "/dbfs/FileStore/{dbfs_folder}/temp/model2" --paths.train "/dbfs/FileStore/{dbfs_folder}/temp/train_imbalanced.spacy" --paths.dev "/dbfs/FileStore/{dbfs_folder}/temp/eval.spacy"
    '''
    f.write(cmd)

    # Re-point the config at the model produced by the previous pass
    cmd = f'''
    awk '{{sub("source = ","source = /dbfs/FileStore/{dbfs_folder}/temp/model2/model-best")}}1' config_.cfg > /dbfs/FileStore/{dbfs_folder}/temp/config.cfg
    '''
    f.write(cmd)

    # Second pass on the semi-balanced set
    cmd = f'''
    python -m spacy train /dbfs/FileStore/{dbfs_folder}/temp/config.cfg --output "/dbfs/FileStore/{dbfs_folder}/temp/model2" --paths.train "/dbfs/FileStore/{dbfs_folder}/temp/train_semibalanced.spacy" --paths.dev "/dbfs/FileStore/{dbfs_folder}/temp/eval.spacy"
    '''
    f.write(cmd)

    # Final pass on the balanced set
    cmd = f'''
    python -m spacy train /dbfs/FileStore/{dbfs_folder}/temp/config.cfg --output "/dbfs/FileStore/{dbfs_folder}/temp/model2" --paths.train "/dbfs/FileStore/{dbfs_folder}/temp/train_balanced.spacy" --paths.dev "/dbfs/FileStore/{dbfs_folder}/temp/eval.spacy"
    '''
    f.write(cmd)

The notebook then runs the script with %sh /dbfs/FileStore/"$fld"/temp/train.sh.
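
To clarify what the awk step does: sub() rewrites the first "source = " occurrence on each line, and the trailing 1 prints every line into the new config. So, assuming the template config_.cfg carries a bare entry like

source = 

the generated /dbfs/FileStore/<dbfs_folder>/temp/config.cfg ends up with

source = /dbfs/FileStore/<dbfs_folder>/textcat/categories/model_<model_id>/model-best

(angle brackets mark the values filled in by the f-string).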

From what I can tell, spacy train uses shutil.copytree, which no longer seems to work on files stored in DBFS. It returns the error

shutil.Error: [('/dbfs/FileStore/Prod/temp/model2/model-last/config.cfg', '/dbfs/FileStore/Prod/temp/model2/model-best/config.cfg', '[Errno 1] Operation not permitted')

for each file in the tree. This notebook was working the last time we ran it, which was about 10 months ago. Any ideas what could be going wrong?
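
For reference, the failure reproduces without spaCy at all. A minimal sketch, using the paths from the error above:

import shutil

# Copying a model directory on the DBFS FUSE mount: the file contents copy,
# but the metadata step (copystat/chmod) is typically refused, and copytree
# collects one ('src', 'dst', '[Errno 1] Operation not permitted') tuple
# per file before raising shutil.Error at the end.
shutil.copytree(
    "/dbfs/FileStore/Prod/temp/model2/model-last",
    "/dbfs/FileStore/Prod/temp/model2/model-best",
)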

Kumaran
Databricks Employee

Hi @AndersenHuang,

Thank you for contacting Databricks community support.

The error message you're encountering suggests a permission issue while copying the files. It's possible that the permissions on the directory /dbfs/FileStore/Prod/temp/model2/ have changed, or that a platform update changed how the DBFS FUSE mount handles the file-metadata operations (chmod and the like) that shutil.copytree performs.

One possible solution is to use dbutils.fs.cp instead of shutil.copytree. dbutils.fs.cp is a Databricks utility function for copying files within DBFS; note that it takes dbfs:/ URIs rather than the /dbfs/ FUSE paths. Here's an example of how you could modify your code to use it:

# dbutils.fs resolves bare paths against dbfs:/, so use dbfs:/FileStore/...
# here rather than the /dbfs/FileStore/... FUSE form.
dbutils.fs.cp("dbfs:/FileStore/Prod/temp/model2/model-last/config.cfg",
              "dbfs:/FileStore/Prod/temp/model2/model-best/config.cfg")
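
And since spacy train copies the whole model directory, the recursive form may be closer to what you need; a sketch with the same assumed paths:

# Copy the entire model directory tree in one call; recurse=True makes
# dbutils.fs.cp descend into subdirectories.
dbutils.fs.cp(
    "dbfs:/FileStore/Prod/temp/model2/model-last",
    "dbfs:/FileStore/Prod/temp/model2/model-best",
    recurse=True,
)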

Another possible solution is to check the permissions of the directory /dbfs/FileStore/Prod/temp/model2/ and make sure that the user running the notebook has the necessary permissions to read and write to the directory.
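
For a quick sanity check from a notebook cell (path assumed from the error message):

# Listing the directory confirms the path exists and is readable; a failure
# here points at permissions or a missing path rather than a shutil change.
display(dbutils.fs.ls("dbfs:/FileStore/Prod/temp/model2/"))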

It's also possible that an update changed the way shutil's operations behave on the DBFS file system. Since shutil ships with the Python standard library, its behavior is tied to the Python version, so you could try running the notebook on an older Databricks Runtime (and therefore an older Python) to see if that resolves the issue.

Finally, it's worth noting that the error you're encountering is not specific to spaCy, but rather to shutil and the way it interacts with the DBFS file system, so the solution may not be spaCy-specific either.
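
One pattern that avoids shutil on the FUSE mount entirely, sketched under the assumption that the cluster's local disk (/local_disk0 here) has space for the model: point --output at local disk, then copy the finished model to DBFS.

import subprocess

# Train on the cluster's local disk, where copytree and the chmod calls it
# makes behave normally; reads from /dbfs are unaffected.
subprocess.run(
    [
        "python", "-m", "spacy", "train",
        "/dbfs/FileStore/Prod/temp/config.cfg",
        "--output", "/local_disk0/tmp/model2",  # local path, not /dbfs
        "--paths.train", "/dbfs/FileStore/Prod/temp/train_imbalanced.spacy",
        "--paths.dev", "/dbfs/FileStore/Prod/temp/eval.spacy",
    ],
    check=True,
)

# file:/ marks the local source; recurse=True copies the whole tree to DBFS.
dbutils.fs.cp("file:/local_disk0/tmp/model2",
              "dbfs:/FileStore/Prod/temp/model2", recurse=True)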
