Data Engineering

run md5 using CLI

pshuk
New Contributor III

Hi,

I want to run an MD5 checksum on a file uploaded to Databricks. I can generate the MD5 of the local file, but how do I generate one for the uploaded file on Databricks using the CLI (command-line interface)? Any help would be appreciated.

I tried running databricks fs md5, but it reports that md5 is not supported.

2 REPLIES

Kaniz
Community Manager

Hi @pshuk, unfortunately the databricks fs md5 command is not supported. Some alternatives:

  1. You can run a Python script to compute the MD5 hash of the uploaded file.
  2. If your uploaded file is stored in Azure Blob Storage, you can use the azcopy tool to calculate the MD5 hash and set it as the Content-MD5 property of the blob.
  3. If you’re using Databricks on AWS, you can use the AWS CLI to upload the file to an S3 bucket and include the MD5 checksum in the metadata. 
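As a rough illustration of the first option, here is a minimal Python sketch that computes the MD5 of a file in chunks, so large files do not have to fit in memory. The function name md5_of_file and the chunk size are illustrative choices, not part of any Databricks API. It assumes the code runs somewhere the uploaded file is readable as an ordinary path (for example, a notebook on a cluster that can see the volume).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks.

    `md5_of_file` is a hypothetical helper name used for illustration.
    """
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running the same function on the local copy and on the uploaded copy, then comparing the two hex strings, should show whether the contents are identical. The path you pass in depends on your workspace and is not shown here.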

pshuk
New Contributor III

Thanks, Kaniz. I do compute the MD5 hash of the file locally and then upload the file to a Databricks volume. I suppose the underlying storage is Azure Data Lake Storage Gen2, but my code (running on my local machine) is not able to generate an MD5 for the uploaded file.

If we take a step back, the only reason I am computing an MD5 checksum is to verify data integrity. If there is any other way to confirm that the file uploaded from on-prem to the Databricks volume is exactly the same as the original, my problem would be solved. Any ideas or suggestions?
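One possible way to check integrity without a server-side checksum is a round trip: download the uploaded file back to the local machine, then compare it with the original. The sketch below covers only the comparison step and assumes the download has already been done by whatever means is available in your setup. It uses SHA-256 together with the file size rather than MD5; that is a swapped-in choice, and MD5 would serve the same purpose here. The names file_fingerprint and files_match are hypothetical.

```python
import hashlib


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, sha256_hex) for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            size += len(chunk)
            digest.update(chunk)
    return size, digest.hexdigest()


def files_match(original_path, roundtrip_path):
    """True when both files have the same size and the same SHA-256 digest."""
    return file_fingerprint(original_path) == file_fingerprint(roundtrip_path)
```

If files_match returns True for the on-prem original and the downloaded copy, the upload preserved the bytes. The cost is one extra download per file, which may matter for very large files.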