Urgent - Use Python Variable in shell command in databricks notebook
01-12-2023 07:10 AM
I am trying to read a CSV from an Azure storage account and process it with a Databricks shell command. I want to add this shell step to my larger Python code, which also handles other sources. I have created widgets for the file path in Python, created variables named source_storage_path and file_datepath, and read the widget values into them. Now I want to read the CSV from the Azure storage account in a shell command using the file_datepath variable.
I tried the command below, but it is not working.
dbutils.widgets.text("source_storage_path", "datafolder/data/")
dbutils.widgets.text("file_datepath", "2022/12/")
source_storage_path = dbutils.widgets.get("source_storage_path")
file_datepath = dbutils.widgets.get("file_datepath")
%sh tr '\n' ' ' <'/dbfs/mnt/{source_storage_path}/{file_datepath}/*.csv' > '/dbfs/mnt/{source_storage_path}/{file_datepath}/*_new.csv'
But this gives me a "no such file or directory" error. I tried with $ and everything else I could think of, but nothing seems to work. Please help @Sherinus @Hubert Dudek @Werner Stinckens
01-12-2023 07:33 AM
Yeah, I don't think that will work.
When you execute a shell command, you are working at the OS level. Linux does not know about DBFS or Azure or AWS or ...
If you want to do this, you have to mount the data lake in Linux, and that is not easy.
Databricks (or the Spark application) bootstraps all of this and gives you DBFS so you do not have to worry about connectivity.
May I ask why you want to do this with a shell command?
01-12-2023 08:39 AM
Hi Werners,
I have a CSV which has a double dagger delimiter and UTF-16 encoding. It has extra lines and spaces, and some rows end with CRLF while others end with LF. So I created a shell script to handle this, and now I want to integrate it with my bigger Python commands.
%sh tr '\n' ' ' <'/dbfs/mnt/datalake/data/file.csv' > '/dbfs/mnt/datalake/data/file_new.csv'
from pyspark.sql.functions import regexp_replace

dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("encoding", "UTF-16") \
    .option("delimiter", "‡‡,‡‡") \
    .option("multiLine", True) \
    .csv("/mnt/datalake/data/file_new.csv")

dffs_headers = dff.dtypes
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '')
    # strip leading/trailing ‡‡ fragments from the values
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^‡‡|‡‡$| ‡‡', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
display(dff)
Now I want to parameterise every path, which is why I wrote the widgets and the get-widget calls.
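For the Spark read itself the widget values can be interpolated with a Python f-string, for example (a rough sketch of what I mean, reusing the widgets from my first post; the file name is just a placeholder):

source_storage_path = dbutils.widgets.get("source_storage_path")
file_datepath = dbutils.widgets.get("file_datepath")
dff = spark.read.option("header", "true") \
    .option("encoding", "UTF-16") \
    .option("delimiter", "‡‡,‡‡") \
    .option("multiLine", True) \
    .csv(f"/mnt/{source_storage_path}/{file_datepath}/file_new.csv")

So the only part I am stuck on is passing the variables into the %sh step.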
01-13-2023 12:11 AM
No need for a shell script. With Spark and regex you can handle even the most messed-up files.
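For example, something along these lines (just a sketch, reusing the path and ‡‡,‡‡ delimiter from your post; regexp_replace can also strip the stray carriage returns, so the tr preprocessing step is not needed):

from pyspark.sql.functions import regexp_replace

dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("encoding", "UTF-16") \
    .option("delimiter", "‡‡,‡‡") \
    .option("multiLine", True) \
    .csv("/mnt/datalake/data/file.csv")

for columnLabel in dff.columns:
    newColumnLabel = columnLabel.replace('‡‡', '')
    # remove stray \r characters and leftover ‡‡ fragments from the values
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '\r|^‡‡|‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)

display(dff)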
04-05-2023 01:31 AM
You can mount the storage account, then set an environment variable and do the operation that you want.
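For example (a sketch only; the variable names are placeholders, and values set with os.environ in a Python cell should be visible to later %sh cells as ordinary shell variables):

import os
os.environ["SOURCE_STORAGE_PATH"] = dbutils.widgets.get("source_storage_path")
os.environ["FILE_DATEPATH"] = dbutils.widgets.get("file_datepath")

Then reference them in the shell cell with $ (a single file name is used here; redirecting into a *_new.csv glob as in the original attempt would need a shell loop):

%sh tr '\n' ' ' < /dbfs/mnt/"$SOURCE_STORAGE_PATH"/"$FILE_DATEPATH"/file.csv > /dbfs/mnt/"$SOURCE_STORAGE_PATH"/"$FILE_DATEPATH"/file_new.csv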

