07-13-2025 05:17 AM
First of all, Hello Everyone!
This is my first post on here.
For the past few months, I have been working on my personal Microsoft Azure blog, which includes some Azure Databricks user guides. Sadly, as I am currently searching for work (which is rather challenging in the UK at present), I cannot afford to pay the fees associated with using Microsoft Azure.
Therefore, I decided to create a Databricks Free Edition Account, to see what is achievable, whilst minimising the costs.
Previously, I have used Data Factory to import several CSV files with different schemas (for example, Customer.csv, Employees.csv) into an Azure Data Lake Storage Gen2 container. I then created a mount script within an Azure Databricks notebook to mount the Data Lake, and created Databricks tables for each CSV file automatically.
My first question is: within Databricks Free Edition, is there any way to create an automated process that will allow me to import all of the CSV files (e.g. Customer, Employees, Products, Orders, etc.), stored locally, into newly created Databricks tables, please?
I am not sure whether the Free Edition's limited functionality would allow me to do this.
If not, I assume I would need to use the Data Ingestion > Create or Modify Table function, and upload each file individually?
Any guidance would be greatly appreciated.
Thanks
Giuseppe
07-13-2025 10:11 AM
Hi @giuseppe_esq. I'm currently exploring the free edition too!
To answer your question about a mount script within a notebook, I have a feeling this won't be achievable (I could be proven wrong, though).
I looked at the documentation on the Free Edition limitations/compute: https://docs.databricks.com/aws/en/getting-started/free-edition-limitations
& the all-purpose compute is serverless. And according to the limitations on the serverless documentation:
https://docs.databricks.com/aws/en/compute/serverless/limitations
I appreciate that this is talking about AWS , but I think the same will be true for Azure. I still think it's worth trying, though 😂.
I'll answer your other question in a subsequent post.
All the best,
BS
07-13-2025 10:30 AM - edited 07-13-2025 10:45 AM
Hi @giuseppe_esq to answer your question about:
is there any way to create an automated process, that will allow me to import all of the CSV files (e.g. Customer, Employees, Products, Orders etc), stored locally, into newly created Databricks tables please?
I think there are a couple of routes we could look at for this. First of all, if you're on a Windows machine, you could use something like Task Scheduler to run a Python script (or a script of your choice) that moves/creates the files you want in something like ADLS Gen2 (or leverage one of the built-in Fivetran connectors?) and then let Databricks scan that storage. The key here, I guess, is using Task Scheduler to automate the step that gets the data off your local machine; a rough sketch of that upload step is below.
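Purely as an illustration of that first route (I haven't run this exact script, and the storage account, container and credential below are placeholders), a local upload script that Task Scheduler could trigger might look something like this, using the azure-storage-blob package:

# Hedged sketch: push local CSVs into an ADLS Gen2 container so Databricks
# can pick them up. Account URL, credential, container and folder are
# placeholders -- replace with your own values.
import os
from azure.storage.blob import BlobServiceClient

LOCAL_DIR = r"C:\Data\csv"                                        # hypothetical local folder
ACCOUNT_URL = "https://<storage-account>.blob.core.windows.net"   # placeholder account
CONTAINER = "raw"                                                 # placeholder container

# Authenticate with an account key or SAS token (placeholder credential)
service = BlobServiceClient(account_url=ACCOUNT_URL, credential="<account-key-or-sas>")
container_client = service.get_container_client(CONTAINER)

for file_name in os.listdir(LOCAL_DIR):
    if file_name.endswith(".csv"):
        with open(os.path.join(LOCAL_DIR, file_name), "rb") as f:
            # overwrite=True keeps re-runs from Task Scheduler idempotent
            container_client.upload_blob(name=file_name, data=f, overwrite=True)
        print(f"Uploaded {file_name} to {CONTAINER}")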
You could also look at https://docs.databricks.com/aws/en/dev-tools/sdk-python. This looks really cool! I haven't used it yet, but it seems like we could do something with it. Check this post out: https://community.databricks.com/t5/data-engineering/is-it-possible-to-load-data-only-using-databric...
This seems like a promising route to explore 🙏. Hopefully, those routes may be of use!
Would love to see what the other community members recommend 👌
All the best,
BS
07-13-2025 10:42 AM
Thank you very much for the highly detailed and useful response. I suppose I am restricted by using the Free Edition. I will look into your suggestions.
As I am currently out of work and on a limited budget, do you know of any suitable Databricks subscriptions I could use, please, so I can minimise my spending? I have been using Microsoft Azure, but the costs for Data Factory and Databricks build up very quickly.
I would love to see Databricks have its own Data Factory-type feature one day.
07-13-2025 11:22 AM
@giuseppe_esq
I built out a local solution based on my advice above using the Databricks Python SDK & ChatGPT (of course) 😂. I can confirm that I have been able to upload files from my local storage straight to my Free Edition Databricks environment.
I just needed to install the Databricks CLI and the Databricks SDK for Python. The CLI setup was a couple of commands at the command prompt: one to install it and another to set up authentication against my Databricks environment. I then created a virtual environment in Python (not that you need to) and installed the Python SDK.
Screenshots of my Python script running locally, showing it can authenticate to my Free Edition workspace.
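For reference, that authentication check is essentially just creating a WorkspaceClient and making a simple call. A minimal sketch (assuming the default profile in .databrickscfg has already been set up via the Databricks CLI):

# Minimal sketch: confirm the SDK can authenticate using the default profile
# in %USERPROFILE%\.databrickscfg (created when configuring the Databricks CLI).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()        # picks up the default profile automatically
me = w.current_user.me()     # any simple call that requires a valid token
print(f"Authenticated to {w.config.host} as {me.user_name}")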
Here's the code I used to upload:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import VolumeType
import os

# Configuration
local_dir = r"C:\Users\benja\BLANK\Documents\Databricks\DummyData\GenerateData\Dataset.1"  # Local CSV directory
volume_path = "/Volumes/dataengineering/dataingestion/cloudfiles/"  # Target volume path
catalog = "dataengineering"
schema = "dataingestion"
volume = "cloudfiles"

# Initialize WorkspaceClient (uses %USERPROFILE%\.databrickscfg)
w = WorkspaceClient()

# Step 1: Create volume if it doesn't exist
try:
    w.volumes.create(
        catalog_name=catalog,
        schema_name=schema,
        name=volume,
        volume_type=VolumeType.MANAGED,  # Use enum instead of string
        storage_location=None,           # None for managed volumes
        comment="Volume for CSV uploads"
    )
    print(f"Created volume {volume_path}")
except Exception as e:
    if "already exists" in str(e):
        print(f"Volume {volume_path} already exists")
    else:
        raise e

# Step 2: Upload CSV files to the volume
print("Uploading CSV files to volume...")
for file_name in os.listdir(local_dir):
    if file_name.endswith(".csv"):
        local_file_path = os.path.join(local_dir, file_name)
        volume_file_path = f"{volume_path}{file_name}"
        with open(local_file_path, "rb") as f:
            w.files.upload(file_path=volume_file_path, contents=f.read(), overwrite=True)
        print(f"Uploaded {file_name} to {volume_file_path}")

# Step 3: Verify uploaded files
print("Listing files in volume:")
files = w.files.list_directory_contents(directory_path=volume_path)
for file in files:
    print(f"File: {file.path}")
Here's my local folder:
Here are the two CSVs in Databricks (and me reading one in):
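(For anyone following along, the "reading one in" part is just a standard Spark read against the volume path in a Databricks notebook; the file name below is hypothetical:)

# Inside a Databricks notebook (where `spark` is already defined):
# read one of the uploaded CSVs straight from the Unity Catalog volume.
df = spark.read.csv(
    "/Volumes/dataengineering/dataingestion/cloudfiles/Customer.csv",  # hypothetical file name
    header=True,
    inferSchema=True,
)
display(df)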
This was a really fun exercise! 😎 I think now it's a case of getting this Python script to run on a schedule using something like Windows Task Scheduler (or another orchestration method), or just triggering it as and when you need it.
If this solution has helped, please mark the posts that were useful as solutions.
Any questions, please reach out.
All the best,
BS
07-13-2025 01:21 PM
Awesome, thanks again. Sorry if I sound clueless, but is there a reason why you use VS Code to create the SDK please? Is that to create it locally on my PC?
So once the CSV files have been imported, I assume you can create Delta tables for these in Databricks?
07-13-2025 03:13 PM
Not sure if it is helpful, but I have created some user guides for Azure etc. on my blog www.rainbowdatasolutions.com (I'm still learning 🙂).
07-14-2025 01:12 AM
@giuseppe_esq personally, I'd love to get the Azure certs! I'll definitely be following the blog. Thanks a bunch for linking that 👌. I'm also learning Databricks so do keep in touch.
To answer this question:
Awesome, thanks again. Sorry if I sound clueless, but is there a reason why you use VS Code to create the SDK please? Is that to create it locally on my PC?
So once the CSV files have been imported, I assume you can create Delta tables for these in Databricks?
VS Code is just the environment I've always used to create Python scripts / Jupyter notebooks. It's also got its own terminal (i.e. Command Prompt / PowerShell), which means you can run the command-line steps such as pip-installing the Databricks SDK and installing the Databricks CLI. You don't have to use VS Code for this.
The reason I chose the SDK/CLI route is that it means you can push things from your local machine straight to your Databricks environment (which is really cool). Once the files are uploaded to a volume, you can then use Databricks itself to create the Delta tables 👌. You could use something like CREATE TABLE AS SELECT (CTAS); a rough sketch is below.
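As a rough, untested sketch (the target table name is hypothetical, the volume path matches the upload script earlier in the thread, and read_files is the Databricks SQL function for querying files in a volume), a CTAS run from a notebook might look like:

# Sketch: create a Delta table from a CSV sitting in the volume via CTAS.
# Run inside a Databricks notebook; the target table name is hypothetical.
spark.sql("""
    CREATE OR REPLACE TABLE dataengineering.dataingestion.customer AS
    SELECT *
    FROM read_files(
        '/Volumes/dataengineering/dataingestion/cloudfiles/Customer.csv',
        format => 'csv',
        header => true
    )
""")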
See the slides from the partner Data Engineering cohort I'm on (if you're curious). The COPY INTO command is also cool: it's idempotent, so it'll only pick up new files! Definitely check that command out; a sketch follows below.
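And a hedged sketch of the COPY INTO route (again run from a notebook; the target table is hypothetical and needs to be created beforehand):

# Sketch: idempotent load of new CSV files from the volume into an existing
# Delta table using COPY INTO (table name hypothetical).
spark.sql("""
    COPY INTO dataengineering.dataingestion.customer
    FROM '/Volumes/dataengineering/dataingestion/cloudfiles/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")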
Any questions, please reach out 🙂. If any of the posts have been useful, could you like them or mark them as a solution if they've solved your problem?
All the best.
BS