Databricks Free Trial Help
Engage in discussions about the Databricks Free Trial within the Databricks Community. Share insights, tips, and best practices for getting started, troubleshooting issues, and maximizing the value of your trial experience to explore Databricks' capabilities effectively.

How can I import multiple Files (stored locally) into Databricks Tables

giuseppe_esq
New Contributor III

First of all, Hello Everyone!

This is my first post on here. 

For the past few months, I have been working on my personal Microsoft Azure blog, which includes some Azure Databricks user guides. Sadly, as I am currently searching for work (which is rather challenging in the UK at present), I cannot afford the fees associated with using Microsoft Azure.

Therefore, I decided to create a Databricks Free Edition account to see what is achievable whilst minimising costs.

Previously, I have used Data Factory to import several CSV files with different schemas (for example, Customer.csv, Employees.csv) into an Azure Data Lake Storage Gen2 container. I then created a mount script within an Azure Databricks notebook to mount the Data Lake, and created Databricks tables for each CSV file automatically.
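
For context, the mount script followed the standard ADLS Gen2 OAuth pattern, roughly like the sketch below (placeholder values only, not my real configuration), with a loop afterwards to register a table per CSV:

# Standard ADLS Gen2 OAuth mount pattern (placeholder values only)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# Register a Databricks table for each CSV found in the mounted folder
for f in dbutils.fs.ls("/mnt/datalake"):
    if f.name.endswith(".csv"):
        table_name = f.name[:-4].lower()  # e.g. Customer.csv -> customer
        (spark.read.option("header", True).option("inferSchema", True).csv(f.path)
            .write.mode("overwrite").saveAsTable(table_name))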

My first question is: within Databricks Free Edition, is there any way to create an automated process that will allow me to import all of the CSV files (e.g. Customer, Employees, Products, Orders, etc.), stored locally, into newly created Databricks tables, please?

I am not sure whether the Free Edition's functionality is too limited to allow this.

If not, I assume I would need to use the Data Ingestion > Create or Modify Table function and upload each file individually?

Any guidance would be greatly appreciated.

Thanks

Giuseppe


7 REPLIES

BS_THE_ANALYST
Honored Contributor III

Hi @giuseppe_esq. I'm currently exploring the free edition too!

 

To answer your question about a Mount Script with a notebook, I've a feeling this won't be achievable (could be proven wrong though). 

I looked at the documentation on the Free Edition limitations/compute: https://docs.databricks.com/aws/en/getting-started/free-edition-limitations 

[Screenshot: compute section of the Free Edition limitations documentation]

 

The all-purpose compute is serverless, and according to the limitations in the serverless documentation:
https://docs.databricks.com/aws/en/compute/serverless/limitations

[Screenshot: excerpt from the serverless compute limitations documentation]

I appreciate that this is talking about AWS, but I think the same will be true for Azure. I still think it's worth trying, though 😂

I'll answer your other question in a subsequent post.

All the best,
BS

BS_THE_ANALYST
Honored Contributor III

Hi @giuseppe_esq, to answer your question about:

is there any way to create an automated process, that will allow me to import all of the CSV files (e.g. Customer, Employees, Products, Orders etc), stored locally, into newly created Databricks tables please?  

I think there are a couple of routes we could look at for this. First of all, if you're on a Windows machine, you could use something like Task Scheduler to run a Python script (or a script of your choice) that moves/creates the files you want in something like ADLS Gen2 (or leverage one of the inbuilt Fivetran connectors?), and then let Databricks scan that storage. The key here, I guess, is using Task Scheduler to automate moving the data off the local machine.
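
To illustrate that first route, here's a rough sketch of the kind of local script Task Scheduler could run to push CSVs into an ADLS Gen2 container (assuming the azure-storage-blob package and a connection string held in an environment variable; all names are placeholders):

import os
from azure.storage.blob import BlobServiceClient

# Placeholder values: adjust the local folder, container name and env var to suit
LOCAL_DIR = r"C:\Data\csv_exports"
CONTAINER = "raw"

service = BlobServiceClient.from_connection_string(os.environ["ADLS_CONNECTION_STRING"])
container_client = service.get_container_client(CONTAINER)

# Upload every CSV in the local folder; overwrite=True keeps re-runs simple
for file_name in os.listdir(LOCAL_DIR):
    if file_name.endswith(".csv"):
        with open(os.path.join(LOCAL_DIR, file_name), "rb") as f:
            container_client.upload_blob(name=file_name, data=f, overwrite=True)
        print(f"Uploaded {file_name} to container '{CONTAINER}'")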

You could also look at https://docs.databricks.com/aws/en/dev-tools/sdk-python. This looks really cool! I haven't used it yet, but it seems like we could do something here. Check this post out: https://community.databricks.com/t5/data-engineering/is-it-possible-to-load-data-only-using-databric... 

[Screenshot: excerpt from the linked community post on loading data with the Databricks SDK]

This seems like a promising route to explore 🙏. Hopefully, those routes may be of use! 

Would love to see what the other community members recommend 👌

All the best,
BS

 

giuseppe_esq
New Contributor III

Thank you very much for the highly detailed and useful response. I suppose I am restricted by using the Free Edition. I will look into your suggestions.

As I am currently out of work and on a limited budget, do you know if there are any suitable Databricks subscriptions I could use, please, so I can minimise my spending? I have been using Microsoft Azure, but their costs for Data Factory and Databricks build up very quickly.

I would love to see Databricks have its own Data Factory-type feature one day.

BS_THE_ANALYST
Honored Contributor III

@giuseppe_esq 

I built out a local solution based on my advice above using the Databricks Python SDK & ChatGPT (of course) 😂. I can confirm that I have been able to upload files from my local storage straight to my Free Edition Databricks environment.

I just needed to install the Databricks CLI and the Databricks SDK for Python. The CLI took a couple of commands at the command prompt: one to install it and another to set up authentication to my Databricks environment using the CLI. I then created a Python virtual environment (not that you need to) and installed the Python SDK.
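
For what it's worth, here's a minimal sketch of checking that the SDK picks up the CLI authentication (assuming the default profile written to .databrickscfg):

from databricks.sdk import WorkspaceClient

# Picks up the default profile created by `databricks configure`
w = WorkspaceClient()

# If authentication is working, this returns the signed-in user
me = w.current_user.me()
print(f"Authenticated as: {me.user_name}")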

Example of my Python script (run locally):

1. Showing it can authenticate to my Free Edition workspace:

[Screenshot: output confirming authentication to the Free Edition workspace]

2. Here's the code I used to upload:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import VolumeType
import os

# Configuration
local_dir = r"C:\Users\benja\BLANK\Documents\Databricks\DummyData\GenerateData\Dataset.1"  # Local CSV directory
volume_path = "/Volumes/dataengineering/dataingestion/cloudfiles/"  # Target volume path
catalog = "dataengineering"
schema = "dataingestion"
volume = "cloudfiles"

# Initialize WorkspaceClient (uses %USERPROFILE%\.databrickscfg)
w = WorkspaceClient()

# Step 1: Create volume if it doesn't exist
try:
    w.volumes.create(
        catalog_name=catalog,
        schema_name=schema,
        name=volume,
        volume_type=VolumeType.MANAGED,  # Use enum instead of string
        storage_location=None,  # None for managed volumes
        comment="Volume for CSV uploads"
    )
    print(f"Created volume {volume_path}")
except Exception as e:
    if "already exists" in str(e):
        print(f"Volume {volume_path} already exists")
    else:
        raise e

# Step 2: Upload CSV files to the volume
print("Uploading CSV files to volume...")
for file_name in os.listdir(local_dir):
    if file_name.endswith(".csv"):
        local_file_path = os.path.join(local_dir, file_name)
        volume_file_path = f"{volume_path}{file_name}"
        with open(local_file_path, "rb") as f:
            w.files.upload(file_path=volume_file_path, contents=f.read(), overwrite=True)
        print(f"Uploaded {file_name} to {volume_file_path}")

# Step 3: Verify uploaded files
print("Listing files in volume:")
files = w.files.list_directory_contents(directory_path=volume_path)
for file in files:
    print(f"File: {file.path}")

Here's my local folder:

[Screenshot: the local folder containing the CSV files]



Here are the two CSVs in Databricks (and me reading one in):

[Screenshot: the uploaded CSVs in the Databricks volume, with one read into a DataFrame]
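
The "reading one in" part is just a standard CSV read against the volume path, something like this (assuming Customer.csv is one of the uploaded files):

# In a Databricks notebook: read one of the uploaded CSVs straight from the volume
df = (spark.read
      .option("header", True)
      .csv("/Volumes/dataengineering/dataingestion/cloudfiles/Customer.csv"))
display(df)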

That was a really fun exercise! 😎 I think now it's a case of getting this Python script to run on a schedule using something like Windows Task Scheduler (or another orchestration method), or just triggering it as and when you need it.

If this solution has helped, please mark the useful posts I provided as solutions.

Any questions, please reach out. 

All the best,
BS

giuseppe_esq
New Contributor III

Awesome, thanks again.  Sorry if I sound clueless, but is there a reason why you use VS Code to create the SDK please? Is that to create it locally on my PC?

So once the CSV files have been imported, I assume you can create Delta tables for these in Databricks?

Not sure if it is helpful, but I have created some user guides for Azure etc. on my blog www.rainbowdatasolutions.com (I'm still learning 🙂).

BS_THE_ANALYST
Honored Contributor III

@giuseppe_esq personally, I'd love to get the Azure certs! I'll definitely be following the blog. Thanks a bunch for linking that 👌. I'm also learning Databricks so do keep in touch. 

To answer this question:

Awesome, thanks again.  Sorry if I sound clueless, but is there a reason why you use VS Code to create the SDK please? Is that to create it locally on my PC?

So once the CSV files have been imported, I assume you can create Delta tables for these in Databricks?

VS Code is just the environment I've always used to create Python scripts / Jupyter notebooks. It also has its own terminal, i.e. Command Prompt / PowerShell, which means you can run command-line commands, e.g. pip install the Databricks SDK and install the Databricks CLI. You don't have to use VS Code for this.

The reason I chose the SDK/CLI route is that it means you can push things from your local machine straight to your Databricks environment (which is really cool). Once the files are uploaded to a volume, you can then use Databricks itself to create the Delta tables 👌. You could use something like CREATE TABLE AS (CTAS).

See this:

[Slides: CREATE TABLE AS (CTAS) examples from the Data Engineering cohort material]

These slides are from the partner Data Engineering cohort I'm on (if you're curious). The COPY INTO command is also cool: it's idempotent, so it'll only pick up new files! Definitely check that command out.
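
As a rough sketch of both options in a notebook cell (reusing the catalog/schema/volume names from the upload script above, and assuming one of the uploaded files is Customer.csv; read_files infers the schema from the CSV header):

# CTAS: create a Delta table straight from an uploaded CSV
spark.sql("""
    CREATE OR REPLACE TABLE dataengineering.dataingestion.customer AS
    SELECT * FROM read_files(
        '/Volumes/dataengineering/dataingestion/cloudfiles/Customer.csv',
        format => 'csv',
        header => true
    )
""")

# COPY INTO: idempotent -- re-running only loads files it hasn't seen before.
# The empty placeholder table plus mergeSchema lets COPY INTO infer the columns.
spark.sql("CREATE TABLE IF NOT EXISTS dataengineering.dataingestion.customer_copy")
spark.sql("""
    COPY INTO dataengineering.dataingestion.customer_copy
    FROM '/Volumes/dataengineering/dataingestion/cloudfiles/'
    FILEFORMAT = CSV
    PATTERN = 'Customer*.csv'
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")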

Any questions, please reach out 🙂. If any of the posts have been useful, could you like them or mark them as a solution if they've solved your problem?

All the best.
BS
