Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to read data from a table into a dataframe outside of Databricks environment?

AnuVat
New Contributor III

Hi,

I am working on an ML project and need to access the data in tables hosted in my Databricks cluster from a notebook that I am running locally. This is easy when I run notebooks inside Databricks, but I cannot figure out how to do the same from my local notebook.

I want to accomplish the same functionality in the snippet below.

spark_dataframe = spark.read.table(table_name)

How can I get a dataframe like this when I am not in the Databricks environment?

Thanks in advance 🙂


7 REPLIES

Anonymous
Not applicable

The cluster, Databricks notebook, and table are all in the cloud. Your local machine is not. If you want to read data onto your machine, you have a few options:

  1. There is a download button in Databricks notebooks that will let you download the data locally as a CSV file
  2. You can use your storage tool (S3, ADLS Gen2) to download the data
  3. You can connect to a Databricks cluster from your local notebook. DBX may help here; a sketch using Databricks Connect follows this list
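
As a rough sketch of option 3: with Databricks Connect (the databricks-connect package, v2 on Runtime 13+), a local notebook can run the same snippet from the question. The host, token, and cluster ID below are placeholders, and the builder API can differ between versions, so double-check against the Databricks Connect docs rather than treating this as a definitive recipe:

# pip install databricks-connect
from databricks.connect import DatabricksSession

# Placeholder workspace details -- substitute your own values.
spark = DatabricksSession.builder.remote(
    host="https://<workspace-host>.cloud.databricks.com",
    token="<personal-access-token>",
    cluster_id="<cluster-id>",
).getOrCreate()

# The same call that works inside a Databricks notebook now works locally.
spark_dataframe = spark.read.table("my_schema.my_table")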

All that said, your best option for an ML project is the Databricks ML Runtime. It has MLflow built in and will give you better scalability than working locally. It also has AutoML, which is a great starting step for data profiling and simple model building.

AnuVat
New Contributor III

That's super helpful. Thanks for clarifying, Joseph!

Hi @Anu Vats,

Just a friendly follow-up. Did Joseph's response help you resolve your issue? If it did, please mark it as the best answer.

Mickeylopez
New Contributor III
(Accepted solution)

To read data from a table into a dataframe outside of the Databricks environment, you can use Pandas together with a database connector library such as pyodbc; which connector is right depends on the type of table and database you are using. Here are the general steps you can follow:

Install the necessary libraries: If you are using a library like Pandas, you can install it with pip. For example, open a terminal or command prompt and type: pip install pandas.

Import the library: In your Python script or notebook, import the library using the import statement. For example: import pandas as pd.

Connect to the database: Depending on the type of database you are using, you will need to provide connection details such as the server address, database name, username, and password. If you are using pyodbc, you can use the pyodbc.connect function to create a connection object. For example, for a SQL Server database:

import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=myServerName;'
                      'Database=myDatabaseName;'
                      'Trusted_Connection=yes;')

Read the data into a dataframe: Once you have established a connection, you can use the pd.read_sql function in Pandas to read the data into a dataframe. For example:

df = pd.read_sql('SELECT * FROM myTable', conn)

This reads all the data from the "myTable" table into a dataframe called "df". You can then manipulate the data as needed using Pandas functions.
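
Since the tables in this question live in a Databricks workspace rather than a SQL Server instance, the same pattern can be followed with the databricks-sql-connector package. The hostname, HTTP path, and token below are placeholders for the values on a cluster's or SQL warehouse's "Connection details" tab; this is a minimal sketch under those assumptions:

from databricks import sql  # pip install databricks-sql-connector
import pandas as pd

# Placeholder connection details -- copy the real values from your
# cluster's or SQL warehouse's "Connection details" tab.
with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM my_schema.my_table")
        # Standard DBAPI attributes: column names come from cursor.description.
        columns = [desc[0] for desc in cursor.description]
        df = pd.DataFrame.from_records(cursor.fetchall(), columns=columns)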

chakri
New Contributor III

Very helpful answer

Anonymous
Not applicable

Hi @Anu Vats,

Hope all is well! Just wanted to check in on whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

chakri
New Contributor III

We can use APIs and pyodbc to achieve this. Going through the official Databricks documentation may also help with accessing data from outside the Databricks environment. A sketch of the pyodbc route follows below.
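
For completeness, here is roughly what the pyodbc route looks like against a Databricks workspace. It assumes the Databricks (Simba Spark) ODBC driver is installed, and the driver name, connection-string keys, and placeholder values should all be checked against your driver's documentation; this is a sketch, not a definitive configuration:

import pyodbc
import pandas as pd

# Placeholder values; keys follow the Databricks (Simba Spark) ODBC driver.
conn = pyodbc.connect(
    "Driver={Simba Spark ODBC Driver};"
    "Host=<workspace-host>.cloud.databricks.com;"
    "Port=443;"
    "HTTPPath=/sql/1.0/warehouses/<warehouse-id>;"
    "SSL=1;"
    "ThriftTransport=2;"   # 2 = HTTP transport
    "AuthMech=3;"          # 3 = username/password; use the literal user "token"
    "UID=token;"
    "PWD=<personal-access-token>;",
    autocommit=True,
)
df = pd.read_sql("SELECT * FROM my_schema.my_table", conn)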
