03-06-2024 11:18 PM
I am trying to load data from a table in a SQL warehouse using spark.sql("SELECT * FROM <table>") in a Spark Submit job, but the job fails with [TABLE_OR_VIEW_NOT_FOUND] The table or view . The same statement works in a notebook but not in the job. How can I query the SQL warehouse from a Spark job?
03-07-2024 01:23 AM
Hi @akhileshp, When querying a SQL warehouse from a Spark job, there are a few considerations to keep in mind.
Let's break down the steps to address your issue:
Databricks SQL Connector:
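One option is the databricks-sql-connector package, which talks to the warehouse directly over its HTTP endpoint. A minimal sketch, assuming the package is installed and that the hostname, HTTP path, token, and table name are replaced with your real values (the env-var names and the query_warehouse helper are my own, for illustration):

```python
import os

def query_warehouse(server_hostname: str, http_path: str, access_token: str):
    """Run a query against a Databricks SQL warehouse and return the rows."""
    from databricks import sql  # pip install databricks-sql-connector
    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM democatalog.users.people LIMIT 10")
            return cursor.fetchall()

# Only attempt a connection when real credentials are supplied via env vars.
if os.getenv("DATABRICKS_HOST") and os.getenv("DATABRICKS_HTTP_PATH") \
        and os.getenv("DATABRICKS_TOKEN"):
    rows = query_warehouse(os.environ["DATABRICKS_HOST"],
                           os.environ["DATABRICKS_HTTP_PATH"],
                           os.environ["DATABRICKS_TOKEN"])
    print(rows)
```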
PySpark Approach:
#!/usr/bin/env python3
from pyspark.sql import SparkSession

# Set up the SparkSession
spark = SparkSession.builder \
    .appName("SQL Warehouse Query") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

# Specify your SQL warehouse connection details
server_hostname = "adb-xxx.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/xxx"
access_token = "xxx"

# Read data from your SQL warehouse table via the Databricks JDBC driver.
# Note the URL scheme is jdbc:databricks:// (not jdbc:sqlserver://), the
# user is the literal string "token", and the password is the access token.
table_name = "democatalog.users.people"
df = spark.read \
    .format("jdbc") \
    .option("url", f"jdbc:databricks://{server_hostname}:443;httpPath={http_path}") \
    .option("driver", "com.databricks.client.jdbc.Driver") \
    .option("dbtable", table_name) \
    .option("user", "token") \
    .option("password", access_token) \
    .load()

# Show the first few rows
df.show()
Adjustments:
- Replace the connection details (server_hostname, http_path, and access_token) with your actual values.
- Change table_name to match the table you want to query.
Remember to adapt the code to your specific environment and SQL warehouse configuration. If you encounter any further issues, feel free to ask for additional assistance!
03-07-2024 03:23 AM
I am running the Spark job in Databricks, so it is running in the Databricks Runtime itself. Why, then, is the behaviour different between the notebook and a Spark Submit job?
03-07-2024 04:12 AM
Hey,
Is your table in Unity Catalog or in the Hive Metastore?
Just my guess(es):
- Use the full table path, Catalog_name.Schema_Name.Table_name (maybe you had some context saved when querying the data interactively).
- If your table is in UC: is your job cluster running in a UC-enabled access mode (single user / shared)?
From your post it is a bit unclear where and what you are running.
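To illustrate the full-path suggestion above: a small hypothetical helper (the function name and quoting style are my own, not part of the original advice) that builds a fully qualified three-level name to pass to spark.sql, so the query does not depend on any saved catalog/schema context:

```python
def fully_qualified(catalog: str, schema: str, table: str) -> str:
    """Build a three-level table name, backquoting each part so that
    names containing special characters still parse in Spark SQL."""
    return ".".join(f"`{part}`" for part in (catalog, schema, table))

name = fully_qualified("hive_metastore", "default", "people")
print(name)  # `hive_metastore`.`default`.`people`
# Then query with the unambiguous path, e.g.:
# spark.sql(f"SELECT * FROM {name}")
```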
03-07-2024 04:18 AM
The table is in hive_metastore and I am using the full path.
The job compute is created using an instance pool
03-07-2024 06:19 AM
- When you query the table manually and when you run the job, do both of those actions happen in the same Databricks workspace?
- What is the job configuration: who is the job owner or "Run As" account, and does this principal/persona have access to the table?
03-07-2024 07:45 AM
Yes, both are in the same workspace and querying the same table.
Yes, the principal has all the access.
Also, it's working with a Spark Python job; the problem is with the Spark Submit job only.
My use case is supported only by a Spark Submit job, since I have multiple scripts in a zip file.
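For reference, the multiple-scripts-in-a-zip setup is typically expressed by shipping the zip with --py-files so the bundled modules are importable on the executors; a sketch, with all paths hypothetical:

```shell
# Hypothetical paths; ship the zipped modules alongside the entry script so
# that imports from the zip resolve at runtime.
spark-submit \
  --py-files dbfs:/jobs/my_scripts.zip \
  dbfs:/jobs/main.py
```

The same --py-files dependency mechanism is also available to Databricks Python tasks, which may be worth trying since the Python job reportedly works.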