Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

SQL query on information_schema.tables via service principal

Sanky
New Contributor

Hi,

I have a simple Python notebook with the code below:

---

query = "select table_catalog, table_schema, table_name from system.information_schema.tables where table_type!='VIEW' and table_catalog='TEST' and table_schema='TEST'"
test = spark.sql(query)
display(test)
---
 
Running the above in my workspace returns a dataframe with the expected results. But running the same code as a task in a job that runs as a service principal returns nothing; the output is just "Query returned no results". I have ensured the service principal has the SELECT grant on system.information_schema.tables.
 
What am I missing here? Appreciate the help.
 
Thank you
 
 
1 REPLY

Kaniz_Fatma
Community Manager

Hi @Sanky, it seems you’re encountering an issue where your Spark job, running as a service principal, doesn’t return any results from the same query that works in your workspace.

Let’s troubleshoot this:

  1. Service Principal Permissions:

    • You mentioned that you’ve granted the service principal select permissions on system.information_schema.tables. Double-check that the permissions are correctly set and that there are no additional restrictions.
    • Ensure that the service principal has the necessary permissions to access the underlying data sources (e.g., HDFS, Hive, or other storage systems) that the query relies on.
  2. Hive Metastore Configuration:

    • When running Spark jobs, especially with Hive integration, ensure that the Hive metastore configuration is correctly set. The metastore stores metadata about tables, schemas, and other database objects.
    • Verify that the Spark session is configured to use the intended metastore. You can check the current setting with spark.conf.get("spark.sql.catalogImplementation"); note that this is a static configuration applied at cluster startup, so it cannot be changed mid-session with SET.
  3. Hive Context vs. SQL Context:

    • In your notebook, you’re using spark.sql(query) to execute the query. However, when running as a Spark job, make sure you’re using the appropriate context (Hive or SQL).
    • If you’re using Hive tables, consider using spark.sqlContext.sql(query) instead of spark.sql(query).
  4. Data Availability:

    • Check if the data exists in the specified catalog (TEST in your case). Ensure that the tables are populated with data.
    • Verify that the data is accessible from the Spark job environment (e.g., HDFS paths are accessible).
  5. Logging and Debugging:

    • Enable logging in your Spark job to capture any error messages or warnings.
    • Inspect the Spark job logs to identify any issues related to query execution or data retrieval.
  6. Query Optimization:

    • Sometimes queries that work well in a small dataset may not scale efficiently. Optimize your query to avoid unnecessary shuffling or expensive operations.
    • Consider using EXPLAIN or EXPLAIN EXTENDED to analyze the query execution plan and identify potential bottlenecks.
  7. Network and Firewall Settings:

    • Ensure that the Spark job environment has network connectivity to the necessary resources (e.g., Hive metastore, data storage).
    • Check if any firewall rules are blocking communication.
  8. Spark Version Compatibility:

    • Verify that the Spark version in your workspace matches the version used in the Spark job environment. Incompatibilities can lead to unexpected behaviour.
  9. Error Handling:

    • If the query fails, capture the error message. It might provide clues about the issue.
    • Consider wrapping your query execution in a try-catch block to handle exceptions gracefully.
  10. Test with a Simple Query:

    • Run a simple query (e.g., SELECT 1) in your Spark job environment to verify that basic functionality is working.
    • Gradually add complexity to the query to identify the specific point of failure.
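Points 5, 9, and 10 above can be combined into a small helper. This is only a sketch: `run_query_safely` is a hypothetical name, and the query runner is passed in (in a Databricks job it would be something like `lambda q: spark.sql(q).collect()`) so the error-handling pattern is independent of the cluster environment.

```python
import logging

def run_query_safely(sql_runner, query, probe="SELECT 1"):
    """Execute `query` via sql_runner; on failure, run a trivial probe
    query to distinguish a broken session from a query- or
    permission-specific problem (points 5, 9, and 10 above)."""
    try:
        return sql_runner(query)
    except Exception as exc:
        logging.error("Query failed: %s", exc)
        try:
            sql_runner(probe)
            # Probe worked, so the session itself is healthy.
            logging.error("Probe %r succeeded; suspect grants or an "
                          "empty result set rather than the environment.", probe)
        except Exception:
            logging.error("Probe %r also failed; basic session or "
                          "connectivity issue.", probe)
        raise
```

In the job task, this could be invoked as `run_query_safely(lambda q: spark.sql(q).collect(), query)` so that a failure is logged with context before the task fails.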
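The "gradually add complexity" step can also be automated: re-run the base query while adding the WHERE clauses one at a time, and report the first filter after which the result set becomes empty. Again a hedged sketch: `narrow_down_filters` is a hypothetical helper, and the runner is assumed to return a list of rows (e.g. `lambda q: spark.sql(q).collect()`).

```python
def narrow_down_filters(sql_runner, base_query, filters):
    """Apply WHERE clauses cumulatively and return the first filter
    that makes the result set empty, or None if it never empties.

    sql_runner must return a list of rows (empty list for no results),
    e.g. lambda q: spark.sql(q).collect() on Databricks."""
    clauses = []
    for f in filters:
        clauses.append(f)
        query = f"{base_query} where {' and '.join(clauses)}"
        if not sql_runner(query):
            return f
    return None
```

Run against the original query's three filters, this pinpoints whether the catalog, schema, or table-type predicate is the one the service principal's view of information_schema cannot satisfy.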
 