Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

SQL query on information_schema.tables via service principal

Sanky
New Contributor

Hi,

I have a simple Python notebook with the code below:

---

query = "select table_catalog, table_schema, table_name from system.information_schema.tables where table_type!='VIEW' and table_catalog='TEST' and table_schema='TEST'"
test = spark.sql(query)
display(test)
---
 
Running the above in my workspace returns a dataframe with the expected results. But running the same code as a task in a job that runs as a service principal returns nothing; the output is just "Query returned no results". I have ensured the service principal has the SELECT grant on system.information_schema.tables.
 
What am I missing here? Appreciate the help.
 
Thank you
 
 
1 REPLY

Kaniz_Fatma
Community Manager

Hi @Sanky, it seems you’re encountering an issue where your Spark job, running as a service principal, doesn’t return any results from the same query that works in your workspace.

Let’s troubleshoot this:

  1. Service Principal Permissions:

    • You mentioned that you’ve granted the service principal select permissions on system.information_schema.tables. Double-check that the permissions are correctly set and that there are no additional restrictions.
    • Ensure that the service principal has the necessary permissions to access the underlying data sources (e.g., HDFS, Hive, or other storage systems) that the query relies on.
  2. Hive Metastore Configuration:

    • When running Spark jobs, especially with Hive integration, ensure that the Hive metastore configuration is correctly set. The metastore stores metadata about tables, schemas, and other database objects.
    • Verify that the Spark session is configured to use the intended metastore. You can check the current setting with spark.conf.get("spark.sql.catalogImplementation"); note that this is a static configuration applied at cluster startup, so it cannot be changed mid-session with SET.
  3. Hive Context vs. SQL Context:

    • In your notebook, you’re using spark.sql(query) to execute the query. However, when running as a Spark job, make sure you’re using the appropriate context (Hive or SQL).
    • If you’re using Hive tables, consider using spark.sqlContext.sql(query) instead of spark.sql(query).
  4. Data Availability:

    • Check if the data exists in the specified catalog (TEST in your case). Ensure that the tables are populated with data.
    • Verify that the data is accessible from the Spark job environment (e.g., HDFS paths are accessible).
  5. Logging and Debugging:

    • Enable logging in your Spark job to capture any error messages or warnings.
    • Inspect the Spark job logs to identify any issues related to query execution or data retrieval.
  6. Query Optimization:

    • Sometimes queries that work well in a small dataset may not scale efficiently. Optimize your query to avoid unnecessary shuffling or expensive operations.
    • Consider using EXPLAIN or EXPLAIN EXTENDED to analyze the query execution plan and identify potential bottlenecks.
  7. Network and Firewall Settings:

    • Ensure that the Spark job environment has network connectivity to the necessary resources (e.g., Hive metastore, data storage).
    • Check if any firewall rules are blocking communication.
  8. Spark Version Compatibility:

    • Verify that the Spark version in your workspace matches the version used in the Spark job environment. Incompatibilities can lead to unexpected behaviour.
  9. Error Handling:

    • If the query fails, capture the error message. It might provide clues about the issue.
    • Consider wrapping your query execution in a try-catch block to handle exceptions gracefully.
  10. Test with a Simple Query:

    • Run a simple query (e.g., SELECT 1) in your Spark job environment to verify that basic functionality is working.
    • Gradually add complexity to the query to identify the specific point of failure.
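Points 5, 9, and 10 above can be combined into a small helper. This is only a sketch: `run_query_safely` is a hypothetical name, and the query runner is passed in (in a Databricks job it would be something like `lambda q: spark.sql(q).collect()`) so the error-handling pattern is independent of the cluster environment.

```python
import logging

def run_query_safely(sql_runner, query, probe="SELECT 1"):
    """Execute `query` via sql_runner; on failure, run a trivial probe
    query to distinguish a broken session from a query- or
    permission-specific problem (points 5, 9, and 10 above)."""
    try:
        return sql_runner(query)
    except Exception as exc:
        logging.error("Query failed: %s", exc)
        try:
            sql_runner(probe)
            # Probe worked, so the session itself is healthy.
            logging.error("Probe %r succeeded; suspect grants or an "
                          "empty result set rather than the environment.", probe)
        except Exception:
            logging.error("Probe %r also failed; basic session or "
                          "connectivity issue.", probe)
        raise
```

In the job task, this could be invoked as `run_query_safely(lambda q: spark.sql(q).collect(), query)` so that a failure is logged with context before the task fails.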
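The "gradually add complexity" step can also be automated: re-run the base query while adding the WHERE clauses one at a time, and report the first filter after which the result set becomes empty. Again a hedged sketch: `narrow_down_filters` is a hypothetical helper, and the runner is assumed to return a list of rows (e.g. `lambda q: spark.sql(q).collect()`).

```python
def narrow_down_filters(sql_runner, base_query, filters):
    """Apply WHERE clauses cumulatively and return the first filter
    that makes the result set empty, or None if it never empties.

    sql_runner must return a list of rows (empty list for no results),
    e.g. lambda q: spark.sql(q).collect() on Databricks."""
    clauses = []
    for f in filters:
        clauses.append(f)
        query = f"{base_query} where {' and '.join(clauses)}"
        if not sql_runner(query):
            return f
    return None
```

Run against the original query's three filters, this pinpoints whether the catalog, schema, or table-type predicate is the one the service principal's view of information_schema cannot satisfy.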
 