
External table from parquet partition

MBV3
New Contributor III

Hi,

I have data in parquet format in GCS buckets, partitioned by name, e.g. gs://mybucket/name=ABCD/

I am trying to create a table in Databricks as follows:

DROP TABLE IF EXISTS name_test;

CREATE TABLE name_test
USING parquet
LOCATION "gs://mybucket/name=*/";

It is loading all the columns except name!

I get the following error when trying to view the schema from the Data console:

An error occurred while fetching table: name_test

com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.util.NoSuchElementException: key not found: name

I suppose it should infer the schema automatically.

Can someone let me know what I am doing wrong?

Thanks
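
A note on the likely cause: the partition column `name` exists only in the directory path (name=ABCD), not inside the parquet files themselves, and the wildcard in the LOCATION consumes that path segment, so there is nothing left for Spark to infer the column from. A minimal check, assuming the same bucket layout as above, is to query one partition's files directly and confirm that `name` is absent from the file schema:

SELECT * FROM parquet.`gs://mybucket/name=ABCD/` LIMIT 5;
-- `name` should not show up here, because it lives only in the path, not in the files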


5 REPLIES

Pat
Honored Contributor III

Hi @M Baig,

the error doesn't tell me much, but you could try:

CREATE TABLE name_test
USING parquet
PARTITIONED BY ( name STRING)
LOCATION "gs://mybucket/";

MBV3
New Contributor III

Hi Pat

Thanks for your reply.

I tried your approach but it didn't work; I'm getting the following exception:

Error in SQL statement: AnalysisException: Cannot use all columns for partition columns
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns
    at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.org$apache$spark$sql$execution$datasources$PreprocessTableCreation$$failAnalysis(rules.scala:454)
    at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.normalizePartitionColumns(rules.scala:414)
    at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.org$apache$spark$sql$execution$datasources$PreprocessTableCreation$$normalizeCatalogTable(rules.scala:384)
    at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:323)
    at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:182)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:171)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:167)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:171)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:169)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:165)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:76)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:75)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:30)
    at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:182)
    at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:178)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$3(RuleExecutor.scala:216)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:91)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:301)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:294)
    at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:196)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:294)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:222)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:184)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:130)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:184)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:274)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:331)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:273)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:128)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:268)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:265)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:265)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:129)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:126)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:118)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:103)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:101)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:803)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:798)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:695)
    at com.databricks.backend.daemon.driver.SQLDriverLocal.$anonfun$executeSql$1(SQLDriverLocal.scala:91)
    at scala.collection.immutable.List.map(List.scala:297)
    at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:37)
    at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:145)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$12(DriverLocal.scala:631)
    at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:33)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
    at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:31)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:59)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:59)
    at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:608)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
    at scala.util.Try$.apply(Try.scala:213)
    at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
    at java.lang.Thread.run(Thread.java:748)
 

Pat
Honored Contributor III

I cannot test it now, but maybe you can try this way:

CREATE TABLE name_test
USING parquet
LOCATION "gs://mybucket/";

It might discover that the table is partitioned by `name`; I don't remember right now.

The issue with my previous statement is that you would have to specify the columns manually:

CREATE TABLE name_test
(
  id INT,
  other STRING
)
USING parquet
PARTITIONED BY (name STRING)
LOCATION "gs://mybucket/";
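
One step that may still be needed with this form, if the data already sits under gs://mybucket/name=.../: partition directories that exist only in storage are not registered automatically when the table is created, so the table can come up empty until the partitions are recovered. A minimal sketch, assuming the table was created with the statement above:

-- register the existing name=... directories with the metastore
MSCK REPAIR TABLE name_test;
-- or, equivalently in Spark SQL:
-- ALTER TABLE name_test RECOVER PARTITIONS;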

MBV3
New Contributor III
CREATE TABLE name_test
(
  id INT,
  other STRING
)
PARTITIONED BY (name STRING);

COPY INTO name_test
FROM "gs://<my_bucket>/"
FILEFORMAT = parquet;

This combination worked, after some trial and error.

Thanks for your help, Pat.

data_loader
New Contributor II

Does this create an external table?
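
As written, probably not: the accepted CREATE TABLE has no USING or LOCATION clause, so on Databricks it would normally come out as a managed table (Delta by default on recent runtimes), and COPY INTO copies the parquet files into it rather than referencing them in place. A minimal sketch of an external variant, assuming the same columns and a hypothetical path for the table to own:

CREATE TABLE name_test_ext
(
  id INT,
  other STRING
)
PARTITIONED BY (name STRING)
-- the LOCATION clause is what makes the table external; the path here is hypothetical
LOCATION "gs://my_table_bucket/name_test_ext";

COPY INTO name_test_ext
FROM "gs://<my_bucket>/"
FILEFORMAT = parquet;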

 
