Data Engineering

Accessing shallow cloned data through an External location fails

dream
Contributor

I have two external locations. On both of these locations I have `ALL PRIVILEGES` access.

I am creating a table in the first external location using the following command:

%sql
create or replace table delta.`s3://avinashkhamanekar/tmp/test_table_original12`
as
select * from range(100000)

Next, I am creating a shallow clone of this table in the second external location.

%sql
create or replace table delta.`s3://tupboto3harsh/tmp/test_table_cloned12`
shallow clone
delta.`s3://avinashkhamanekar/tmp/test_table_original12`

 

Both of these commands run successfully, but when I try to query the shallow-cloned table I get the following error:

%sql
select * from delta.`s3://tupboto3harsh/tmp/test_table_cloned12`

SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 22) (ip-10-96-163-239.ap-southeast-1.compute.internal executor driver): com.databricks.sql.io.FileReadException: Error while reading file s3://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet. Access Denied against cloud provider
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:724)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:691)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:818)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:510)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:501)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:2624)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:214)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:186)
	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:151)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
	at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103)
	at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:108)
	at scala.util.Using$.resource(Using.scala:269)
	at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:107)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:145)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:958)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:961)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:853)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkRuntimeException: [CLOUD_PROVIDER_ERROR] Cloud provider error: AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.; request: GET https://avinashkhamanekar.s3.ap-southeast-1.amazonaws.com tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.390 Linux/5.15.0-1058-aws OpenJDK_64-Bit_Server_VM/25.392-b08 java/1.8.0_392 scala/2.12.15 kotlin/1.6.0 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectRequest; Request ID: XT85JHKSN1MJM6BE, Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=, Cloud Provider: AWS, Instance ID: i-0bfe2e8c5979ba122 credentials-provider: com.amazonaws.auth.BasicAWSCredentials credential-header: AWS4-HMAC-SHA256 Credential=AKIA4MTWJV3MHJUGAQMT/20240701/ap-southeast-1/s3/aws4_request signature-present: true (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: XT85JHKSN1MJM6BE; S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=; Proxy: null) SQLSTATE: 58000
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cloudProviderError(QueryExecutionErrors.scala:1487)
	... 49 more
Caused by: java.nio.file.AccessDeniedException: s3a://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet: open s3a://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet at 0 on s3a://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.; request: GET https://avinashkhamanekar.s3.ap-southeast-1.amazonaws.com tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.390 Linux/5.15.0-1058-aws OpenJDK_64-Bit_Server_VM/25.392-b08 java/1.8.0_392 scala/2.12.15 kotlin/1.6.0 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectRequest; Request ID: XT85JHKSN1MJM6BE, Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=, Cloud Provider: AWS, Instance ID: i-0bfe2e8c5979ba122 credentials-provider: com.amazonaws.auth.BasicAWSCredentials credential-header: AWS4-HMAC-SHA256 Credential=AKIA4MTWJV3MHJUGAQMT/20240701/ap-southeast-1/s3/aws4_request signature-present: true (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: XT85JHKSN1MJM6BE; S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=; Proxy: null), S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=:InvalidAccessKeyId
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:292)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:135)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:127)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:277)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$2(S3AInputStream.java:466)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:246)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:133)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:127)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:370)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:434)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:366)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:244)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:288)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:459)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:571)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at com.databricks.common.filesystem.LokiS3AInputStream.$anonfun$read$3(LokiS3FS.scala:254)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at com.databricks.common.filesystem.LokiS3AInputStream.withExceptionRewrites(LokiS3FS.scala:244)
	at com.databricks.common.filesystem.LokiS3AInputStream.read(LokiS3FS.scala:254)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at com.databricks.spark.metrics.FSInputStreamWithMetrics.$anonfun$read$3(FileSystemWithMetrics.scala:90)
	at com.databricks.spark.metrics.FSInputStreamWithMetrics.withTimeAndBytesReadMetric(FileSystemWithMetrics.scala:67)
	at com.databricks.spark.metrics.FSInputStreamWithMetrics.read(FileSystemWithMetrics.scala:90)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at com.databricks.sql.io.HDFSStorage.lambda$fetchRange$1(HDFSStorage.java:88)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at com.databricks.sql.io.HDFSStorage.submit(HDFSStorage.java:119)
	at com.databricks.sql.io.HDFSStorage.fetchRange(HDFSStorage.java:82)
	at com.databricks.sql.io.Storage.fetchRange(Storage.java:183)
	at com.databricks.sql.io.parquet.CachingParquetFileReader$FooterByteReader.readTail(CachingParquetFileReader.java:345)
	at com.databricks.sql.io.parquet.CachingParquetFileReader$FooterByteReader.read(CachingParquetFileReader.java:363)
	at com.databricks.sql.io.parquet.CachingParquetFooterReader.lambda$null$1(CachingParquetFooterReader.java:231)
	at com.databricks.sql.io.caching.NativeDiskCache$.get(Native Method)
	at com.databricks.sql.io.caching.DiskCache.get(DiskCache.scala:568)
	at com.databricks.sql.io.parquet.CachingParquetFooterReader.lambda$readFooterFromStorage$2(CachingParquetFooterReader.java:234)
	at org.apache.spark.util.JavaFrameProfiler.record(JavaFrameProfiler.java:18)
	at com.databricks.sql.io.parquet.CachingParquetFooterReader.readFooterFromStorage(CachingParquetFooterReader.java:214)
	at com.databricks.sql.io.parquet.CachingParquetFooterReader.readFooter(CachingParquetFooterReader.java:134)
	at com.databricks.sql.io.parquet.CachingParquetFileReader.readFooter(CachingParquetFileReader.java:392)
	at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.prepare(SpecificParquetRecordReaderBase.java:162)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.apply(ParquetFileFormat.scala:415)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.apply(ParquetFileFormat.scala:259)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:608)
	... 48 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.; request: GET https://avinashkhamanekar.s3.ap-southeast-1.amazonaws.com tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.390 Linux/5.15.0-1058-aws OpenJDK_64-Bit_Server_VM/25.392-b08 java/1.8.0_392 scala/2.12.15 kotlin/1.6.0 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectRequest; Request ID: XT85JHKSN1MJM6BE, Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=, Cloud Provider: AWS, Instance ID: i-0bfe2e8c5979ba122 credentials-provider: com.amazonaws.auth.BasicAWSCredentials credential-header: AWS4-HMAC-SHA256 Credential=AKIA4MTWJV3MHJUGAQMT/20240701/ap-southeast-1/s3/aws4_request signature-present: true (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: XT85JHKSN1MJM6BE; S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=; Proxy: null), S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1524)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.lambda$reopen$0(S3AInputStream.java:278)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:133)
	... 90 more

I have all the access permissions on both external locations, so this error should not occur. Is this a limitation of shallow clone? Is there a documentation link for this limitation?

The cluster DBR version is 14.3 LTS.
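For reference, the clone's own transaction log can be read directly to see which data files it points at. Below is a minimal sketch (using the clone path from above; a shallow clone's add actions record absolute paths to the source table's data files, so reading the clone still needs access to the original location):

# Sketch: inspect the shallow clone's transaction log and list the data files it references.
# The path is the clone location used above; adjust if needed.
clone_log = "s3://tupboto3harsh/tmp/test_table_cloned12/_delta_log/"

(
    spark.read.json(clone_log, pathGlobFilter="*.json")
    .where("add IS NOT NULL")   # keep only commits that add data files
    .select("add.path")         # for a shallow clone these resolve to the source table's files
).display()

The Access Denied above is raised on a file under the original bucket, which is consistent with the clone referencing the source table's files rather than copying them.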

 


4 REPLIES

raphaelblg
Databricks Employee

Hello,

This is an underlying exception that should occur with any SQL statement that requires access to this file:

part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet

It looks like the Delta log is referencing a file that doesn't exist anymore. This could happen if the file was removed by vacuum or if the file was manually deleted.

Could you please share the results of the commands below?

# Check added files
from pyspark.sql import functions as f

(
    spark
    .read
    .json("s3://avinashkhamanekar/tmp/test_table_original12/_delta_log/",
          pathGlobFilter="*.json")
    .withColumn("filename", f.input_file_name())
    .where("add.path LIKE '%part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet%'")
    .withColumn("timestamp", f.expr("from_unixtime(add.modificationTime / 1000)"))
    .select(
        "filename",
        "timestamp",
        "add",
    )
).display()

and

# Check removed files
(
    spark
    .read
    .json("s3://avinashkhamanekar/tmp/test_table_original12/_delta_log/",
          pathGlobFilter="*.json")
    .withColumn("filename", f.input_file_name())
    .where("remove.path LIKE '%part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet%'")
    .withColumn("timestamp", f.expr("from_unixtime(remove.deletionTimestamp / 1000)"))
    .select(
        "filename",
        "timestamp",
        "remove"
    )
).display()

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

dream
Contributor

I got these errors while running both commands:


[UC_COMMAND_NOT_SUPPORTED.WITH_RECOMMENDATION] The command(s): input_file_name are not supported in Unity Catalog. Please use _metadata.file_path instead. SQLSTATE: 0AKUC

Could you please also try this as a POC at your end using two different external locations?


Also, the Delta log couldn't have been deleted (or vacuumed), since I am running all three commands one by one.

raphaelblg
Databricks Employee

Regarding your statements:

 

[UC_COMMAND_NOT_SUPPORTED.WITH_RECOMMENDATION] The command(s): input_file_name are not supported in Unity Catalog. Please use _metadata.file_path instead. SQLSTATE: 0AKUC

Could you please also try this as a POC at your end using two different external locations?

input_file_name is a reserved function and used to be available on older DBRs. In Databricks SQL and Databricks Runtime 13.3 LTS and above, this function is deprecated; please use _metadata.file_name instead. Source: input_file_name function.
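For reference, a rough rework of the earlier "check added files" snippet that avoids input_file_name might look like this (untested sketch; same _delta_log path and file name as before, with _metadata.file_path standing in for the removed function):

# Sketch: the "check added files" query without input_file_name, so it can run on
# Unity Catalog clusters. _metadata.file_path is the full path of the JSON commit
# file each row was read from.
from pyspark.sql import functions as f

log_path = "s3://avinashkhamanekar/tmp/test_table_original12/_delta_log/"

(
    spark.read.json(log_path, pathGlobFilter="*.json")
    .withColumn("filename", f.col("_metadata.file_path"))
    .where("add.path LIKE '%part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet%'")
    .withColumn("timestamp", f.expr("from_unixtime(add.modificationTime / 1000)"))
    .select("filename", "timestamp", "add")
).display()

The "check removed files" snippet can be adapted the same way, swapping add for remove.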

I can try to set up a POC, but before that let's make sure we are not facing any other exceptions.

 

Also, the Delta log couldn't have been deleted (or vacuumed), since I am running all three commands one by one.

Yes, the _delta_log folder could have been deleted, but this would throw a different exception. A _delta_log deletion won't be performed by any Databricks service unless a query is run to delete the table or the folder is manually removed. VACUUM shouldn't be the cause of a _delta_log folder deletion, as it skips all directories that begin with an underscore (_), which includes _delta_log. Source: Vacuum a Delta table.
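If you want to double-check that the log folder is intact, listing it is enough (a quick sketch; dbutils is available in Databricks notebooks, and the path is the source table used above):

# Sketch: list the source table's _delta_log folder to confirm it still exists and
# contains the JSON commit files.
log_path = "s3://avinashkhamanekar/tmp/test_table_original12/_delta_log/"

for entry in dbutils.fs.ls(log_path):
    print(entry.name, entry.size)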

 Please let me know if you're able to progress with your implementation.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
