07-01-2024 12:18 AM - edited 07-01-2024 12:23 AM
I have two external locations, and I have `ALL PRIVILEGES` on both of them.
I am creating a table in the first external location using the following command:
%sql
create or replace table delta.`s3://avinashkhamanekar/tmp/test_table_original12`
as
select * from range(100000)
Next, I am creating a shallow clone of this table in the second external location.
%sql
create or replace table delta.`s3://tupboto3harsh/tmp/test_table_cloned12`
shallow clone
delta.`s3://avinashkhamanekar/tmp/test_table_original12`
Both of these commands run successfully, but when I try to query the shallow-cloned table I get the following error:
%sql
select * from delta.`s3://tupboto3harsh/tmp/test_table_cloned12`
SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 22) (ip-10-96-163-239.ap-southeast-1.compute.internal executor driver): com.databricks.sql.io.FileReadException: Error while reading file s3://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet. Access Denied against cloud provider
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:724)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:691)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:818)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:510)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:501)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:2624)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:214)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:186)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:151)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:108)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:107)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:145)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:958)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:961)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:853)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkRuntimeException: [CLOUD_PROVIDER_ERROR] Cloud provider error: AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.; request: GET https://avinashkhamanekar.s3.ap-southeast-1.amazonaws.com tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.390 Linux/5.15.0-1058-aws OpenJDK_64-Bit_Server_VM/25.392-b08 java/1.8.0_392 scala/2.12.15 kotlin/1.6.0 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectRequest; Request ID: XT85JHKSN1MJM6BE, Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=, Cloud Provider: AWS, Instance ID: i-0bfe2e8c5979ba122 credentials-provider: com.amazonaws.auth.BasicAWSCredentials credential-header: AWS4-HMAC-SHA256 Credential=AKIA4MTWJV3MHJUGAQMT/20240701/ap-southeast-1/s3/aws4_request signature-present: true (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: XT85JHKSN1MJM6BE; S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=; Proxy: null) SQLSTATE: 58000
at org.apache.spark.sql.errors.QueryExecutionErrors$.cloudProviderError(QueryExecutionErrors.scala:1487)
... 49 more
Caused by: java.nio.file.AccessDeniedException: s3a://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet: open s3a://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet at 0 on s3a://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.; request: GET https://avinashkhamanekar.s3.ap-southeast-1.amazonaws.com tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.390 Linux/5.15.0-1058-aws OpenJDK_64-Bit_Server_VM/25.392-b08 java/1.8.0_392 scala/2.12.15 kotlin/1.6.0 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectRequest; Request ID: XT85JHKSN1MJM6BE, Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=, Cloud Provider: AWS, Instance ID: i-0bfe2e8c5979ba122 credentials-provider: com.amazonaws.auth.BasicAWSCredentials credential-header: AWS4-HMAC-SHA256 Credential=AKIA4MTWJV3MHJUGAQMT/20240701/ap-southeast-1/s3/aws4_request signature-present: true (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: XT85JHKSN1MJM6BE; S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=; Proxy: null), S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=:InvalidAccessKeyId
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:292)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:135)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:127)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:277)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$2(S3AInputStream.java:466)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:246)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:133)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:127)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:370)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:434)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:366)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:244)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:288)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:459)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:571)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.databricks.common.filesystem.LokiS3AInputStream.$anonfun$read$3(LokiS3FS.scala:254)
at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
at com.databricks.common.filesystem.LokiS3AInputStream.withExceptionRewrites(LokiS3FS.scala:244)
at com.databricks.common.filesystem.LokiS3AInputStream.read(LokiS3FS.scala:254)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.databricks.spark.metrics.FSInputStreamWithMetrics.$anonfun$read$3(FileSystemWithMetrics.scala:90)
at com.databricks.spark.metrics.FSInputStreamWithMetrics.withTimeAndBytesReadMetric(FileSystemWithMetrics.scala:67)
at com.databricks.spark.metrics.FSInputStreamWithMetrics.read(FileSystemWithMetrics.scala:90)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.databricks.sql.io.HDFSStorage.lambda$fetchRange$1(HDFSStorage.java:88)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at com.databricks.sql.io.HDFSStorage.submit(HDFSStorage.java:119)
at com.databricks.sql.io.HDFSStorage.fetchRange(HDFSStorage.java:82)
at com.databricks.sql.io.Storage.fetchRange(Storage.java:183)
at com.databricks.sql.io.parquet.CachingParquetFileReader$FooterByteReader.readTail(CachingParquetFileReader.java:345)
at com.databricks.sql.io.parquet.CachingParquetFileReader$FooterByteReader.read(CachingParquetFileReader.java:363)
at com.databricks.sql.io.parquet.CachingParquetFooterReader.lambda$null$1(CachingParquetFooterReader.java:231)
at com.databricks.sql.io.caching.NativeDiskCache$.get(Native Method)
at com.databricks.sql.io.caching.DiskCache.get(DiskCache.scala:568)
at com.databricks.sql.io.parquet.CachingParquetFooterReader.lambda$readFooterFromStorage$2(CachingParquetFooterReader.java:234)
at org.apache.spark.util.JavaFrameProfiler.record(JavaFrameProfiler.java:18)
at com.databricks.sql.io.parquet.CachingParquetFooterReader.readFooterFromStorage(CachingParquetFooterReader.java:214)
at com.databricks.sql.io.parquet.CachingParquetFooterReader.readFooter(CachingParquetFooterReader.java:134)
at com.databricks.sql.io.parquet.CachingParquetFileReader.readFooter(CachingParquetFileReader.java:392)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.prepare(SpecificParquetRecordReaderBase.java:162)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.apply(ParquetFileFormat.scala:415)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.apply(ParquetFileFormat.scala:259)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:608)
... 48 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.; request: GET https://avinashkhamanekar.s3.ap-southeast-1.amazonaws.com tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.390 Linux/5.15.0-1058-aws OpenJDK_64-Bit_Server_VM/25.392-b08 java/1.8.0_392 scala/2.12.15 kotlin/1.6.0 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectRequest; Request ID: XT85JHKSN1MJM6BE, Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=, Cloud Provider: AWS, Instance ID: i-0bfe2e8c5979ba122 credentials-provider: com.amazonaws.auth.BasicAWSCredentials credential-header: AWS4-HMAC-SHA256 Credential=AKIA4MTWJV3MHJUGAQMT/20240701/ap-southeast-1/s3/aws4_request signature-present: true (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: XT85JHKSN1MJM6BE; S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=; Proxy: null), S3 Extended Request ID: QonH0Konw+DJAtrVirDgZ7m60L/8fKKXPfh3xZEWFbMgUeh7sM70XWM4tu/iotJdreravwGh0U4=
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1524)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AInputStream.lambda$reopen$0(S3AInputStream.java:278)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:133)
... 90 more
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 22) (ip-10-96-163-239.ap-southeast-1.compute.internal executor driver): com.databricks.sql.io.FileReadException: Error while reading file s3://avinashkhamanekar/tmp/test_table_original12/part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet. Access Denied against cloud provider
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:724)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:691)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:818)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:510)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:501)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:2624)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
I have `ALL PRIVILEGES` on both external locations, so this error should not occur. Is this a limitation of shallow clone? Is there a documentation link that describes this limitation?
The cluster DBR version is 14.3 LTS.
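A minimal check I can run to confirm whether the source file itself is readable from my session (the path is copied from the error above; this assumes path-based reads against the external location are allowed):
# Sanity check: read the exact source file referenced in the stack trace.
# If this works, the ALL PRIVILEGES grant on the source external location is
# honored for direct path reads, and the failure is specific to reads that go
# through the shallow clone.
src_file = (
    "s3://avinashkhamanekar/tmp/test_table_original12/"
    "part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet"
)
display(spark.read.parquet(src_file).limit(10))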
07-01-2024 03:03 PM
Hello,
This is an underlying exception that would occur with any SQL statement that requires access to this file:
part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet
It looks like the Delta log is referencing a file that doesn't exist anymore. This could happen if the file was removed by vacuum or if the file was manually deleted.
Could you please share the results of the commands below?
# Check added files
from pyspark.sql import functions as f

(
    spark
    .read
    .json("s3://avinashkhamanekar/tmp/test_table_original12/_delta_log/",
          pathGlobFilter="*.json")
    # Keep track of which commit JSON each action came from
    .withColumn("filename", f.input_file_name())
    .where("add.path LIKE '%part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet%'")
    .withColumn("timestamp", f.expr("from_unixtime(add.modificationTime / 1000)"))
    .select(
        "filename",
        "timestamp",
        "add",
    )
).display()
and
# Check removed files
from pyspark.sql import functions as f

(
    spark
    .read
    .json("s3://avinashkhamanekar/tmp/test_table_original12/_delta_log/",
          pathGlobFilter="*.json")
    # Keep track of which commit JSON each action came from
    .withColumn("filename", f.input_file_name())
    .where("remove.path LIKE '%part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet%'")
    .withColumn("timestamp", f.expr("from_unixtime(remove.deletionTimestamp / 1000)"))
    .select(
        "filename",
        "timestamp",
        "remove"
    )
).display()
07-03-2024 03:21 AM - edited 07-03-2024 03:22 AM
I got the following error while running both commands:
[UC_COMMAND_NOT_SUPPORTED.WITH_RECOMMENDATION] The command(s): input_file_name are not supported in Unity Catalog. Please use _metadata.file_path instead. SQLSTATE: 0AKUC
Could you please also try this as a POC at your end using two different external locations?
07-03-2024 03:31 AM - edited 07-03-2024 03:33 AM
Also, the delta log couldn't have been deleted (or vacuumed), since I am running all three commands one after another.
07-19-2024 09:06 AM - edited 07-19-2024 09:08 AM
Regarding your statements:
[UC_COMMAND_NOT_SUPPORTED.WITH_RECOMMENDATION] The command(s): input_file_name are not supported in Unity Catalog. Please use _metadata.file_path instead. SQLSTATE: 0AKUC
Could you please also try this as a POC at your end using two different external locations?
input_file_name is a reserved function and used to be available on older DBRs. In Databricks SQL and Databricks Runtime 13.3 LTS and above this function is deprecated; please use _metadata.file_name instead. Source: input_file_name function.
I can try to set up a POC, but first let's make sure we are not facing any other exceptions.
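If it helps, here is a rough sketch of the "added files" check rewritten to use the metadata column instead of input_file_name (untested on my side; it assumes the hidden _metadata column is exposed for JSON reads on your DBR). The same swap applies to the "removed files" version.
# Sketch: same delta-log check, with _metadata.file_path instead of input_file_name
from pyspark.sql import functions as f

(
    spark
    .read
    .json("s3://avinashkhamanekar/tmp/test_table_original12/_delta_log/",
          pathGlobFilter="*.json")
    # _metadata.file_path replaces input_file_name, which Unity Catalog blocks
    .withColumn("filename", f.col("_metadata.file_path"))
    .where("add.path LIKE '%part-00000-36ee2e95-cfb1-449b-a986-21657cc01b22-c000.snappy.parquet%'")
    .withColumn("timestamp", f.expr("from_unixtime(add.modificationTime / 1000)"))
    .select("filename", "timestamp", "add")
).display()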
Also the delta log couldn't have been deleted (or vaccumed) since I am running all three commands one by one.
VACUUM shouldn't be the cause of a _delta_log folder deletion, as it skips all directories that begin with an underscore (_), which includes _delta_log. Source: Vacuum a Delta table.
Please let me know if you're able to progress with your implementation.
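If you want to double-check that, the table history should also show whether a VACUUM was ever recorded against the source table; a quick sketch:
# Sketch: list the operations recorded in the source table's delta log;
# a past VACUUM would appear in the "operation" column.
spark.sql(
    "DESCRIBE HISTORY delta.`s3://avinashkhamanekar/tmp/test_table_original12`"
).select("version", "timestamp", "operation", "operationParameters").display()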