I am trying to read an external Iceberg database from an S3 location using the following command:
df_source = (
    spark.read.format("iceberg")
    .load(source_s3_path)
    .drop(*source_drop_columns)
    .filter(f"{date_column}<='{date_filter}'")
)
But I get the following error:
Py4JJavaError: An error occurred while calling o632.load.
: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:136)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:323)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:321)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
If I change the format to parquet in the code above, the read works but returns all history records, which is what I would like to avoid by reading the data in its original Iceberg format.
I have installed the Iceberg library iceberg-spark-runtime-3.3_2.12 on my cluster and added the following parameters to the advanced config:
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.spark_catalog.type hadoop
spark.sql.catalog.spark_catalog.warehouse /<folder for iceberg data>/
But I cannot make it work, so I am not sure whether those steps are required (I got them from an article by Dremio) or whether some other config is needed. Please let me know if this can be done.
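In case it helps to see exactly which keys I set, here is a minimal sketch that builds the same three settings as key/value pairs. The helper name `iceberg_catalog_conf` is hypothetical and only for illustration, and the warehouse value is the same placeholder as above, not a real path.

```python
def iceberg_catalog_conf(catalog_name: str, warehouse: str) -> dict:
    """Build the Spark conf entries for an Iceberg Hadoop-type catalog.

    Hypothetical helper: it only assembles the key/value pairs listed in
    the question and does not touch a Spark session.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }


# Same values as in my cluster config; the warehouse path is a placeholder.
conf = iceberg_catalog_conf("spark_catalog", "/<folder for iceberg data>/")

for key, value in conf.items():
    print(key, value)
```

As far as I understand, each pair could also be passed to `SparkSession.builder.config(key, value)` when the session is created, but I have not confirmed whether that behaves any differently from putting them in the cluster config.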