Unable to load Parquet file using Autoloader. Can someone help?

Mayank — Sun, 26 Jun 2022 21:54:28 GMT

I am trying to load Parquet files using Autoloader. Below is the code:

def autoload_to_table (data_source, source_format, table_name, checkpoint_path):
    query = (spark.readStream
                  .format('cloudFiles')
                  .option('cloudFiles.format', source_format)
                  .schema("VendorID long,tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count long, trip_distance long, RateCodeID long,  Store_and_fwd_flag string,PULocationID int, DOLocationID long, payment_type long, fare_amount long, extra long, mta_tax long,Tip_amount long, tolls_amount long, improvement_surcharge long,  total_amount long, congestion_Surcharge long, airport_fee long ")
                  .option('cloudFiles.schemaLocation', checkpoint_path)
                  .load(data_source)
                  .writeStream
                  .option('checkpointLocation', checkpoint_path)
                  .option('mergeSchema', "true")
                  .table(table_name)
            )
    
    return query
 
query = autoload_to_table (data_source = "/mnt/landing/nyctaxi",
                           source_format = "parquet",
                           table_name = "yellow_trip_data",
                           checkpoint_path='/tmp/delta/yellowdata/_checkpoints'
                          )

However, I run into the following error. I have also attached the IPython notebook.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3011.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3011.0 (TID 11673) (10.139.64.5 executor 0): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary

at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)

at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:54)

Re: Unable to load Parquet file using Autoloader. Can someone help?

-werners- — Mon, 27 Jun 2022 07:19:24 GMT

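It could be an incompatible schema. The stack trace shows a dictionary of doubles (PlainDoubleDictionary) being decoded with decodeToLong, which is what happens when a column declared as long in the schema is stored as double in the Parquet file.

There is a knowledge base article about that.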
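To narrow down which columns are affected, one option is to compare the declared schema string against the types Spark actually reads from the file. The following is only a minimal sketch: `parse_flat_ddl` and `find_type_mismatches` are hypothetical helper names, the parser only handles flat (non-nested) schemas, and the sample values are illustrative rather than taken from the real file. With Spark available, the second argument would come from `spark.read.parquet(path).dtypes`.

```python
# DataFrame.dtypes reports long columns as "bigint", so normalise aliases first.
ALIASES = {"long": "bigint", "integer": "int"}


def parse_flat_ddl(ddl):
    """Parse 'name type, name type' into a dict. Flat schemas only."""
    columns = {}
    for part in ddl.split(","):
        part = part.strip()
        if not part:
            continue
        name, dtype = part.split(None, 1)
        dtype = dtype.strip().lower()
        columns[name] = ALIASES.get(dtype, dtype)
    return columns


def find_type_mismatches(declared_ddl, actual_dtypes):
    """Return (column, declared type, actual type) for every disagreement."""
    declared = parse_flat_ddl(declared_ddl)
    mismatches = []
    for name, actual in actual_dtypes:
        expected = declared.get(name)
        if expected is not None and expected != actual.lower():
            mismatches.append((name, expected, actual))
    return mismatches


# Illustrative values only, not the real file's schema.
declared = "VendorID long, trip_distance long, fare_amount long"
actual = [("VendorID", "bigint"), ("trip_distance", "double"), ("fare_amount", "double")]
print(find_type_mismatches(declared, actual))
```

Any column that shows up in the result is a candidate for the decodeToLong failure and should be declared with the type found in the file.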

Re: Unable to load Parquet file using Autoloader. Can someone help?

Hubert-Dudek — Mon, 27 Jun 2022 14:50:05 GMT

As @Werner Stinckens said.

Just load your file the normal way (spark.read.parquet) without specifying a schema, and then extract the DDL:

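# Let Spark infer the schema from the Parquet footer and serialize it to JSON
schema_json = spark.read.parquet("your_file.parquet").schema.json()
# Rebuild the schema on the JVM side and render it as a DDL string
ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()
print(ddl)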
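If reaching into `_jvm` is not desirable, a similar DDL string can probably be assembled in plain Python from `DataFrame.dtypes`, which returns (column name, type) pairs. This is a sketch under assumptions: `dtypes_to_ddl` is a hypothetical helper, the sample pairs are illustrative rather than the real file's schema, and column names that themselves contain backticks are not handled.

```python
def dtypes_to_ddl(dtypes):
    """Build a DDL string from (column name, type) pairs such as DataFrame.dtypes."""
    # Backticks keep column names with spaces or punctuation valid in DDL.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in dtypes)


# Illustrative pairs; with Spark, pass spark.read.parquet("your_file.parquet").dtypes
sample = [
    ("VendorID", "bigint"),
    ("tpep_pickup_datetime", "timestamp"),
    ("trip_distance", "double"),
]
print(dtypes_to_ddl(sample))
```

The resulting string can then be passed to .schema(...) in the Autoloader reader in place of the hand-written one.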

Re: Unable to load Parquet file using Autoloader. Can someone help?

Mayank — Mon, 27 Jun 2022 15:25:32 GMT

Smart idea. Let me try this one. @Hubert Dudek

Re: Unable to load Parquet file using Autoloader. Can someone help?

Mayank — Mon, 27 Jun 2022 15:45:01 GMT

This ran!!! You are awesome, @Hubert Dudek

Re: Unable to load Parquet file using Autoloader. Can someone help?

Anonymous — Mon, 27 Jun 2022 17:15:21 GMT

Hey @Mayank Srivastava

Hope you are well!

We are happy to know that you were able to resolve your issue. It would be really awesome if you could mark the answer as best. It would be really helpful for the other members too.

Cheers!

Re: Unable to load Parquet file using Autoloader. Can someone help?

Mayank — Mon, 27 Jun 2022 17:49:56 GMT

Done !!

Re: Unable to load Parquet file using Autoloader. Can someone help?

Anonymous — Mon, 27 Jun 2022 18:06:59 GMT

Hi again @Mayank Srivastava

Thank you so much for getting back to us and marking the answer as best.

We really appreciate your time.

Wish you a great Databricks journey ahead!

Re: Unable to load Parquet file using Autoloader. Can someone help?

Hubert-Dudek — Tue, 28 Jun 2022 13:50:12 GMT

Great! Thank you.