Auto Loader's default behavior of sorting columns lexicographically during schema inference is indeed a limitation when preserving the original order of JSON fields is important. Unfortunately, there isn't a built-in option in Auto Loader to maintain the original column order from JSON files while using automatic schema inference. The short sketch below illustrates the sorting behavior; after that, there are a few workarounds you can consider.
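A minimal way to see the effect, assuming a Databricks notebook where spark is already defined (this uses the batch JSON reader, whose schema inference sorts fields the same way; the sample record is hypothetical):

df = spark.read.json(spark.sparkContext.parallelize(
    ['{"colB": 1, "colC": 2, "colA": 3}']
))
df.printSchema()
# root
#  |-- colA: long (nullable = true)  <- alphabetical, not the original order
#  |-- colB: long (nullable = true)
#  |-- colC: long (nullable = true)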
1. Explicitly Define the Schema
While this approach doesn't fully leverage Auto Loader's schema inference capabilities, it allows you to maintain control over the column order:
import dlt
from pyspark.sql.types import StructType, StructField, IntegerType

# Declare the columns in exactly the order you want them to appear
schema = StructType([
    StructField("colB", IntegerType(), True),
    StructField("colC", IntegerType(), True),
    StructField("colA", IntegerType(), True)
])

@dlt.table(table_properties={'quality': 'bronze'})
def my_table():
    return (
        spark.readStream.format('cloudFiles')
        .option('cloudFiles.format', 'json')
        .schema(schema)
        .load('s3://my_bucket/my_table/')
    )
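As a side note, .schema() also accepts a DDL string, which can be a more compact way to pin the same ordering; a minimal variant of the example above (the bucket path and column types are assumptions):

import dlt

@dlt.table(table_properties={'quality': 'bronze'})
def my_table():
    # Column order is taken verbatim from the DDL string
    return (
        spark.readStream.format('cloudFiles')
        .option('cloudFiles.format', 'json')
        .schema("colB INT, colC INT, colA INT")
        .load('s3://my_bucket/my_table/')
    )

The trade-off of either form is that an explicit schema opts you out of automatic inference, so new fields in the JSON generally won't be picked up unless you update the schema yourself.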
2. Use a Post-Processing Step
You can leverage Auto Loader's schema inference and then reorder the columns in a subsequent step:
import dlt
from pyspark.sql.functions import col

@dlt.table(table_properties={'quality': 'bronze'})
def my_table():
    # Let Auto Loader infer the schema as usual
    df = (
        spark.readStream.format('cloudFiles')
        .option('cloudFiles.format', 'json')
        .load('s3://my_bucket/my_table/')
    )
    # Define the desired column order
    desired_order = ["colB", "colC", "colA"]
    # Reorder the inferred columns to match it
    return df.select([col(c) for c in desired_order])
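One caveat with a hard-coded select: if schema evolution later adds columns, they would be silently dropped. A small defensive variant (replacing the final select above) keeps the preferred columns first and appends anything else Auto Loader inferred:

    # Keep the preferred columns first; append any newly inferred ones
    preferred = ["colB", "colC", "colA"]
    remaining = [c for c in df.columns if c not in preferred]
    return df.select(preferred + remaining)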