<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Medallion architecture in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/medaillon-architecture/m-p/115075#M45021</link>
    <description>&lt;P&gt;Hi patacoing,&lt;/P&gt;&lt;P&gt;The structure you described in your S3 data lake sounds more like a "pre-bronze" (raw landing) layer: because the files are in mixed formats (JSON, CSV, text, binary), Spark cannot process them in a uniform way. In Databricks, the bronze layer is usually where data first becomes readable and queryable, typically standardized into Delta format. A good approach is to use Auto Loader to ingest each file type separately by setting the correct format (for example, .format("cloudFiles").option("cloudFiles.format", "json"), and likewise for CSV and text), then write each stream into a bronze Delta table with a consistent schema. If formats are very inconsistent or unknown, you can instead store the raw content in a Delta table using a binary column plus a metadata map column that tracks file information. That lets you land everything safely and defer transformations to the silver/gold layers. So yes, Auto Loader is still relevant; you just process one format at a time, or wrap each file’s raw content. Let me know if you'd like a sample bronze setup based on your structure!&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
    <pubDate>Thu, 10 Apr 2025 02:53:00 GMT</pubDate>
    <dc:creator>Brahmareddy</dc:creator>
    <dc:date>2025-04-10T02:53:00Z</dc:date>
    <item>
      <title>Medallion architecture</title>
      <link>https://community.databricks.com/t5/data-engineering/medaillon-architecture/m-p/115066#M45020</link>
      <description>&lt;P&gt;Hello, I have an S3 data lake containing a structure of files in different formats: JSON, CSV, text, binary, ...&lt;/P&gt;&lt;P&gt;Would you consider this my bronze layer, or a "pre-bronze" layer, since it can't be processed directly by Spark (because of the mixed file formats)?&lt;BR /&gt;How am I supposed to query and transform that data with Databricks, given the different formats?&lt;/P&gt;&lt;P&gt;Should I instead first transform the data into a Delta table with columns like:&lt;/P&gt;&lt;P&gt;- metadata (map column)&lt;/P&gt;&lt;P&gt;- content (binary column)&lt;/P&gt;&lt;P&gt;In that case, would Auto Loader be relevant?&lt;/P&gt;</description>
      <pubDate>Wed, 09 Apr 2025 21:34:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/medaillon-architecture/m-p/115066#M45020</guid>
      <dc:creator>patacoing</dc:creator>
      <dc:date>2025-04-09T21:34:56Z</dc:date>
    </item>
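The table schema proposed in the question (a metadata map column plus a binary content column) can be sketched in plain Python. The column names come from the question; the metadata keys and the example path are illustrative assumptions, not an established convention:

```python
import os

def file_metadata(path: str, size_bytes: int) -> dict:
    """Build the metadata map stored next to each file's raw bytes
    in the proposed bronze Delta table (keys are illustrative)."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return {
        "path": path,
        "size_bytes": str(size_bytes),
        "source_format": ext or "unknown",
    }

# One row of the proposed table: a metadata map plus the raw file bytes.
row = {
    "metadata": file_metadata("s3://lake/landing/events/evt.json", 2048),
    "content": b"{}",  # raw file bytes would go here
}
```

Storing the format in the metadata map is what later lets silver-layer jobs route each row to the right parser.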
    <item>
      <title>Re: Medallion architecture</title>
      <link>https://community.databricks.com/t5/data-engineering/medaillon-architecture/m-p/115075#M45021</link>
      <description>&lt;P&gt;Hi patacoing,&lt;/P&gt;&lt;P&gt;The structure you described in your S3 data lake sounds more like a "pre-bronze" (raw landing) layer: because the files are in mixed formats (JSON, CSV, text, binary), Spark cannot process them in a uniform way. In Databricks, the bronze layer is usually where data first becomes readable and queryable, typically standardized into Delta format. A good approach is to use Auto Loader to ingest each file type separately by setting the correct format (for example, .format("cloudFiles").option("cloudFiles.format", "json"), and likewise for CSV and text), then write each stream into a bronze Delta table with a consistent schema. If formats are very inconsistent or unknown, you can instead store the raw content in a Delta table using a binary column plus a metadata map column that tracks file information. That lets you land everything safely and defer transformations to the silver/gold layers. So yes, Auto Loader is still relevant; you just process one format at a time, or wrap each file’s raw content. Let me know if you'd like a sample bronze setup based on your structure!&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
      <pubDate>Thu, 10 Apr 2025 02:53:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/medaillon-architecture/m-p/115075#M45021</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2025-04-10T02:53:00Z</dc:date>
    </item>
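The per-format ingestion the reply recommends can be sketched as one Auto Loader stream per file type. The extension-to-format mapping and the helper names below are assumptions for illustration; the cloudFiles options follow the pattern the reply quotes, with "binaryFile" as the catch-all for unknown or binary files:

```python
# Map file extensions to Auto Loader's cloudFiles.format values
# (mapping is an assumption; extend it for your lake's file types).
FORMAT_BY_EXT = {
    "json": "json",
    "csv": "csv",
    "txt": "text",
}

def cloudfiles_format(ext: str) -> str:
    """Pick the Auto Loader format for a file extension,
    falling back to raw bytes for anything unrecognized."""
    return FORMAT_BY_EXT.get(ext.lower(), "binaryFile")

def bronze_stream(spark, ext, landing_path, schema_location):
    """Return one Auto Loader streaming DataFrame per format,
    as the reply suggests (paths are illustrative)."""
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", cloudfiles_format(ext))
            .option("cloudFiles.schemaLocation", schema_location)
            .load(landing_path))
```

Each stream would then be written to its own bronze Delta table (for example via writeStream with a checkpoint location), keeping the schema consistent within each format.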
  </channel>
</rss>

