05-11-2023 04:51 AM
Hi all, I have some questions about handling many small files, and about possible improvements to our setup.
We have many small XML files. These files are first processed by another system that puts them in our data lake; as an additional step it wraps each original XML file in a JSON record with some metadata. This JSON is flat and written in JSONL format.
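For concreteness, a single JSONL record looks roughly like this (field names match the schema visible in the execution plan further down; the values here are made up, and each record is of course on one line):

```json
{"id": "doc-00001", "path": "/landing/2023/05/doc-00001.xml", "format": "xml", "checksum": "9f86d081...", "size": 48213, "timestamp": "2023-05-11T04:51:00Z", "body": "<root><Col1>...</Col1></root>"}
```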
We currently use ADF, triggered by the new files, to add them to a Delta Lake table (DT1). This table has a column for each field, one of which holds the original XML data. The XML payloads can be quite big, so this table of about 20M rows is 150GB.
The step after this table is to unpack the XML field with a UDF, producing a usable version of the data inside the XML, also as a Delta table (DT2). The problem is that the initial Delta table is extremely slow and big: just doing a count takes 5 minutes. The table is currently not partitioned. We can add partitioning, but apart from faster filtered lookups I'm not sure it helps other queries such as a count. I had a hard time finding information about which metadata Delta Lake computes, and when it is actually used.
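From what I've pieced together so far: Delta stores per-file statistics (including numRecords) in the transaction log, and on recent runtimes an unfiltered count can apparently be served from those stats, while partitioning mainly speeds up filtered queries. This is how I've been inspecting what the log knows, plus the compaction step that seems like the obvious first fix for many small files (dt1 stands in for our real table name):

```python
# file-level metadata straight from the Delta transaction log, no data scan
spark.sql("DESCRIBE DETAIL dt1").show(truncate=False)  # numFiles, sizeInBytes, partitionColumns

# bin-pack many small files into larger ones; should help scan-heavy queries
spark.sql("OPTIMIZE dt1")
```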
We have considered Auto Loader, but since DLT is in its very early stages we were not super keen on using it. The additional issue of not being able to run things interactively also seemed like a pretty ineffective way of working, but maybe I'm missing something and this is the way to go. Any thoughts on better approaches are appreciated.
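For reference, the Auto Loader variant we considered would look roughly like the sketch below; as far as I understand it also runs outside DLT in a regular notebook, so interactivity may be less of a blocker than we assumed. Paths and table names are placeholders:

```python
# hypothetical Auto Loader ingest of the JSONL files into DT1
df = (spark.readStream
      .format("cloudFiles")                     # Auto Loader source
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation",
              "abfss://container@datalake.dfs.core.windows.net/_schemas/dt1")
      .load("abfss://container@datalake.dfs.core.windows.net/bronze/landing"))

(df.writeStream
   .option("checkpointLocation",
           "abfss://container@datalake.dfs.core.windows.net/_checkpoints/dt1")
   .trigger(availableNow=True)                  # process the backlog, then stop
   .toTable("dt1"))
```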
All our steps after the initial ADF load use PySpark Structured Streaming. Just running the DT1-to-DT2 step takes about 2 days with a fairly big cluster, whereas we could process the same files with dbutils and the plain-Python variant of the UDF in about a day on a single node. So there seems to be some fundamental slowdown in the PySpark processing that I am missing.
I have a hard time reading the Spark UI and understanding what is happening and where it slows down, which has made it difficult to decide how to move forward. I have looked at the PySpark profiler but have yet to run it. It could be helpful, although confirming what I already expect, namely that the UDF is slow, won't necessarily help per se.
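In case it helps others: my understanding is that the UDF profiler needs spark.python.profile=true in the cluster's Spark config at startup, after which something like this should dump cProfile stats (untested on my side; my_xml_udf is a stand-in for our parser):

```python
# sketch only: assumes the cluster was started with spark.python.profile=true
sample = spark.table("dt1").limit(100_000)   # a sample is enough to profile
sample.select(my_xml_udf("body")).write.format("noop").mode("overwrite").save()
sc.show_profiles()                           # print accumulated cProfile output
```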
Execution plan from DT1 to DT2:
```
== Physical Plan ==
*(1) Project [pythonUDF0#24304.Col1 AS COl1#24048, ... 97 more fields]
+- BatchEvalPython [<lambda>(body#24021, path#24025)#24037], [pythonUDF0#24304]
   +- InMemoryTableScan [body#24021, path#24025], false
      +- InMemoryRelation [body#24021, checksum#24022, format#24023, id#24024, path#24025, size#24026, timestamp#24027], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) ColumnarToRow
            +- FileScan parquet [body#249,checksum#250,format#251,id#252,path#253,size#254,timestamp#255] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[abfss://container@datalake.dfs.core.windows.net/bronz..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<body:string,checksum:string,format:string,id:string,path:string,size:int,timestamp:string>
```
Looking at this now, it seems the UDF is applied in the very first stage, rather than first splitting the data across partitions and then applying the UDF. Could that be an improvement?
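To make that idea concrete, this is the direction I'm considering: repartition before the UDF so the work spreads across all cores, and swap the row-at-a-time Python UDF (the BatchEvalPython node above) for a pandas UDF, which ships rows through Arrow in batches. A sketch only; the two-column parser stands in for our real ~98-field one:

```python
import xml.etree.ElementTree as ET
import pandas as pd
from pyspark.sql.functions import pandas_udf

def parse_one(xml_str: str) -> dict:
    # stand-in for our real XML parser
    root = ET.fromstring(xml_str)
    return {"Col1": root.findtext("Col1"), "Col2": root.findtext("Col2")}

@pandas_udf("Col1 string, Col2 string")          # real schema has ~98 fields
def parse_xml(body: pd.Series) -> pd.DataFrame:
    # one Arrow batch of XML strings in, one DataFrame of parsed rows out
    return pd.DataFrame([parse_one(x) for x in body])

df2 = (spark.table("dt1")
       .repartition(512)                         # split the data before the UDF
       .select(parse_xml("body").alias("parsed"))
       .select("parsed.*"))
df2.write.format("delta").mode("overwrite").saveAsTable("dt2")
```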
05-13-2023 09:17 AM
@Bauke Brenninkmeijer:
It sounds like you are facing a few performance and scalability challenges in your current process. Here are some thoughts and suggestions:
I hope these suggestions help you optimize your processing pipeline and improve performance and scalability.
05-21-2023 11:56 PM
Hi @Bauke Brenninkmeijer,
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!