FFmpeg frame extraction explodes memory, how to mitigate?

DataBRObin
New Contributor III

For a computer vision project, my raw data consists of encrypted videos (60 fps) stored in Azure Blob Storage. To make the data usable for model training, I need to do some preprocessing, and for that I need each video split into individual frames. The videos are encrypted, but I can pass the encryption key to FFmpeg to decrypt them. I've found a way to pipe FFmpeg's output (the individual frames) to stdout, which can then be picked up through the Python ffmpeg library. The issue is that even for a few minutes of video, a cluster with 112 GB of RAM runs into OOM errors.
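For context, the stdout-piping approach I use looks roughly like the sketch below (helper names are my own; `-decryption_key` is FFmpeg's mov/mp4 demuxer option for encrypted files, and reading one frame at a time from the pipe avoids buffering the whole video):

```python
import subprocess

def build_ffmpeg_cmd(video_path, key_hex, width, height):
    # Sketch: decode an encrypted MP4 and stream raw RGB frames to stdout.
    # -decryption_key is the mov/mp4 demuxer option for encrypted input.
    return [
        "ffmpeg",
        "-decryption_key", key_hex,
        "-i", video_path,
        "-f", "rawvideo",
        "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}",
        "pipe:1",
    ]

def iter_frames(cmd, width, height):
    # Yield one frame at a time instead of materializing them all:
    # rgb24 uses 3 bytes per pixel.
    frame_size = width * height * 3
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        while True:
            buf = proc.stdout.read(frame_size)
            if len(buf) < frame_size:
                break
            yield buf
    finally:
        proc.stdout.close()
        proc.wait()
```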

What I've tried:

  1. Original approach: using foreach on an RDD with the video locations to extract the frames and write them to a new location in blob storage. This works, but it's quite slow, and the huge number of small image files makes this approach prohibitively expensive on blob storage (which my boss obviously doesn't like).
  2. A pandas UDF in Python returning the images. This works, but it sometimes runs out of memory even on a single minute of video.
  3. Adding columns with a start and end time for each video (e.g. a 1-minute video split into 5-second parts). My thought was that this would reduce the memory used per row sent into the pandas UDF, but it seems FFmpeg still scans through the whole video and thus still uses quite a lot of memory.
  4. Running everything in pandas. This works more reliably, but it uses a lot of memory (pandas DataFrames are held fully in memory), and converting the pandas DataFrame into a PySpark DataFrame takes additional memory and time, also making it a non-ideal option.
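One note on attempt 3: FFmpeg scanning the whole file is the usual symptom of `-ss` being passed as an output option. When `-ss` (and `-t`) are placed before `-i`, FFmpeg seeks in the input instead of decoding from the start, so work per row stays proportional to the segment. A hedged sketch (helper name hypothetical):

```python
def build_segment_cmd(video_path, key_hex, start_s, end_s):
    # Placing -ss/-t BEFORE -i makes ffmpeg seek in the input
    # (near-instant for most containers) rather than decode from the
    # beginning, so per-segment memory/CPU no longer scales with the
    # full video length.
    return [
        "ffmpeg",
        "-ss", str(start_s),          # input-side seek
        "-t", str(end_s - start_s),   # read only this many seconds
        "-decryption_key", key_hex,
        "-i", video_path,
        "-f", "rawvideo",
        "-pix_fmt", "rgb24",
        "pipe:1",
    ]
```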

What I want:

A way to extract frames from a video and pipe the resulting frames into a PySpark DataFrame for further processing, without needing a huge cluster that can only preprocess a few minutes of video at a time (which makes it prohibitively expensive). I am very open to different libraries and configurations; anything that lets me do this task is something I want to try!

Sample row of data:

container_name: "sample_container_name"
filename: "/dbfs/mnt/.../.../......./video.mp4x"
duration: 25.6
height: 1080
width: 1920
# tried with intervals (in seconds; a 5-second interval at 60 fps = 300 frames per interval)
start: 0
end: 5


-werners-
Esteemed Contributor III

ffmpeg does not run in a distributed manner, AFAIK, so it will run on the driver.

If you want to run ffmpeg in parallel, you'd have to find some distributed version of it. I have found some GitHub projects, but they seem pretty dead.

Perhaps you can take a look into OpenCV.

Some time ago I spoke with people who do AI (computer vision) on video, and they used Nvidia Deepstream. Might be worth looking into.

DataBRObin
New Contributor III

I believe that's true as well, since I'm just using the Python library wrapper around it. If there were a way to run it sequentially on the driver but hand each processed row off to the executors and materialize it outside of memory, I'd be fine with that (my main problem is memory usage, not speed/CPU utilization), but I don't know if there's a way to do this.

I'll have a look at OpenCV, though last time I checked I couldn't get the decryption to work in there.

Deepstream seems a completely end-to-end system and since I have some bespoke preprocessing to do I don't think I can make that work (easily).

Thanks for your answer!

DataBRObin
New Contributor III

In the end, I managed to get it to work. The way I did it: have only one executor per cluster node (set spark.executor.cores to the node's core count minus 2 and executor memory as high as allowed), repartition to more partitions than there are rows in the table, and limit the number of videos processed at a time. If someone has an idea that allows for more parallelization than my solution, let me know!
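The workaround above could be sketched roughly like this (the helper and the exact conf values are illustrative, not my literal job code):

```python
def plan_batches(video_paths, videos_per_batch):
    # Process only a handful of videos per Spark job so peak memory
    # stays bounded; each batch is then repartitioned to at least one
    # partition per video.
    for i in range(0, len(video_paths), videos_per_batch):
        yield video_paths[i:i + videos_per_batch]

# Cluster-side settings used alongside the batching (values illustrative):
#   spark.executor.cores   = <node cores - 2>    # one fat executor per node
#   spark.executor.memory  = <most of node RAM>
# and per batch:
#   df = spark.createDataFrame(batch_rows).repartition(len(batch_rows))
```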

Vidula
Honored Contributor

Hi @Robin G​ 

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. 

We'd love to hear from you.

Thanks!

DataBRObin
New Contributor III

Hi @Vidula Khanna​ 

Hope you're doing well too! I haven't really been able to solve my issue besides the workaround mentioned above, processing a very low number of videos at a time with a large VM, so I still consider this issue unresolved (and I don't like marking my own answer as best, especially when it's so far from perfect 🙂). If you can provide some pointers to a better solution, that would of course be very welcome!

DataBRObin
New Contributor III

In the end, I decided to restructure the workflow to be as efficient as I could make it:

  1. Extract frames from the video files with ffmpeg in a containerized application and store the resulting frames in parquet files in blob storage (lower IO fees than millions of small image files).
  2. Run preprocessing on the parquet files of extracted images in Databricks.
  3. Store the results in parquet files in blob storage.
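Step 1 (frames into parquet) could look something like this sketch; the row schema and helper names are my own assumptions, and the pyarrow import is kept inside the writer so the row helper stays dependency-free:

```python
def frame_to_row(video_id, frame_index, fps, jpeg_bytes):
    # Hypothetical row schema: one frame per parquet row, the image
    # stored as an encoded binary column to keep files compact.
    return {
        "video_id": video_id,
        "frame_index": frame_index,
        "timestamp_s": frame_index / fps,
        "image": jpeg_bytes,
    }

def write_frames_parquet(rows, out_path):
    # Lazy import: only the writer needs pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq
    table = pa.Table.from_pylist(rows)
    pq.write_table(table, out_path, compression="zstd")
```

Batching many frames into a few large parquet files is what brings the blob-storage transaction costs down compared to one blob per frame.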

I was hoping there would be a less roundabout way, but this is how I managed to solve it in a fairly cost-effective way.
