08-04-2022 01:02 AM
For a computer vision project, my raw data consists of encrypted videos (60 fps) stored in Azure Blob Storage. To make the data usable for model training, I need to do some preprocessing, and for that I need the videos split into individual frames. The videos are encrypted, but I can pass the decryption key to FFmpeg so it can decrypt the file. I've found a way to pipe FFmpeg's output (individual frames) to stdout, where it can be picked up through the python ffmpeg library. The issue is that even for a few minutes of video, a cluster with 112 GB of RAM already runs into OOM errors.
What I've tried:
What I want:
A way to extract frames from a video and pipe the resulting frames into a PySpark DataFrame for further processing, without needing a huge cluster that can only preprocess a few minutes of video at a time (which makes it prohibitively expensive). I am very open to using different libraries and different configurations; anything that lets me do this task is something I want to try!
Sample of row of data:
container_name: "sample_container_name"
filename: "/dbfs/mnt/.../.../......./video.mp4x"
duration: 25.6
height: 1080
width: 1920
# tried with intervals (in seconds; at 60 fps, a 5-second interval = 300 frames)
start: 0
end: 5
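For reference, the stdout-pipe approach looks roughly like this. This is a minimal sketch, not my exact code: the paths are placeholders, `make_intervals` is a helper I added for illustration, and I'm assuming the file is CENC-encrypted MP4 so FFmpeg's `decryption_key` option applies.

```python
def make_intervals(duration, step):
    """Split [0, duration) into (start, end) chunks of `step` seconds."""
    intervals, start = [], 0.0
    while start < duration:
        intervals.append((start, min(start + step, duration)))
        start += step
    return intervals

def iter_frames(path, key, width, height, start, end):
    """Yield raw rgb24 frames one at a time instead of collecting them all."""
    import ffmpeg  # the ffmpeg-python wrapper; imported lazily so the helper above works without it
    frame_size = width * height * 3  # bytes per rgb24 frame
    process = (
        ffmpeg
        .input(path, ss=start, t=end - start, decryption_key=key)
        .output('pipe:', format='rawvideo', pix_fmt='rgb24')
        .run_async(pipe_stdout=True, quiet=True)
    )
    while True:
        chunk = process.stdout.read(frame_size)
        if len(chunk) < frame_size:
            break
        yield chunk  # hand off / write out immediately, don't accumulate
    process.wait()
```

Reading one frame at a time from the pipe (rather than buffering the whole stream) keeps only a single frame in memory at once; at 1920x1080 that's about 6 MB per rgb24 frame.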
08-04-2022 03:31 AM
ffmpeg does not run in a distributed manner afaik, so it will run on the driver.
If you want to run ffmpeg in parallel, you would have to find some distributed version of it. I have found some GitHub projects, but they seem pretty dead.
Perhaps you can take a look into OpenCV.
Some time ago I spoke with people who do AI (computer vision) on video, and they used Nvidia Deepstream. Might be worth looking into.
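The OpenCV route could look something like the sketch below. Note the caveat: `cv2.VideoCapture` has no hook for passing a decryption key, so this assumes the video has already been decrypted to a (temporary) plain file first; `kept_indices` is just an illustrative helper showing which frames a sampling step keeps.

```python
def kept_indices(total_frames, every_n):
    """Frame indices that survive keeping every n-th frame."""
    return [i for i in range(total_frames) if i % every_n == 0]

def sample_frames(path, every_n=1):
    """Read frames sequentially; yield one BGR ndarray at a time."""
    import cv2  # opencv-python; imported lazily so the helper above works without it
    cap = cv2.VideoCapture(path)
    idx = 0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                yield idx, frame  # ndarray of shape (height, width, 3)
            idx += 1
    finally:
        cap.release()
```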
08-04-2022 03:41 AM
I believe that is true as well, since I am just using the Python library wrapper around it. If there were a way to run it sequentially on the driver, hand each processed row off to the executors, and materialise the result outside of memory, I would be fine with that (my main problem is memory usage, not speed/CPU utilisation), but I don't know if there is a way to do this.
I'll have a look at OpenCV, though last time I checked I couldn't get the decryption to work there.
Deepstream seems like a completely end-to-end system, and since I have some bespoke preprocessing to do, I don't think I can make that work (easily).
Thanks for your answer!
08-12-2022 08:45 AM
In the end, I managed to get it to work. The way I did it is to have only one executor per cluster node (setting spark.executor.cores to the node core count minus 2 and the executor memory as high as allowed), repartition to more partitions than there are rows in the table, and limit the number of videos processed at a time. If someone has a better idea that allows for more parallelisation than my solution, let me know!
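The shape of the workaround, sketched out for anyone finding this later (the config values are illustrative, and `extract_frames_partition` is a stand-in for the per-row FFmpeg extraction, not a real function):

```python
def batch_plan(n_videos, batch_size):
    """Split n_videos into (start, end) index batches so only
    batch_size videos are preprocessed per pass."""
    return [(i, min(i + batch_size, n_videos))
            for i in range(0, n_videos, batch_size)]

def preprocess_batch(batch_df):
    """batch_df: Spark DataFrame with one row per video (see sample row above).
    Repartitioning to more partitions than rows gives each task at most
    one video, so a task never buffers frames for two videos at once."""
    n_rows = batch_df.count()
    (batch_df
     .repartition(n_rows + 1)
     .foreachPartition(extract_frames_partition))  # hypothetical per-row extraction that writes frames straight to storage

# Cluster-level settings (set at cluster creation; values illustrative):
#   spark.executor.cores   <node core count - 2>   -> one fat executor per node
#   spark.executor.memory  <as high as the node allows>
```

Writing frames straight to storage inside `foreachPartition` (instead of returning them to the driver) is what keeps peak memory bounded.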
09-07-2022 04:19 AM
Hi @Robin G
Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
09-09-2022 01:58 AM
Hi @Vidula Khanna
Hope you're doing well too! I haven't really been able to solve my issue besides the workaround mentioned above, processing a very low number of videos at a time with a large VM, so I still consider this issue unresolved (and I don't like to mark my own answer as best, especially when it's so far from perfect 🙂 ). If you can provide some more pointers to a better solution, that would be very welcome of course!
05-30-2023 07:14 AM
In the end, I decided to rework the workflow so that it is as efficient as I could make it.
I was hoping there would be a less roundabout way, but this is how I managed to solve it in a fairly cost-effective way.