Hi, you can read about Bloom filters. A Bloom filter can drastically reduce I/O and has a small footprint. It tells Spark that a join ID is definitely not in a partition (no false negatives, so a "not present" answer is 100% correct). Bloom filters only work for equality (=) jo...
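To make the "no false negatives" property above concrete, here is a minimal Bloom filter sketch in plain Python. It is not Spark's implementation; the class name `BloomFilter`, the sizes, the SHA-256 based hashing, and the sample IDs are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (illustrative only, not Spark's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False -> key was definitely never added (no false negatives).
        # True  -> key was probably added (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical join IDs that live in one partition.
partition_ids = [101, 205, 309]

bf = BloomFilter()
for join_id in partition_ids:
    bf.add(join_id)

# Every ID that was added is always reported as possibly present.
print(all(bf.might_contain(i) for i in partition_ids))
```

If `might_contain` returns `False` for a join ID, the partition can be skipped safely. A `True` result still requires reading the partition, because it may be a false positive.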
%sh runs a shell command on the driver node’s OS, not inside the notebook’s Python/Spark runtime. It opens a separate Linux process on the driver machine. The Spark session, on the other hand, is attached to the notebook runtime. So when you...
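The separation between processes can be sketched in plain Python. This is only an analogy: it starts a child Python interpreter with `subprocess` instead of using %sh, and the `spark` variable here is a hypothetical placeholder string, not a real Spark session.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the session object that lives in the notebook runtime.
spark = "stand-in for the notebook's Spark session"

# The child prints its own PID and whether a name called `spark` exists there.
child_code = "import os; print(os.getpid()); print('spark' in globals())"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

child_pid_text, child_sees_spark = result.stdout.split()

print("parent pid:", os.getpid())
print("child pid:", child_pid_text)
print("child sees spark:", child_sees_spark)
```

The child runs under a different PID and has no `spark` name in scope, which mirrors why objects in the notebook runtime are not visible to a separately launched process.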