Hi @kasiviss42,
This might sound like a rhetorical question, but let's delve into the complexity of joins and filters and examine how generating a list of 2 lakh (200,000) values affects it.
Let's assume we have a fact table with 1 billion records and a dimension table with 2 lakh records:
Fact Table = 1 Billion
Dimension Table = 2 lakh
Join - Complexity Considerations:
- Data Volume: Joining 1 billion rows with 200,000 unique IDs can be computationally expensive.
- Data Shuffling: The join operation may require significant data shuffling across the cluster nodes, which can be time-consuming and resource-intensive.
- Indexing: If the join columns (ID columns) are indexed, the performance can be improved. Without indexes, the join operation will be slower.
- Result Size: The result set includes columns from both tables, which can increase the amount of data processed and transferred.
- Reference example: below is a join query on the id column, which takes on average 0.45 to 0.60 seconds. This may vary based on data volume and the number of join conditions.
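A minimal sketch of that join in PySpark, assuming hypothetical table names `fact_table` (~1 billion rows) and `dim_table` (~2 lakh rows), each with an `id` column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table names: fact_table (~1 billion rows) joined to
# dim_table (~2 lakh rows) on the shared id column.
joined_df = spark.sql("""
    SELECT f.*
    FROM fact_table f
    JOIN dim_table d
      ON f.id = d.id
""")

joined_df.count()  # trigger execution to measure elapsed time
```

Since the dimension table is relatively small, Spark may pick a broadcast hash join here (if it fits under the broadcast threshold), which avoids shuffling the large fact table.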

Filter - Complexity Considerations:
- Filter Operation: This query applies a filter to the fact table, which is generally less complex than a join operation.
- Data Scanning: The performance of the filter operation depends on the size of the fact table. If the table is large, scanning it can still be time-consuming.
- Indexing: If the ID column is indexed, the filter operation can be significantly faster. Without an index, the query will perform a full table scan.
- Result Size: The query selects all columns (*), which can increase the amount of data processed and transferred, but this is still likely to be less than for the join operation.
- Reference examples:
- Below is the filter based on a sub-query against the dimension table, which takes on average 0.35 to 0.50 seconds. This may vary based on data volume and indexing.
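A sketch of this variant, reusing the same hypothetical table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Filter the fact table with an IN sub-query on the dimension table;
# Spark's optimizer typically rewrites this into a left semi join.
filtered_df = spark.sql("""
    SELECT *
    FROM fact_table
    WHERE id IN (SELECT id FROM dim_table)
""")

filtered_df.count()  # trigger execution to measure elapsed time
```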

- Below is the filter based on values listed manually in the query, which takes around 0.21 to 0.35 seconds. This may vary based on data volume and indexing.
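And a sketch of the manual-list variant; the literal values are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hard-coded IN list: no sub-query or join is involved,
# only a scan of the fact table with a literal predicate.
filtered_df = spark.sql("""
    SELECT *
    FROM fact_table
    WHERE id IN (101, 102, 103)  -- imagine ~200,000 literal values here
""")

filtered_df.count()  # trigger execution to measure elapsed time
```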

Now let's talk about generating the 2 lakh values using Python or Scala and passing them to the select query as a filter.
Let's consider roughly the time taken by the different steps in PySpark: read data >> collect from DataFrame >> loop through values >> join values and pass them to the filter. This adds extra time to read and prepare the values for the filter. The code below takes on average 0.57 to 0.80 seconds, depending on the number of values from the dimension table.
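In rough terms, those steps look like this (table and column names are again illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: read the dimension table and collect the ids to the driver.
ids = [row["id"] for row in spark.table("dim_table").select("id").collect()]

# Step 2: loop through the values and join them into a comma-separated string.
id_list = ",".join(str(v) for v in ids)

# Step 3: pass the prepared list into the filter query.
filtered_df = spark.sql(f"SELECT * FROM fact_table WHERE id IN ({id_list})")

filtered_df.count()  # trigger execution to measure elapsed time
```

The collect and string-building steps are the additional preparation cost over the plain manual-list filter above.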

Hope this helps.
Regards,
Hari Prasad