I have a Spark pipeline which reads selected data from table_1 as a view, performs a few aggregations via group by in the next step, and writes to a target table. table_1 holds large data, ~30GB of compressed CSV.

Step-1: create or replace temporary view base_data...
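A minimal sketch of the shape of such a pipeline (the table, column, and aggregate names are placeholders, not the original code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("base_data_agg").getOrCreate()

    # Step-1: expose the selected columns of table_1 as a temporary view.
    spark.sql("""
        CREATE OR REPLACE TEMPORARY VIEW base_data AS
        SELECT key_col, metric_col
        FROM table_1
    """)

    # Step-2: aggregate via GROUP BY and write to the target table.
    agg_df = spark.sql("""
        SELECT key_col, SUM(metric_col) AS metric_total
        FROM base_data
        GROUP BY key_col
    """)

    agg_df.write.mode("overwrite").saveAsTable("target_table")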
On Databricks, I created a job task with task type "Python script" from S3. However, when arguments are passed via the Parameters option, the run fails with an 'unrecognized arguments' error. Code in the S3 file:

    import argparse

    def parse_arguments():
        parser = argparse.ArgumentParser()
        ...
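For a Python script task, Databricks passes the Parameters list straight through as command-line arguments (i.e. into sys.argv), so a common cause of this error is a mismatch between the Parameters in the job config and the flags the parser defines. A minimal sketch that also tolerates extra injected arguments via parse_known_args(); the flag names here are hypothetical, not the original ones:

    import argparse

    def parse_arguments():
        parser = argparse.ArgumentParser()
        # Hypothetical flags; they must match the job's Parameters list
        # exactly, e.g. ["--env", "prod", "--run-date", "2024-01-01"].
        parser.add_argument("--env", required=True)
        parser.add_argument("--run-date", required=True)
        # parse_known_args() ignores any extra arguments the platform passes,
        # instead of failing with "unrecognized arguments" like parse_args().
        args, _unknown = parser.parse_known_args()
        return args

    if __name__ == "__main__":
        args = parse_arguments()
        print(args.env, args.run_date)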
Much appreciated. Lastly, before closing: isn't overhead memory outside of spark.executor.memory (7.6GB)? If spark.executor.memory is 7.6GB, is this dedicated to storage/execution, with some reserved? Because 16GB is the total memory per machine, out ...
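For reference, a rough sketch of the split under Spark's unified memory model: the 7.6GB heap figure is from this thread, while spark.memory.fraction = 0.6, the ~300MB reserved block, and the max(384MB, 10%) off-heap overhead are the defaults (assumptions if the cluster overrides them):

    # Rough sketch of the executor memory split, assuming default configs.
    executor_memory_gb = 7.6                 # spark.executor.memory (JVM heap)
    reserved_gb = 0.3                        # fixed ~300MB reserved by Spark
    memory_fraction = 0.6                    # spark.memory.fraction default

    unified_gb = (executor_memory_gb - reserved_gb) * memory_fraction
    user_gb = (executor_memory_gb - reserved_gb) * (1 - memory_fraction)

    # Overhead lives OUTSIDE the heap: max(384MB, 10% of executor memory).
    overhead_gb = max(0.384, 0.10 * executor_memory_gb)

    print(f"unified (execution+storage): {unified_gb:.2f} GB")  # ~4.38 GB
    print(f"user memory:                 {user_gb:.2f} GB")     # ~2.92 GB
    print(f"off-heap overhead:           {overhead_gb:.2f} GB") # ~0.76 GB

So yes, the overhead is allocated outside spark.executor.memory, and the heap plus overhead together must fit inside the 16GB the machine provides.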
I appreciate Avinash's detailed response. I'm using n2-highcpu-16, which has 16GB, with 10 workers for 160GB in total. Memory is 7616M per executor, i.e. executor memory is ~76GB in total. Would it not be able to handle the (30GB compressed CSV) data for disti...
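One thing worth checking here (an assumption, since the codec isn't stated in the thread): if the CSV is gzip-compressed, each file is not splittable and is read by a single task, and 30GB compressed can expand several-fold once decompressed and deserialized. A common mitigation is to repartition right after the read so the wide operation isn't fed by a handful of huge partitions; the path below is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # gzip files are not splittable: one task per file on read.
    df = (spark.read
          .option("header", "true")
          .csv("s3://bucket/path/table_1/*.csv.gz"))  # hypothetical path

    # Spread rows across many tasks before distinct/group-by shuffles them.
    df = df.repartition(200)
    df.createOrReplaceTempView("base_data")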
Thanks for the response. Could you please elaborate on why distinct is an expensive operation? From my understanding, it's similar to a group by operation, where Spark likely hashes the key to shuffle the data and eliminate duplicates. Why...
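The intuition in the question can be verified directly: distinct() is effectively a group-by over all columns, and both produce the same shape of physical plan (a partial HashAggregate, an Exchange that shuffles every row across the network, then a final HashAggregate). The shuffle of the full dataset is where the cost comes from. A small sketch to compare the plans (column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.range(1_000_000)
          .selectExpr("id % 1000 AS key_col", "id % 10 AS metric_col"))

    # Both plans show HashAggregate -> Exchange hashpartitioning -> HashAggregate.
    df.distinct().explain()                                # dedupe over ALL columns
    df.groupBy("key_col", "metric_col").count().explain()  # explicit group by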