We have a PySpark DataFrame with 50 million records. We can display records from it, but printing the shape of the DataFrame takes around 10 minutes. We plan to use this data for modelling, which will take as input some numerical features derived from the final DataFrame computed here.
To make the issue easier to understand, we have reproduced it with a 5-record DataFrame and included working PySpark code.
Please refer to the attachment for the sample code and a detailed explanation: pyspark-issue.zip