Hi guys!
I'm running into a problem at work: I need to process a customer dataset with 300 billion rows and 5 columns. The transformations themselves are "simple" (mostly joins to attach characteristics to each customer), and at the end of the pipeline I need to write a .csv file to S3. Right now my notebook takes 35 to 50 hours to run. The data volume is really huge; we have over 100 million customers.
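To make the question concrete, here's a toy sketch of the kind of join and CSV output I mean. The column names and data are made up, and this is plain Python on tiny data just to show the logic; the real job runs over the full dataset in a notebook:

```python
import csv
import io

# Tiny stand-in for the customer table (made-up columns).
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Bruno"},
]

# Lookup table of characteristics keyed by customer_id.
segments = {1: "premium", 2: "standard"}

# The "simple" transformation: a left join that assigns a
# characteristic to each customer row.
enriched = [
    {**row, "segment": segments.get(row["customer_id"])}
    for row in customers
]

# Write the result as CSV; in the real pipeline this output
# goes to S3 instead of an in-memory buffer.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customer_id", "name", "segment"])
writer.writeheader()
writer.writerows(enriched)
```

At toy scale this is trivial, which is exactly my point: the logic is simple, it's only the volume that makes it take 35+ hours.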