Best Practices as a Beginner
09-18-2024 01:00 AM
Hello everyone,
I am working on a project where I need to analyze a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it to DBFS. Every time I want to run SQL queries for preprocessing or analysis, I create a temporary view of the Parquet file from DBFS, then run:
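```sql
%sql
-- target_table is a placeholder: the target name was omitted in the original statement
CREATE OR REPLACE TABLE target_table AS
SELECT *
FROM table_name;
```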
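For context, the temporary-view step mentioned above might look roughly like the sketch below. The path `dbfs:/path/to/data.parquet` is a placeholder rather than the real location, and the sketch assumes the view is the `table_name` that the statement above selects from.

```sql
%sql
-- Sketch only: the path is a placeholder, and the view name is assumed
-- to match the table_name referenced in the statement above.
CREATE OR REPLACE TEMPORARY VIEW table_name
USING parquet
OPTIONS (path 'dbfs:/path/to/data.parquet');
```

A temporary view like this only lasts for the current Spark session, which is why it is recreated on every run.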
Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic operations, such as a `DELETE` of rows matching a condition, take over 50 minutes.
I would like to know the most efficient way to handle this situation, so that I don't have to wait this long or risk the cluster dying on me.
This is my first time working with Big Data, and I'm looking for help. I tried caching, and I tried splitting the dataset into a few parts by row count (200 million rows per part), but neither helped.
Regards