Best Practices as a Beginner

Faiyaz17 · ‎09-18-2024

Hello everyone,

I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it onto the DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS, then run:

```%sql
CREATE OR REPLACE TABLE
SELECT *
FROM table_name;
```

Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands like `DELETE` certain rows based on a condition take over 50 minutes.

I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.

This is my first time working with Big Data, and I’m looking for help. I tried caching or splitting the dataset into few parts based on the number of rows (200 million for each part) but it still didn't help.

Regards