<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Best Practices as a Beginner in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90851#M38007</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/121425"&gt;@Faiyaz17&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it onto the DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS, then run:&lt;/P&gt;&lt;P&gt;```%sql&lt;BR /&gt;CREATE OR REPLACE TABLE&lt;BR /&gt;SELECT *&lt;BR /&gt;FROM table_name;&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands like `DELETE` certain rows based on a condition take over 50 minutes.&lt;/P&gt;&lt;P&gt;I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.&lt;/P&gt;&lt;P&gt;This is my first time working with Big Data, and I’m looking for help. I tried caching and splitting the dataset into a few parts by row count (200 million rows each), but it still didn't help.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Here are some concise strategies for working with a 1-billion-row dataset:&lt;/P&gt;&lt;P&gt;&lt;B&gt;Cluster configuration&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Increase the cluster size for more resources.&lt;/LI&gt;&lt;LI&gt;Use spot instances for cost savings.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Data partitioning&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Partition the Parquet files by relevant columns.&lt;/LI&gt;&lt;LI&gt;Create partitioned tables for faster queries.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use DataFrames&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Load the Parquet files into DataFrames instead of temporary views.&lt;/LI&gt;&lt;LI&gt;Filter rows and select only the necessary columns as early as possible.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Caching&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use df.cache() to keep frequently accessed DataFrames in memory.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Optimize SQL queries&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Avoid SELECT *; specify only the columns you need.&lt;/LI&gt;&lt;LI&gt;Break large DELETE operations into smaller batches.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Consider Delta Lake&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Convert the Parquet files to Delta Lake for better performance and for features such as ACID transactions and efficient DELETE/UPDATE.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Monitor and tune performance&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use the Spark UI to identify bottlenecks.&lt;/LI&gt;&lt;LI&gt;Adjust Spark configurations based on the workload.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use SQL optimizations&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create materialized views for frequently run queries.&lt;/LI&gt;&lt;LI&gt;Analyze tables and collect statistics to help query planning.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Best regards,&lt;BR /&gt;JennieF&lt;/P&gt;</description>
    <pubDate>Wed, 18 Sep 2024 08:53:32 GMT</pubDate>
    <dc:creator>jennie258fitz</dc:creator>
    <dc:date>2024-09-18T08:53:32Z</dc:date>
    <item>
      <title>Best Practices as a Beginner</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90833#M38003</link>
      <description>&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it onto the DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS, then run:&lt;/P&gt;&lt;P&gt;```%sql&lt;BR /&gt;CREATE OR REPLACE TABLE&lt;BR /&gt;SELECT *&lt;BR /&gt;FROM table_name;&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands like `DELETE` certain rows based on a condition take over 50 minutes.&lt;/P&gt;&lt;P&gt;I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.&lt;/P&gt;&lt;P&gt;This is my first time working with Big Data, and I’m looking for help. I tried caching and splitting the dataset into a few parts by row count (200 million rows each), but it still didn't help.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Wed, 18 Sep 2024 08:00:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90833#M38003</guid>
      <dc:creator>Faiyaz17</dc:creator>
      <dc:date>2024-09-18T08:00:56Z</dc:date>
    </item>
    <item>
      <title>Re: Best Practices as a Beginner</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90851#M38007</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/121425"&gt;@Faiyaz17&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it onto the DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS, then run:&lt;/P&gt;&lt;P&gt;```%sql&lt;BR /&gt;CREATE OR REPLACE TABLE&lt;BR /&gt;SELECT *&lt;BR /&gt;FROM table_name;&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands like `DELETE` certain rows based on a condition take over 50 minutes.&lt;/P&gt;&lt;P&gt;I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.&lt;/P&gt;&lt;P&gt;This is my first time working with Big Data, and I’m looking for help. I tried caching and splitting the dataset into a few parts by row count (200 million rows each), but it still didn't help.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Here are some concise strategies for working with a 1-billion-row dataset:&lt;/P&gt;&lt;P&gt;&lt;B&gt;Cluster configuration&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Increase the cluster size for more resources.&lt;/LI&gt;&lt;LI&gt;Use spot instances for cost savings.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Data partitioning&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Partition the Parquet files by relevant columns.&lt;/LI&gt;&lt;LI&gt;Create partitioned tables for faster queries.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use DataFrames&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Load the Parquet files into DataFrames instead of temporary views.&lt;/LI&gt;&lt;LI&gt;Filter rows and select only the necessary columns as early as possible.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Caching&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use df.cache() to keep frequently accessed DataFrames in memory.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Optimize SQL queries&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Avoid SELECT *; specify only the columns you need.&lt;/LI&gt;&lt;LI&gt;Break large DELETE operations into smaller batches.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Consider Delta Lake&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Convert the Parquet files to Delta Lake for better performance and for features such as ACID transactions and efficient DELETE/UPDATE.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Monitor and tune performance&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use the Spark UI to identify bottlenecks.&lt;/LI&gt;&lt;LI&gt;Adjust Spark configurations based on the workload.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use SQL optimizations&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create materialized views for frequently run queries.&lt;/LI&gt;&lt;LI&gt;Analyze tables and collect statistics to help query planning.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Best regards,&lt;BR /&gt;JennieF&lt;/P&gt;</description>
      <pubDate>Wed, 18 Sep 2024 08:53:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90851#M38007</guid>
      <dc:creator>jennie258fitz</dc:creator>
      <dc:date>2024-09-18T08:53:32Z</dc:date>
    </item>
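    A minimal %sql sketch of the Delta Lake and batched-DELETE suggestions in the reply above. All object names (events_delta, order_id, event_date, amount) and the dbfs:/ path are illustrative assumptions, not details from the thread:

    ```sql
    -- One-time conversion: materialize the Parquet data as a partitioned Delta
    -- table, selecting only the columns the analysis needs (names are assumed).
    CREATE OR REPLACE TABLE events_delta
    USING DELTA
    PARTITIONED BY (event_date)
    AS SELECT order_id, event_date, amount
       FROM parquet.`dbfs:/data/events.parquet`;

    -- Compact small files and co-locate rows that share common filter values.
    OPTIMIZE events_delta ZORDER BY (order_id);

    -- Delete in partition-sized batches instead of one full-table scan;
    -- on Delta, each statement rewrites only the files it touches.
    DELETE FROM events_delta WHERE event_date = '2024-01-01';
    DELETE FROM events_delta WHERE event_date = '2024-01-02';
    ```

    Once the data is in Delta, later queries read only the partitions and files they touch, which is usually far cheaper than rebuilding a full copy of the table at the start of every session.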
  </channel>
</rss>

