Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best Practices as a Beginner

Faiyaz17
New Contributor

Hello everyone,

I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it to DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS and then run:

```%sql
CREATE OR REPLACE TABLE my_table AS  -- my_table is a placeholder name
SELECT *
FROM table_name;
```

Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands, such as a `DELETE` of certain rows based on a condition, take over 50 minutes.

I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.

This is my first time working with Big Data, and I’m looking for help. I tried caching, and also splitting the dataset into a few parts based on the number of rows (200 million rows each), but it still didn't help.

Regards

1 REPLY

jennie258fitz
New Contributor


Hello,

Here are some concise strategies to optimize your work with the 1 billion-row dataset:

Cluster Configuration:

- Increase the cluster size for more resources.
- Use spot instances for cost savings.

Data Partitioning:

- Partition the Parquet files by columns you filter on often.
- Create partitioned tables so queries read only the partitions they need (see the sketch below).
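
A minimal PySpark sketch of writing the data out partitioned by a date column; the paths and the `event_date` column are hypothetical, so substitute your own:

```python
from pyspark.sql import SparkSession

# `spark` already exists in a Databricks notebook; getOrCreate() just reuses it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and partition column -- adjust to your dataset.
df = spark.read.parquet("/mnt/raw/events.parquet")

(df.write
   .partitionBy("event_date")      # choose a column you filter on frequently
   .mode("overwrite")
   .parquet("dbfs:/mnt/curated/events_partitioned"))

# A query that filters on the partition column only reads the matching folders.
spark.read.parquet("dbfs:/mnt/curated/events_partitioned") \
     .where("event_date = '2024-01-01'") \
     .count()
```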

Use DataFrames:

- Load the Parquet files into DataFrames instead of temporary views.
- Filter and select only the necessary columns as early as possible (sketch below).
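
As a sketch, with purely illustrative column names, that early pruning could look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the Parquet files on DBFS directly into a DataFrame (path is illustrative).
events = spark.read.parquet("/mnt/raw/events.parquet")

# Keep only the columns the analysis needs and filter as early as possible, so
# Spark can prune columns and push the predicate down to the Parquet scan.
slim = (events
        .select("user_id", "event_date", "amount")   # hypothetical column names
        .filter(F.col("amount") > 0))

slim.groupBy("event_date").agg(F.sum("amount").alias("total_amount")).show()
```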

Caching:

- Use `df.cache()` to keep frequently accessed DataFrames in memory, as in the example below.
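
For example (path and column name again hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/mnt/raw/events.parquet")   # illustrative path

df.cache()        # mark the DataFrame for caching
df.count()        # run one action to materialize the cache

# Later queries against df reuse the cached data instead of rescanning Parquet.
df.filter("amount > 100").count()                    # `amount` is hypothetical

df.unpersist()    # release the cache when you are finished
```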

Optimize SQL Queries:

- Avoid `SELECT *`; list only the columns you need.
- Break large `DELETE` operations into smaller batches (see the batching sketch below).
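
One way to batch a large `DELETE`, sketched here under the assumption that the data is already a Delta table named `events_delta` (plain Parquet temp views don't support `DELETE`) and that `event_date` and `status` are stand-in column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Find the distinct chunks (here: dates) that contain rows to remove.
dates_to_purge = [row["event_date"] for row in spark.sql("""
    SELECT DISTINCT event_date
    FROM events_delta
    WHERE status = 'invalid'
""").collect()]

# Delete one chunk at a time instead of in a single huge statement.
for d in dates_to_purge:
    spark.sql(
        f"DELETE FROM events_delta WHERE status = 'invalid' AND event_date = '{d}'"
    )
```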

Consider Delta Lake:

- Convert the Parquet files to Delta Lake for better performance and for features such as ACID transactions and support for `DELETE`/`UPDATE` (conversion sketch below).
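
A sketch of the conversion; the path and the `events_delta` table name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: convert the Parquet directory to Delta in place.
spark.sql("CONVERT TO DELTA parquet.`/mnt/raw/events.parquet`")

# Option 2: rewrite the data as a managed Delta table.
(spark.read.parquet("/mnt/raw/events.parquet")
      .write.format("delta")
      .mode("overwrite")
      .saveAsTable("events_delta"))

# Compact small files so later scans read fewer, larger files.
spark.sql("OPTIMIZE events_delta")
```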

Monitor and Tune Performance:

- Use the Spark UI to identify bottlenecks (skewed tasks, spills, long shuffles).
- Adjust Spark configurations based on the workload (example below).
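
For example, two commonly adjusted settings; the values here are illustrative starting points, not recommendations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Number of shuffle partitions used by joins/aggregations (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Adaptive query execution lets Spark re-optimize plans at runtime
# (already enabled by default on recent Databricks runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")
```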

Use SQL Optimizations:

- Create materialized views for frequently run queries.
- Collect table and column statistics with `ANALYZE TABLE` so the optimizer has accurate information to plan with (sketch below).
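
A sketch of collecting statistics; the table and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE events_delta COMPUTE STATISTICS")

# Column-level statistics for columns used in filters and joins.
spark.sql("ANALYZE TABLE events_delta COMPUTE STATISTICS FOR COLUMNS user_id, event_date")
```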

Best regards,
JennieF
