<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Best Practices as a Beginner in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90851#M38007</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/121425"&gt;@Faiyaz17&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it onto the DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS, then run:&lt;/P&gt;&lt;P&gt;```%sql&lt;BR /&gt;CREATE OR REPLACE TABLE&lt;BR /&gt;SELECT *&lt;BR /&gt;FROM table_name;&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands like `DELETE` certain rows based on a condition take over 50 minutes.&lt;/P&gt;&lt;P&gt;I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.&lt;/P&gt;&lt;P&gt;This is my first time working with Big Data, and I’m looking for help. I tried caching and splitting the dataset into a few parts by row count (200 million rows each), but it still didn't help.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Here are some concise strategies for working with a 1-billion-row dataset:&lt;/P&gt;&lt;P&gt;&lt;B&gt;Cluster configuration&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Increase the cluster size for more resources.&lt;/LI&gt;&lt;LI&gt;Use spot instances for cost savings.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Data partitioning&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Partition the Parquet files by relevant columns.&lt;/LI&gt;&lt;LI&gt;Create partitioned tables for faster queries.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use DataFrames&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Load the Parquet files into DataFrames instead of temporary views.&lt;/LI&gt;&lt;LI&gt;Filter rows and select only the necessary columns as early as possible.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Caching&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use df.cache() to keep frequently accessed DataFrames in memory.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Optimize SQL queries&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Avoid SELECT *; specify only the columns you need.&lt;/LI&gt;&lt;LI&gt;Break large DELETE operations into smaller batches.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Consider Delta Lake&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Convert the Parquet files to Delta Lake for better performance and for features such as ACID transactions and efficient DELETE/UPDATE.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Monitor and tune performance&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use the Spark UI to identify bottlenecks.&lt;/LI&gt;&lt;LI&gt;Adjust Spark configurations based on the workload.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use SQL optimizations&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create materialized views for frequently run queries.&lt;/LI&gt;&lt;LI&gt;Analyze tables and collect statistics to help query planning.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Best regards,&lt;BR /&gt;JennieF&lt;/P&gt;</description>
    <pubDate>Wed, 18 Sep 2024 08:53:32 GMT</pubDate>
    <dc:creator>jennie258fitz</dc:creator>
    <dc:date>2024-09-18T08:53:32Z</dc:date>
    <item>
      <title>Best Practices as a Beginner</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90833#M38003</link>
      <description>&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it onto the DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS, then run:&lt;/P&gt;&lt;P&gt;```%sql&lt;BR /&gt;CREATE OR REPLACE TABLE&lt;BR /&gt;SELECT *&lt;BR /&gt;FROM table_name;&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands like `DELETE` certain rows based on a condition take over 50 minutes.&lt;/P&gt;&lt;P&gt;I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.&lt;/P&gt;&lt;P&gt;This is my first time working with Big Data, and I’m looking for help. I tried caching and splitting the dataset into a few parts by row count (200 million rows each), but it still didn't help.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Wed, 18 Sep 2024 08:00:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90833#M38003</guid>
      <dc:creator>Faiyaz17</dc:creator>
      <dc:date>2024-09-18T08:00:56Z</dc:date>
    </item>
    <item>
      <title>Re: Best Practices as a Beginner</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90851#M38007</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/121425"&gt;@Faiyaz17&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;Hello everyone,&lt;/P&gt;&lt;P&gt;I am working on a project where I need to conduct some analysis on a dataset with 1 billion rows. I extracted the Parquet file from Azure and saved it onto the DBFS. Every time I want to run SQL queries and do preprocessing/analysis, I create a temporary view of the Parquet file from DBFS, then run:&lt;/P&gt;&lt;P&gt;```%sql&lt;BR /&gt;CREATE OR REPLACE TABLE&lt;BR /&gt;SELECT *&lt;BR /&gt;FROM table_name;&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;Afterward, I execute SQL commands. Each time I do this, it takes about an hour, and sometimes the cluster dies. Even basic SQL commands like `DELETE` certain rows based on a condition take over 50 minutes.&lt;/P&gt;&lt;P&gt;I would like to know the most efficient way to handle this situation. I don’t want to wait so long, and sometimes the cluster dies on me.&lt;/P&gt;&lt;P&gt;This is my first time working with Big Data, and I’m looking for help. I tried caching and splitting the dataset into a few parts by row count (200 million rows each), but it still didn't help.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Here are some concise strategies for working with a 1-billion-row dataset:&lt;/P&gt;&lt;P&gt;&lt;B&gt;Cluster configuration&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Increase the cluster size for more resources.&lt;/LI&gt;&lt;LI&gt;Use spot instances for cost savings.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Data partitioning&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Partition the Parquet files by relevant columns.&lt;/LI&gt;&lt;LI&gt;Create partitioned tables for faster queries.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use DataFrames&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Load the Parquet files into DataFrames instead of temporary views.&lt;/LI&gt;&lt;LI&gt;Filter rows and select only the necessary columns as early as possible.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Caching&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use df.cache() to keep frequently accessed DataFrames in memory.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Optimize SQL queries&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Avoid SELECT *; specify only the columns you need.&lt;/LI&gt;&lt;LI&gt;Break large DELETE operations into smaller batches.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Consider Delta Lake&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Convert the Parquet files to Delta Lake for better performance and for features such as ACID transactions and efficient DELETE/UPDATE.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Monitor and tune performance&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use the Spark UI to identify bottlenecks.&lt;/LI&gt;&lt;LI&gt;Adjust Spark configurations based on the workload.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;Use SQL optimizations&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create materialized views for frequently run queries.&lt;/LI&gt;&lt;LI&gt;Analyze tables and collect statistics to help query planning.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Best regards,&lt;BR /&gt;JennieF&lt;/P&gt;</description>
      <pubDate>Wed, 18 Sep 2024 08:53:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-as-a-beginner/m-p/90851#M38007</guid>
      <dc:creator>jennie258fitz</dc:creator>
      <dc:date>2024-09-18T08:53:32Z</dc:date>
    </item>
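    A minimal %sql sketch of the Delta Lake and batched-DELETE suggestions in the reply above. All object names (events_delta, order_id, event_date, amount) and the dbfs:/ path are illustrative assumptions, not details from the thread:

    ```sql
    -- One-time conversion: materialize the Parquet data as a partitioned Delta
    -- table, selecting only the columns the analysis needs (names are assumed).
    CREATE OR REPLACE TABLE events_delta
    USING DELTA
    PARTITIONED BY (event_date)
    AS SELECT order_id, event_date, amount
       FROM parquet.`dbfs:/data/events.parquet`;

    -- Compact small files and co-locate rows that share common filter values.
    OPTIMIZE events_delta ZORDER BY (order_id);

    -- Delete in partition-sized batches instead of one full-table scan;
    -- on Delta, each statement rewrites only the files it touches.
    DELETE FROM events_delta WHERE event_date = '2024-01-01';
    DELETE FROM events_delta WHERE event_date = '2024-01-02';
    ```

    Once the data is in Delta, later queries read only the partitions and files they touch, which is usually far cheaper than rebuilding a full copy of the table at the start of every session.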
  </channel>
</rss>

