Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Apache Spark proficiency

SamAWS
New Contributor III

What is the best way to become proficient in Apache Spark?

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz_Fatma
Community Manager
Community Manager

Hi @SamAWS , 

To become proficient in Apache Spark™ by leveraging the resources available in the Databricks Community, follow these steps:

  1. Getting Started with Databricks Community:

    • Sign up for a Databricks Community account if you haven't already.
    • Explore the Databricks platform interface to get familiar with its layout and features.
  2. Access Learning Resources:

    • Databricks provides extensive documentation and tutorials that cover a wide range of topics related to Apache Spark™.
    • Utilize Databricks' "Workspace" to access notebooks and learning materials. Start with the provided examples and tutorials.
  3. Interactive Notebooks:

    • Databricks notebooks are a powerful way to learn and experiment with Spark.
    • Begin by running simple code snippets to understand the basics of Spark's DataFrame API and RDDs (Resilient Distributed Datasets).
  4. Online Courses and Certifications:

    • Databricks offers online courses and certifications that can significantly accelerate your learning process.
    • Take courses on Spark essentials, data engineering, machine learning with Spark, and more.
  5. Collaborate and Learn from the Community:

    • Databricks Community has a vibrant forum where you can ask questions, share insights, and learn from other users.
    • Engage with the community to solve problems and discuss best practices.
  6. Hands-On Projects:

    • Apply what you've learned by working on real-world projects. Databricks allows you to create and manage clusters to process data.
  7. Performance Optimization:

    • Dive into Spark's performance optimization techniques. Databricks provides tools for monitoring and profiling your Spark jobs.
  8. Advanced Topics:

    • Once you're comfortable with the basics, explore more advanced topics like structured streaming, graph processing, and deep learning with Spark.
  9. Stay Updated:

    • Follow Databricks' official blog and release notes to stay up to date with the latest features and enhancements.
  10. Networking:

    • Attend Databricks webinars, conferences, and meetups to connect with other data professionals and learn from their experiences.

Remember, becoming proficient in Apache Spark™ takes time and practice. The Databricks Community offers an excellent platform to learn, experiment, and grow your Spark skills. Good luck on your learning journey!


4 REPLIES


SamAWS
New Contributor III

Thank you so much for answering my question.
Based on your experience, should I use Scala or Python for data engineering?

SamAWS
New Contributor III

Thank you for the quick response.

Hi @SamAWS, the choice between Scala and Python for data engineering depends on several factors: the specific use case, the team's expertise, and the nature of the tasks.

Here's a comparison of both languages based on the provided information:

1. **Scala**:
  - Scala is native to the Apache Spark™ ecosystem, which Databricks is built upon, so it may offer better performance and earlier access to new Spark APIs ([Docs: dataframes-scala](https://docs.databricks.com/getting-started/dataframes-scala.html)).
  - Scala can be used with the Databricks SDK and provides a comprehensive development environment with tools like IntelliJ IDEA ([Docs: sdk-java](https://docs.databricks.com/dev-tools/sdk-java.html)).
  - However, Scala has some limitations: Scala UDFs are not included in certain previews and are not supported for encrypting data in some pipelines ([Docs: UDF Issues with Unity Catalog](https://docs.google.com/document/d/1UEIUrz22w8TiwAo-q1ZMbDv_m4AlEkHvJ6QPE6OUUQY/)).

2. **Python**:
  - Python is widely used in the data science community and has a rich ecosystem of libraries and tools for data manipulation and analysis ([Docs: languages](https://docs.databricks.com/languages/index.html)).
  - Python can be used for ETL and data engineering tasks in Databricks, offering a robust ETL experience ([Docs: introduction](https://docs.databricks.com/introduction/index.html)).
  - Python UDFs are supported and recommended for specific tasks where Scala UDFs are not ([Docs: UDF Issues with Unity Catalog](https://docs.google.com/document/d/1UEIUrz22w8TiwAo-q1ZMbDv_m4AlEkHvJ6QPE6OUUQY/)).

In summary, while both languages are supported and can be used effectively in Databricks for data engineering, the choice between Scala and Python would depend on your project's specific requirements and constraints.
