cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

RiyazAli
by Valued Contributor III
  • 7158 Views
  • 3 replies
  • 7 kudos

Resolved! Converting a transformation written in Spark Scala to PySpark

Hello all,I've been tasked to convert a Scala Spark code to PySpark code with minimal changes (kinda literal translation).I've come across some code that claims to be a list comprehension. Look below for code snippet:%scala val desiredColumn = Seq("f...

  • 7158 Views
  • 3 replies
  • 7 kudos
Latest Reply
RiyazAli
Valued Contributor III
  • 7 kudos

Another follow-up question, if you don't mind. @Pat Sienkiewicz​ As I was trying to parse the name column into multiple columns. I came across the data below:("James,\"A,B\", Smith", "2018", "M", 3000)In order to parse these comma-included middle na...

  • 7 kudos
2 More Replies
Mradul07
by New Contributor II
  • 981 Views
  • 0 replies
  • 1 kudos

Spark behavior while dealing with Actions & Transformations ?

Hi, My question is - what happens to the initial RDD after the action is performed on it. Does it disappear or stays in the memory or does it needs to be explicitly cached() if we want to use it again.For eg : If I execute this in a sequence :df_outp...

  • 981 Views
  • 0 replies
  • 1 kudos
sage5616
by Valued Contributor
  • 19862 Views
  • 3 replies
  • 2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone,I am trying to determine the appropriate cluster specifications/sizing for my workload:Run a PySpark task to transform a batch of input avro files to parquet files and create or re-create persistent views on these parquet files. This t...

  • 19862 Views
  • 3 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.

  • 2 kudos
2 More Replies
User16826994223
by Honored Contributor III
  • 1080 Views
  • 1 replies
  • 0 kudos
  • 1080 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

1. AnalysisThe first phase of Spark SQL optimization is the analysis. Spark SQL starts with a relationship to be processed that can be in two ways. A serious form from an AST (abstract syntax tree) returned by an SQL parser, and on the other hand fro...

  • 0 kudos
aladda
by Databricks Employee
  • 6482 Views
  • 1 replies
  • 0 kudos
  • 6482 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Narrow Transformation: In Narrow transformation, all the elements that are required to compute the records in single partition live in the single partition of parent RDD. Ex:- Select, Filter, Union, Wide Transformation: Wide transformation, all the e...

  • 0 kudos
Labels