Data Engineering

Forum Posts

by RiyazAliM • Honored Contributor

11-09-2022 6:59:13 AM

9135 Views
3 replies
7 kudos

Resolved! Converting a transformation written in Spark Scala to PySpark

Hello all, I've been tasked with converting Scala Spark code to PySpark with minimal changes (a fairly literal translation). I've come across some code that claims to be a list comprehension. See the code snippet below: %scala val desiredColumn = Seq("f...

Latest Reply

RiyazAliM
Honored Contributor

11-10-2022 2:43:37 AM

7 kudos

Another follow-up question, if you don't mind, @Pat Sienkiewicz. As I was trying to parse the name column into multiple columns, I came across the data below: ("James,\"A,B\", Smith", "2018", "M", 3000) In order to parse these comma-included middle na...
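The preview cuts off before any answer, so the following is only a minimal sketch of one way to split such a value, not the approach taken in the thread. It assumes the name string follows ordinary CSV quoting (commas inside double quotes are data) and uses Python's standard `csv` module outside of Spark; the variable names are illustrative.

```python
import csv
import io

# Raw name value from the post: the middle name "A,B" is wrapped in double quotes.
raw_name = 'James,"A,B", Smith'

# csv.reader treats commas inside double quotes as data, not as delimiters.
# skipinitialspace=True drops the blank that follows the second delimiter.
reader = csv.reader(io.StringIO(raw_name), skipinitialspace=True)
first, middle, last = next(reader)

print(first, middle, last, sep=" | ")
```

Splitting on a bare comma would cut the quoted middle name in two; a quote-aware parser keeps it as a single field.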

by Mradul07 • New Contributor II

10-27-2022 3:20:05 PM

1441 Views
0 replies
1 kudos

Spark behavior while dealing with Actions & Transformations?

Hi, my question is: what happens to the initial RDD after an action is performed on it? Does it disappear, stay in memory, or need to be explicitly cached with cache() if we want to use it again? For example, if I execute this in sequence: df_outp...

by sage5616 • Valued Contributor

08-03-2022 3:06:05 PM

25128 Views
3 replies
2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: Run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...

Latest Reply

Anonymous
Not applicable

08-07-2022 1:25:11 PM

2 kudos

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.

by User16826994223 • Databricks Employee

06-25-2021 9:10:05 AM

2683 Views
1 reply
0 kudos

What are the four phases of transformation where Catalyst transformation is used?

Latest Reply

User16826994223
Databricks Employee

06-25-2021 9:10:26 AM

0 kudos

1. Analysis: The first phase of Spark SQL optimization is analysis. Spark SQL starts with a relation to be computed, which can arrive in two ways: on the one hand, from an AST (abstract syntax tree) returned by a SQL parser, and on the other hand fro...

by aladda • Databricks Employee

06-19-2021 8:42:15 PM

9134 Views
1 reply
0 kudos

Resolved! What is the difference between a Narrow Transformation and a Wide Transformation?

Latest Reply

aladda
Databricks Employee

06-19-2021 8:44:13 PM

0 kudos

Narrow Transformation: In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Ex: Select, Filter, Union. Wide Transformation: In a wide transformation, all the e...
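The distinction can be illustrated without a cluster. The sketch below is a plain-Python simulation, not Spark code: partitions are modeled as lists, and the function names, sample data, and the byte-sum stand-in for a hash partitioner are all assumptions made for illustration. Since the reply is truncated, the wide case follows the commonly documented behavior in which records for one output partition may come from many parent partitions.

```python
from collections import defaultdict

# Hypothetical data: a "dataset" split into three partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]

def narrow_filter(parts, predicate):
    """Narrow: each output partition is computed from exactly one input partition."""
    return [[rec for rec in part if predicate(rec)] for part in parts]

def wide_group_by_key(parts, num_partitions):
    """Wide: records sharing a key may sit in any input partition, so they are
    redistributed (shuffled) by key before they can be grouped."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic stand-in for a hash partitioner (illustrative only).
            target = sum(key.encode()) % num_partitions
            shuffled[target][key].append(value)
    return [dict(p) for p in shuffled]

filtered = narrow_filter(partitions, lambda rec: rec[1] > 2)
grouped = wide_group_by_key(partitions, num_partitions=2)

print(filtered)
print(grouped)
```

The filter keeps the original three partitions and never looks across them, while the grouping step has to read every input partition to build each output partition, which is the data movement that makes wide transformations more expensive.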

Databricks Community
