Women in Data & AI

𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬 𝐚𝐧𝐝 𝐚𝐜𝐭𝐢𝐨𝐧𝐬 𝐮𝐬𝐞𝐝 𝐢𝐧 𝐀𝐩𝐚𝐜𝐡𝐞 𝐒𝐩𝐚𝐫𝐤

Yogic24
Valued Contributor II

𝐋𝐢𝐬𝐭 𝐨𝐟 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬 𝐚𝐧𝐝 𝐚𝐜𝐭𝐢𝐨𝐧𝐬 𝐮𝐬𝐞𝐝 𝐢𝐧 𝐀𝐩𝐚𝐜𝐡𝐞 𝐒𝐩𝐚𝐫𝐤 𝐃𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞𝐬 𝐟𝐨𝐫 𝐚 𝐃𝐚𝐭𝐚 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐫𝐨𝐥𝐞:

𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬:
Transformations are operations on DataFrames that return a new DataFrame. They are lazily evaluated, meaning they do not execute immediately but build a logical plan that is executed when an action is performed.

𝟏. 𝐁𝐚𝐬𝐢𝐜 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬:
𝐬𝐞𝐥𝐞𝐜𝐭(): Select specific columns.
𝐟𝐢𝐥𝐭𝐞𝐫(): Filter rows based on a condition.
𝐰𝐢𝐭𝐡𝐂𝐨𝐥𝐮𝐦𝐧(): Add a column, or replace an existing column that has the same name.
𝐰𝐡𝐞𝐫𝐞(𝐜𝐨𝐧𝐝𝐢𝐭𝐢𝐨𝐧): Alias of filter(condition).
𝐝𝐫𝐨𝐩(*𝐜𝐨𝐥𝐬): Return a new DataFrame without the named columns.
𝐝𝐢𝐬𝐭𝐢𝐧𝐜𝐭(): Remove duplicate rows.
𝐬𝐨𝐫𝐭(): Sort the DataFrame by one or more columns.
𝐨𝐫𝐝𝐞𝐫𝐁𝐲(): Alias of sort().

𝟐. 𝐀𝐠𝐠𝐫𝐞𝐠𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐆𝐫𝐨𝐮𝐩𝐢𝐧𝐠:
𝐠𝐫𝐨𝐮𝐩𝐁𝐲(): Group rows by column values.
𝐚𝐠𝐠(): Aggregate data using functions.
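𝐜𝐨𝐮𝐧𝐭(): After groupBy(), count the rows in each group and return a DataFrame. Called directly on a DataFrame, count() is an action (see below).
𝐬𝐮𝐦(*𝐜𝐨𝐥𝐬): Compute the sum of each numeric column for each group.
𝐚𝐯𝐠(*𝐜𝐨𝐥𝐬): Compute the average of each numeric column for each group.
𝐦𝐢𝐧(*𝐜𝐨𝐥𝐬): Compute the minimum of each numeric column for each group.
𝐦𝐚𝐱(*𝐜𝐨𝐥𝐬): Compute the maximum of each numeric column for each group.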

𝟑. 𝐉𝐨𝐢𝐧𝐢𝐧𝐠 𝐃𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞𝐬:
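𝐣𝐨𝐢𝐧(𝐨𝐭𝐡𝐞𝐫, 𝐨𝐧=𝐍𝐨𝐧𝐞, 𝐡𝐨𝐰=𝐍𝐨𝐧𝐞): Join with another DataFrame on the given columns or join expression. If how is not given, an inner join is performed.
𝐮𝐧𝐢𝐨𝐧(): Combine the rows of two DataFrames. Columns are matched by position, not by name, and duplicate rows are kept; use unionByName() to match by column name.
𝐢𝐧𝐭𝐞𝐫𝐬𝐞𝐜𝐭(): Return the distinct rows that appear in both DataFrames.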

𝟒. 𝐀𝐝𝐯𝐚𝐧𝐜𝐞𝐝 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬:
𝐰𝐢𝐭𝐡𝐂𝐨𝐥𝐮𝐦𝐧𝐑𝐞𝐧𝐚𝐦𝐞𝐝(): Rename a column.
𝐝𝐫𝐨𝐩𝐃𝐮𝐩𝐥𝐢𝐜𝐚𝐭𝐞𝐬(): Drop duplicate rows, optionally comparing only a subset of columns.
𝐬𝐚𝐦𝐩𝐥𝐞(): Sample roughly a given fraction of rows; the exact number of rows returned is not guaranteed.
𝐥𝐢𝐦𝐢𝐭(): Limit the number of rows.

𝟓. 𝐖𝐢𝐧𝐝𝐨𝐰 𝐅𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐬:
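𝐨𝐯𝐞𝐫(𝐰𝐢𝐧𝐝𝐨𝐰𝐒𝐩𝐞𝐜): Apply a window function over a window specification. The specification itself is built with Window.partitionBy() and Window.orderBy().
𝐫𝐨𝐰_𝐧𝐮𝐦𝐛𝐞𝐫().𝐨𝐯𝐞𝐫(𝐰𝐢𝐧𝐝𝐨𝐰𝐒𝐩𝐞𝐜): Assign a sequential row number, starting at 1, within each window partition.
rank().over(windowSpec): Rank rows within each window partition. Tied rows share a rank and the following ranks are skipped; use dense_rank() for ranks without gaps.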

𝐀𝐜𝐭𝐢𝐨𝐧𝐬:
Actions trigger the execution of the transformations and return a result to the driver program or write data to an external storage system.

1. Basic Actions:
show(): Display the top rows of the DataFrame.
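collect(): Return all rows to the driver as a list. Avoid this on large DataFrames, since the whole result must fit in driver memory.
count(): Count the number of rows.
take(n): Return the first n rows as a list.
first(): Return the first row.
head(n): Return the first n rows as a list; with no argument, return only the first row.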

2. Writing Data:
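write: Property (not a method) that returns a DataFrameWriter for saving the DataFrame to external storage. Accessing it does not run anything.
write.mode(): Specify the save mode (e.g., overwrite, append, ignore, error).
save(): Save the DataFrame to the specified path. This call is what triggers the job.
toJSON(): Convert each row to a JSON string (an RDD of strings in PySpark). This is a lazy conversion and does not write anything by itself.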

3. Other Actions:
foreach(): Apply a function to each row.
foreachPartition(): Apply a function to each partition.

 

1 REPLY 1

ramyabodapati
New Contributor II

Very informative thanks 😊 
