List of transformations and actions used in Apache Spark DataFrames for a Data Engineering role:
Transformations:
Transformations are operations on DataFrames that return a new DataFrame. They are lazily evaluated, meaning they do not execute immediately but build a logical plan that is executed when an action is performed.
1. Basic Transformations:
select(): Select specific columns.
filter(): Filter rows based on a condition.
where(condition): Alias for filter(condition).
withColumn(): Add a column, or replace an existing column of the same name.
drop(*cols): Return a new DataFrame without the given columns.
distinct(): Remove duplicate rows.
sort(): Sort the DataFrame by one or more columns.
orderBy(): Alias for sort().
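2. Aggregation and Grouping:
groupBy(): Group rows by column values; returns a GroupedData object to aggregate.
agg(): Apply one or more aggregate functions, per group when called after groupBy().
count(): On grouped data, count the rows in each group. (Called directly on a DataFrame, count() is an action.)
sum(*cols): Compute the sum of each numeric column per group.
avg(*cols): Compute the average of each numeric column per group.
min(*cols): Compute the minimum of each numeric column per group.
max(*cols): Compute the maximum of each numeric column per group.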
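3. Joining DataFrames:
join(other, on=None, how=None): Join with another DataFrame using the given join expression; how defaults to an inner join.
union(): Combine two DataFrames with the same schema. Columns are matched by position and duplicate rows are kept; use unionByName() to match by column name.
intersect(): Return the distinct rows present in both DataFrames.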
4. Advanced Transformations:
withColumnRenamed(): Rename a column.
dropDuplicates(): Drop duplicate rows, optionally considering only a subset of columns.
sample(): Return a random sample of rows; the requested fraction is approximate, not exact.
limit(): Limit the result to at most N rows.
5. Window Functions:
over(windowSpec): Apply a window function over a window specification, which is built with Window.partitionBy() and orderBy().
row_number().over(windowSpec): Assign sequential row numbers starting at 1 within each window partition.
rank().over(windowSpec): Rank rows within each window partition; tied rows share a rank, and the next rank skips ahead to leave a gap.
Actions:
Actions trigger the execution of the transformations and return a result to the driver program or write data to an external storage system.
1. Basic Actions:
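show(): Print the first rows of the DataFrame (20 by default).
collect(): Return all rows to the driver as a list of Row objects (an Array in Scala); avoid it on large DataFrames.
count(): Count the number of rows.
take(n): Return the first n rows as a list.
first(): Return the first row.
head(n): Return the first n rows; with no argument, return only the first row.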
2. Writing Data:
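write: Property that returns a DataFrameWriter for saving the DataFrame to external storage; nothing runs until a save method is called.
write.mode(): Specify the save mode (e.g., overwrite, append).
save(): Save the DataFrame to a specified path; this call triggers the write.
toJSON(): Convert each row to a JSON string. This is a lazy conversion rather than a write (in PySpark it returns an RDD of strings).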
3. Other Actions:
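foreach(): Apply a function to each row for its side effects; the function runs on the executors and returns nothing to the driver.
foreachPartition(): Apply a function once per partition; it receives an iterator over that partition's rows.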