Transformations and actions used in Apache Spark

Yogic24
Contributor III

List of transformations and actions used in Apache Spark DataFrames for a Data Engineering role:

Transformations:
Transformations are operations on DataFrames that return a new DataFrame. They are lazily evaluated, meaning they do not execute immediately but build a logical plan that is executed when an action is performed.

1. Basic Transformations:
select(): Select specific columns.
filter() / where(): Filter rows based on a condition (the two are equivalent).
withColumn(): Add or replace a column.
drop(*cols): Return a new DataFrame with the given columns removed.
distinct(): Remove duplicate rows.
sort() / orderBy(): Sort the DataFrame by one or more columns (the two are equivalent).

2. Aggregation and Grouping:
groupBy(): Group rows by column values.
agg(): Aggregate data using functions.
count(): Count rows.
sum(*cols): Compute the sum of each numeric column.
avg(*cols): Compute the average of each numeric column.
min(*cols): Compute the minimum value of each column.
max(*cols): Compute the maximum value of each column.

3. Joining DataFrames:
join(other, on=None, how=None): Join with another DataFrame using the given join expression.
union(): Combine two DataFrames with the same schema.
intersect(): Return the rows common to both DataFrames.

4. Advanced Transformations:
withColumnRenamed(): Rename a column.
dropDuplicates(): Drop duplicate rows, optionally considering only a subset of columns.
sample(): Sample a fraction of rows.
limit(): Limit the number of rows.

5. Window Functions:
over(windowSpec): Apply a function over the given window specification.
row_number().over(windowSpec): Assign a row number starting at 1 within each window partition.
rank().over(windowSpec): Provide the rank of rows within a window partition, with ties sharing a rank.

Actions:
Actions trigger the execution of the transformations and return a result to the driver program or write data to an external storage system.

1. Basic Actions:
show(): Display the top rows of the DataFrame.
collect(): Return all rows to the driver as a list.
count(): Count the number of rows.
take(n): Return the first n rows as a list.
first(): Return the first row.
head(n): Return the first n rows (or just the first row if n is omitted).

2. Writing Data:
write: Access the DataFrameWriter for writing the DataFrame to external storage.
write.mode(): Specify the save mode (e.g., overwrite, append).
save(): Save the DataFrame to a specified path.
toJSON(): Convert the DataFrame to an RDD of JSON strings.

3. Other Actions:
foreach(): Apply a function to each row.
foreachPartition(): Apply a function to each partition.

 

1 REPLY

ramyabodapati
New Contributor

Very informative, thanks 😊
