What is the difference between DataFrame.first(), head(), head(n), and take(n), show(), show(n)?

cfregly
Contributor
 


Sorted Data

If your data is sorted using either sort() or ORDER BY, these operations are deterministic (provided the sort key has no ties): first()/head() return the 1st element, and head(n)/take(n) return the top n.

show()/show(n) return Unit (void) and print the first rows in tabular form: up to 20 rows for show(), up to n rows for show(n).

These operations may require a shuffle if there are any aggregations, joins, or sorts in the underlying query.

Unsorted Data

If the data is not sorted, these operations are not guaranteed to return the 1st or top-n elements, and a shuffle may not be required.

show()/show(n) return Unit (void) and print rows in tabular form, in no particular order: up to 20 rows for show(), up to n rows for show(n).

If no shuffle is required (no aggregations, joins, or sorts), these operations are optimized to inspect only as many partitions as needed to satisfy the request, which is likely a much smaller subset of the dataset's partitions.

DivyaandData
Databricks Employee

These actions return data:

first(): Returns the first row of the DataFrame as a single Row.

head(): Same as first(); returns the first row.

head(n): Returns an array or list of the first n rows

take(n): Similar to head(n), it retrieves the first n rows and returns them as an array

These actions display data:

show(): Prints the first 20 rows in a tabular format

show(n): Prints the first n rows in a tabular format