Databricks Community

Nastia · ‎06-18-2024

I am trying to create a simple dlt pipeline:

@dlt.table
def today_latest_execution():
  return spark.sql("SELECT * FROM LIVE.last_execution")

@on_event_hook
def write_events_to_x(event):
  if (
    today_latest_execution().count() == 0
  ):
    try:
      ...

And I am getting an error:

'Dataset' object has no attribute 'count'

What I have tried: conversion to pandas (via toPandas() or to_pandas_on_spark) doesn't work, koalas doesn't work, using different functions (not spark.sql) doesn't work... I am stuck 😞

How can I make my function return a DataFrame instead of a Dataset?

-werners- · ‎06-18-2024

can you try count() instead of count (without brackets)?

PS. A DataFrame is a Dataset of type Row.

jacovangelder · ‎06-18-2024

You're missing the parentheses: count()

Nastia · ‎06-19-2024

@jacovangelder @-werners- , yes yes, it has () there, sorry, copied the code wrongly

error is still the same though 😞

jacovangelder · ‎06-19-2024

I only just noticed you are using DLT. My bad.

The @dlt.table decorator tells DLT to create a table that contains the result of the DataFrame your function returns.

Basically, you can't operate on the return value of the decorated function the way you're used to operating on a DataFrame. Instead, you need to operate on the DLT table it created, using dlt.read(<table_name>). For example, to count the rows of the table you've created, use dlt.read(<table_name>).count().

Example:

import dlt

@dlt.table
def test():
  # dlt.read() returns a DataFrame, so count() works here
  if dlt.read("today_latest_execution").count() >= 0:
    return dlt.read("today_latest_execution")

DLT works a lot differently from what you're used to when working with function return values.

Hope this helps!

Edit: argh, the forum kept turning the decorator into a tag for user Dlt. It should be lowercase @dlt.table, as written above.
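
To make the difference between calling the decorated function and reading the table by name more concrete, here is a toy model in plain Python that runs without Databricks. It is only an analogy: table, read, Dataset and Rows are hypothetical stand-ins written for this sketch, not the dlt module's real implementation, which may work quite differently.

```python
# Toy model only: NOT how the dlt module is implemented.
_registry = {}  # table name -> function that builds the rows


class Dataset:
    """Placeholder handed back by the decorator; deliberately has no count()."""

    def __init__(self, name):
        self.name = name


class Rows:
    """Stand-in for a DataFrame: holds rows and offers count()."""

    def __init__(self, rows):
        self.rows = list(rows)

    def count(self):
        return len(self.rows)


def table(fn):
    # Register the query function, then replace it with a callable that
    # returns a placeholder instead of running the query.
    _registry[fn.__name__] = fn
    return lambda: Dataset(fn.__name__)


def read(name):
    # Look the table up by name and materialise its rows.
    return Rows(_registry[name]())


@table
def today_latest_execution():
    return [{"run_id": 1}, {"run_id": 2}]


try:
    today_latest_execution().count()
except AttributeError as err:
    print(err)  # 'Dataset' object has no attribute 'count'

print(read("today_latest_execution").count())  # 2
```

In this sketch the decorator swaps the function for something that only describes the table, which is why count() is missing on the direct call but available after read(). Under that assumption, the same pattern would explain the error in the original question.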

-werners- · ‎06-19-2024

glad I work in Scala and do not have to deal with DLT 😄

jacovangelder · ‎06-19-2024

Not a fan myself either! It seems DLT is getting a big rebrand with LakeFlow around the corner. In my experience DLT was never that widely adopted.

-werners- · ‎06-19-2024

what if you do:
return spark.sql("SELECT * FROM LIVE.last_execution").toDF()

I keep getting dataset from spark.table command (instead of dataframe)
