<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: I keep getting dataset from spark.table command (instead of dataframe) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74952#M34826</link>
    <description>&lt;P&gt;I only just noticed you are using DLT. My bad.&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;@dlt.table&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;decorator tells DLT to create a table containing the result of the DataFrame your function returns&lt;/SPAN&gt;&lt;SPAN&gt;.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;Basically, you can't operate on the function's return value the way you're used to operating on a DataFrame; you need to operate on the DLT table it created, using&amp;nbsp;&lt;SPAN class=""&gt;dlt.read(&amp;lt;table_name&amp;gt;)&lt;/SPAN&gt;. If you want to run DataFrame operations on the table you've created, use, for example,&amp;nbsp;&lt;SPAN class=""&gt;dlt.read(&amp;lt;table_name&amp;gt;).count()&lt;/SPAN&gt;.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;@dlt.table
def test():
  if dlt.read("today_latest_execution").count() &amp;gt; 0:
    return dlt.read("today_latest_execution")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;DLT works quite differently from what you're used to with function return values.&lt;/P&gt;&lt;P&gt;Hope this helps!&lt;/P&gt;</description>
    <pubDate>Wed, 19 Jun 2024 09:37:59 GMT</pubDate>
    <dc:creator>jacovangelder</dc:creator>
    <dc:date>2024-06-19T09:37:59Z</dc:date>
    <item>
      <title>I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74871#M34801</link>
      <description>&lt;P&gt;I am trying to create a simple DLT pipeline:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;@dlt.table
def today_latest_execution():
  return spark.sql("SELECT * FROM LIVE.last_execution")

@on_event_hook
def write_events_to_x(event):
  if (
    today_latest_execution().count() == 0
  ):
    try:
      ...&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;And I am getting an error:&lt;/P&gt;&lt;P&gt;'Dataset' object has no attribute 'count'&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What I have tried: conversion to pandas (via toPandas() or to_pandas_on_spark) doesn't work, koalas doesn't work, using different functions (not spark.sql) doesn't work... I am stuck &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;&lt;P&gt;How can I make my function return a DataFrame instead of a Dataset?&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2024 08:26:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74871#M34801</guid>
      <dc:creator>Nastia</dc:creator>
      <dc:date>2024-06-19T08:26:49Z</dc:date>
    </item>
    <item>
      <title>Re: I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74874#M34803</link>
      <description>&lt;P&gt;Can you try count() (with brackets) instead of count (without brackets)?&lt;/P&gt;&lt;P&gt;PS: a DataFrame is a Dataset of type Row.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 14:10:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74874#M34803</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-06-18T14:10:31Z</dc:date>
    </item>
    <item>
      <title>Re: I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74905#M34810</link>
      <description>&lt;P&gt;You're missing the parentheses: count&lt;STRONG&gt;()&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2024 18:01:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74905#M34810</guid>
      <dc:creator>jacovangelder</dc:creator>
      <dc:date>2024-06-18T18:01:41Z</dc:date>
    </item>
    <item>
      <title>Re: I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74939#M34819</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102253"&gt;@jacovangelder&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;&amp;nbsp;, yes yes, it does have () there; sorry, I copied the code incorrectly.&lt;/P&gt;&lt;P&gt;The error is still the same though &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2024 08:28:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74939#M34819</guid>
      <dc:creator>Nastia</dc:creator>
      <dc:date>2024-06-19T08:28:19Z</dc:date>
    </item>
    <item>
      <title>Re: I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74944#M34823</link>
      <description>&lt;P&gt;what if you do:&lt;BR /&gt;return spark.sql("SELECT * FROM LIVE.last_execution")&lt;STRONG&gt;.toDF()&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2024 09:14:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74944#M34823</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-06-19T09:14:32Z</dc:date>
    </item>
    <item>
      <title>Re: I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74952#M34826</link>
      <description>&lt;P&gt;I only just noticed you are using DLT. My bad.&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;@dlt.table&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;decorator tells DLT to create a table containing the result of the DataFrame your function returns&lt;/SPAN&gt;&lt;SPAN&gt;.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;Basically, you can't operate on the function's return value the way you're used to operating on a DataFrame; you need to operate on the DLT table it created, using&amp;nbsp;&lt;SPAN class=""&gt;dlt.read(&amp;lt;table_name&amp;gt;)&lt;/SPAN&gt;. If you want to run DataFrame operations on the table you've created, use, for example,&amp;nbsp;&lt;SPAN class=""&gt;dlt.read(&amp;lt;table_name&amp;gt;).count()&lt;/SPAN&gt;.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;@dlt.table
def test():
  if dlt.read("today_latest_execution").count() &amp;gt; 0:
    return dlt.read("today_latest_execution")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;DLT works quite differently from what you're used to with function return values.&lt;/P&gt;&lt;P&gt;Hope this helps!&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2024 09:37:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74952#M34826</guid>
      <dc:creator>jacovangelder</dc:creator>
      <dc:date>2024-06-19T09:37:59Z</dc:date>
    </item>
    <item>
      <title>Re: I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74954#M34827</link>
      <description>&lt;P&gt;Glad I work in Scala and do not have to deal with DLT &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2024 09:42:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74954#M34827</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-06-19T09:42:03Z</dc:date>
    </item>
    <item>
      <title>Re: I keep getting dataset from spark.table command (instead of dataframe)</title>
      <link>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74955#M34828</link>
      <description>&lt;P&gt;Not a fan myself either! It seems DLT is getting a big rebrand with LakeFlow around the corner. In my experience DLT was never that widely adopted.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2024 09:44:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/i-keep-getting-dataset-from-spark-table-command-instead-of/m-p/74955#M34828</guid>
      <dc:creator>jacovangelder</dc:creator>
      <dc:date>2024-06-19T09:44:39Z</dc:date>
    </item>
  </channel>
</rss>

