How to convert column type from decimal to date in Spark SQL

pepevo
New Contributor III

I need to convert a column from decimal to date in Spark SQL when the format is not yyyy-MM-dd.

A table contains a column declared as decimal(38,0) whose values are in yyyyMMdd format, and I am unable to run date-filtered SQL queries on it in a Databricks notebook.

I have tried to_date(column_name) = date_sub(current_date(), 1) and it didn't work. I also tried from_unixtime(cast(column_name as string), 'yyyy-MM-dd') and to_date(cast(unix_timestamp(column_name, 'yyyy-MM-dd'))), and none of them work. from_unixtime(cast(column_name as string), 'yyyy-mm-dd') gives me the right format, but those dates do not exist in the table.

How can I convert this column to a date in SQL?


10 REPLIES

User16857282152
Contributor

Here is a working example.

In SQL:

create table temp1 as (select 20010101.00 date);

-- This creates a table with a single column named "date" with a datatype of decimal.

-- To verify run "describe temp1";

The to_date function takes a string as input, so first cast the decimal to a string.

select cast(date as String) from temp1;

Once you have a string, you can convert it to a date:

select to_date(cast(date as String), 'yyyyMMdd') date from temp1;
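
If the end goal is to filter by date, as in the original question, the same expression can go in a where clause. A minimal sketch, assuming a decimal yyyyMMdd column named load_dt_id in a table named table_name (hypothetical names here; the thread later identifies the actual column as load_dt_id):

-- Sketch: keep rows whose decimal yyyyMMdd column equals yesterday's date.
-- Casting to int first strips any fractional part such as .00 before to_date.
select count(*)
from table_name
where to_date(cast(cast(load_dt_id as int) as string), 'yyyyMMdd') = date_sub(current_date(), 1);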

You could do the same using the DataFrame API.

df1 = spark.sql("select 20010101.00 date")

Convert to string

df2 = df1.select(df1.date.cast("string"))

Drop everything to the right of the decimal point

from pyspark.sql.functions import substring_index

df3 = df2.select(substring_index(df2.date, '.', 1).alias('date'))

Convert String to Date

from pyspark.sql.functions import to_date

df4 = df3.select(to_date('date', 'yyyyMMdd').alias('date'))

pepevo
New Contributor III

Thank you for your input.

I tried selecting the column by itself from the table as you showed, and yes, the format is converted correctly from yyyyMMdd to yyyy-MM-dd in the output. But when I put it in the where clause as below:
select count(*) from table_name where to_date(cast(column_name as string),'yyyyMMdd') = date_sub(current_date(),1);

my output is 0 instead of more than 500,000 records.

User16857282152
Contributor

That seems strange.

Have you tried the reverse: convert the date you want to compare to a string, and then compare?

select * from table where (your column cast to string) = date_format(date_sub(current_date(),1), 'yyyyMMdd')
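
Spelled out with concrete names (load_dt_id and table_name, as used elsewhere in this thread), that might look like the following sketch:

-- Sketch: compare the raw decimal column, cast to string, against yesterday rendered as yyyyMMdd.
select count(*)
from table_name
where cast(load_dt_id as string) = date_format(date_sub(current_date(), 1), 'yyyyMMdd');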

pepevo
New Contributor III

I did, and the output is still 0.

* the column_name is load_dt_id

select count(*) from table_name WHERE to_date(cast(load_dt_id as string),'yyyyMMdd') = date_format(date_sub(current_date(),1),'yyyyMMdd');

There are 700 million records and the count can't be 0 if load_dt_id is sysdate -1.

The count from Oracle is at least 500 million when load_dt_id is sysdate - 1.

User16857282152
Contributor

I would start by debugging with a small collection of records.

First, count nulls; perhaps your import failed to read that column for some formatting reason.

Select count(*) from table where target_column is NOT NULL;

compare with

Select count(*) from table where target_column is NULL;

If the table is massive and costs are significant, combine the above into a single query.
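
For example, something along these lines (a sketch) returns both counts in one scan:

-- Sketch: count total, non-null, and null values of target_column in one pass.
select
  count(*) as total_rows,
  count(target_column) as non_null_rows,
  count(*) - count(target_column) as null_rows
from table;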

Then grab a few rows

Create table testTable as select * from sourceTable where target_column is not null LIMIT 10;

Note the limit clause; just grab a few rows.

That should give you an idea of what, if anything, is wrong with the actual data as imported versus what you expect the data to be.
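
Once you have those few rows, it can also help to look at the raw value and the converted value side by side. A sketch, using the testTable created above:

-- Sketch: show the raw decimal next to its string and date forms for a handful of rows.
select
  target_column,
  cast(target_column as string) as as_string,
  to_date(cast(target_column as string), 'yyyyMMdd') as as_date
from testTable;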

pepevo
New Contributor III

select count(*) from prty_cdim where load_dt_id is not null;

There are 732,734,948 records.

select count(*) from prty_cdim where load_dt_id is null;

0 records

create table testtable as select * from prty_cdim where load_dt_id is not null limit 10;
select count(*) from testtable;

Table testtable was created successfully with 10 rows in it.

select load_dt_id from testtable;

load_dt_id

20140419

20140419

..

20140419

10 rows

select * from testtable where to_date(cast(load_dt_id as string),'yyyyMMdd') = date_format(date_sub(current_date(),1),'yyyyMMdd');

It runs OK, but with a bulb icon next to the query (I don't know what it means). When I click the bulb image, it says: 'Accelerate queries with Delta: This query contains a highly selective filter. To improve the performance of queries, convert the table to Delta and run the OPTIMIZE ZORDER BY command on the table default.testtable.'

pepevo
New Contributor III

I have tried many date functions to compare with sysdate and had no luck, even with 10 records in a table. Is there a way I can compare sysdate as a decimal?

I even tried unix_timestamp and from_unixtime, and none of them work.

Why is Spark SQL not more friendly? What would the following where clause look like in Databricks:

WHERE to_date(trunc(load_dt_id),'yyyymmdd') = trunc(sysdate -1);

I tried "trunc(cast(load_dt_id as string),'yyyyMMdd') = date_format(date_sub(currentl_date(),1),'yyyyMMdd') ; "

or "

trunc(cast(load_dt_id as string),'yyyyMMdd') = trunc(date_sub(currentl_date(),1),'yyyyMMdd');

"

none of them are working

User16857282152
Contributor

ACCEPTED SOLUTION

You can run

"describe tablename;" to get the datatypes of a column.

I ran a few quick examples and found the following to work; I think you are probably close, with just some formatting issues.

Explore some of these examples and see if you can find your error.

Remember to delete any tables you create if you are on a shared workspace.

Create a table with a decimal as a column named id.

create table temp6 as (Select 20140419.00 id);

describe temp6;

-- to show it is indeed a decimal

Select from it and note that SQL handles the int-to-float comparison for you:

%sql select * from temp6 where id = 20140419;

It also casts strings for you:

%sql select * from temp6 where id = "20140419";

Do the same with ints using

create table temp7 as (Select 20140419 id);

This last example might get you close,

select * from temp6 where id < replace(cast(current_date() as String), '-');

Break that down by running

select current_date()

Then cast to string

select cast(current_date() as String);

Then drop the dashes

select replace(cast(current_date() as String), '-')

Then use that in your where clause, but modify as needed to suit your logic.

Note that mine worked when I compared an int or a decimal to a string; Spark SQL casts as appropriate.
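
Applying that approach to the predicate from earlier in the thread (load_dt_id equal to sysdate - 1), a sketch might be:

-- Sketch: compare the yyyyMMdd column, cast to string, against yesterday with the dashes removed.
select count(*)
from table_name
where cast(load_dt_id as string) = replace(cast(date_sub(current_date(), 1) as string), '-');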

pepevo
New Contributor III

Thank you, Tom. I made it work already.

User16857282152
Contributor

Glad to hear that.
