02-10-2020 07:23 AM
How do I convert a column from decimal to date in Spark SQL when the data is not in yyyy-MM-dd format?
A table contains a column declared as decimal(38,0) whose values are in yyyyMMdd format, and I am unable to run date queries on it in a Databricks notebook.
I have tried to_date(column_name) = date_sub(current_date(),1) and it didn't work. I also tried from_unixtime(cast(column_name as string), 'yyyy-MM-dd') and to_date(cast(unix_timestamp(column_name, 'yyyy-MM-dd') as timestamp)), and none of them work. from_unixtime(cast(column_name as string), 'yyyy-MM-dd') gives me the right format, but the resulting values don't match any data in the table.
How can I convert this column to a date type in SQL?
02-10-2020 08:24 AM
Here is a working example,
In SQL,
create table temp1 as (select 20010101.00 date);
-- This creates a table with a single column named "date" with a datatype of decimal.
-- To verify run "describe temp1";
The to_date function takes a string as input, so first cast the decimal to a string.
select cast(date as String) from temp1;
Once you have a string you can push to a date,
select to_date(cast(date as String), 'yyyyMMdd') date from temp1;
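If you want to filter on the converted value, the same expression works in a where clause. A minimal sketch against the temp1 table above (the literal 20010101 is just the value we inserted):
select * from temp1 where to_date(cast(date as String), 'yyyyMMdd') = to_date('20010101', 'yyyyMMdd');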
You could do the same using the DataFrame API.
df1 = spark.sql("select 20010101.00 date")
Convert to string
df2 = df1.select(df1.date.cast("string"))
Drop right of decimal
from pyspark.sql.functions import substring_index
df3 = df2.select(substring_index(df2.date, '.', 1).alias('date'))
Convert String to Date
from pyspark.sql.functions import to_date
df4 = df3.select(to_date('date', 'yyyyMMdd').alias('date'))
02-10-2020 08:44 AM
Thank you for your input.
I tried selecting the column by itself from the table as you showed, and yes, the format converts correctly from yyyyMMdd to yyyy-MM-dd in the output. But when I put it in the where clause as below:
select count(*) from table_name where to_date(cast(column_name as string), 'yyyyMMdd') = date_sub(current_date(), 1);
my output is 0 instead of the more than 500,000 records I expect.
02-10-2020 10:53 AM
That seems strange,
Have you tried the reverse: convert the date you want to compare to a string, and then compare?
select * from table where (your column cast to string) = date_format(date_sub(current_date(),1), 'yyyyMMdd')
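With the placeholder names from your query (assuming the column holds yyyyMMdd values), that would look something like:
select count(*) from table_name where cast(column_name as string) = date_format(date_sub(current_date(), 1), 'yyyyMMdd');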
02-10-2020 11:03 AM
I did, and the output is still 0.
* the column_name is load_dt_id
select count(*) from table_name WHERE to_date(cast(load_dt_id as string), 'yyyyMMdd') = date_format(date_sub(current_date(), 1), 'yyyyMMdd');
There are 700 million records, and the count can't be 0 if load_dt_id is sysdate - 1.
The count from Oracle is at least 500 million when load_dt_id is sysdate - 1.
02-11-2020 09:14 AM
I would start by debugging a small collection of records,
First count nulls; perhaps your import failed to read that column because of a formatting issue.
Select count(*) from table where target_column is NOT NULL;
compare with
Select count(*) from table where target_column is NULL;
If the table is massive and costs are significant, combine the above into a single query.
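A sketch of that as a single pass over the table (same placeholder names as the examples here; count(col) only counts non-null values):
select count(*) as total_rows, count(target_column) as non_null_rows, count(*) - count(target_column) as null_rows from sourceTable;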
Then grab a few rows
Create table testTable as select * from sourceTable where target_column is not null LIMIT 10;
Note the limit clause, just grab a few rows.
That should give you an idea of what, if anything, is wrong with the actual data as imported versus what you expect the data to be.
02-11-2020 10:08 AM
select count(*) from prty_cdim where load_dt_id is not null;
there are 732734948 records
select count(*) from prty_cdim where load_dt_id is null;
0 records
create table testtable as select * from prty_cdim where load_dt_id is not null limit 10;
select count(*) from testtable;
Table testtable was successfully created with 10 rows in it.
select load_dt_id from testtable;
load_dt_id
20140419
20140419
..
20140419
10 rows
select * from testtable where to_date(cast(load_dt_id as string),'yyyyMMdd') = date_format(date_sub(current_date(),1),'yyyyMMdd');
The query returns OK but no rows, and there is a lightbulb icon next to it (I don't know what it means). When I click the lightbulb, it says: 'Accelerate queries with Delta: This query contains a highly selective filter. To improve the performance of queries, convert the table to Delta and run the OPTIMIZE ZORDER BY command on the table default.testtable'.
02-12-2020 09:52 AM
I have tried many date functions to compare with sysdate and had no luck, even with only 10 records in the table. Is there a way I can compare sysdate as a decimal?
Even unix_timestamp and from_unixtime don't work.
Why isn't Spark SQL friendlier? What would the following Oracle where clause look like in Databricks:
WHERE to_date(trunc(load_dt_id),'yyyymmdd') = trunc(sysdate -1);
I tried "trunc(cast(load_dt_id as string),'yyyyMMdd') = date_format(date_sub(currentl_date(),1),'yyyyMMdd') ; "
or "
trunc(cast(load_dt_id as string),'yyyyMMdd') = trunc(date_sub(currentl_date(),1),'yyyyMMdd');
"
none of them are working
02-13-2020 06:42 AM
You can run
"describe tablename;" to get the datatypes of the columns.
I ran a few quick examples and found the following to work; I think you are probably close, with just some formatting issues.
Explore some of these examples and see if you can find your error.
Remember to delete any tables you create if you are on a shared workspace.
Create a table with a decimal column named id.
create table temp6 as (Select 20140419.00 id);
describe temp6;
-- to show it is indeed a decimal
Select from it and note that SQL handles the int-to-decimal comparison for you.
%sql select * from temp6 where id = 20140419;
It also casts strings for you.
%sql select * from temp6 where id = "20140419";
Do the same with ints using
create table temp7 as (Select 20140419 id);
This last example might get you close,
select * from temp6 where id < replace(cast(current_date() as String), '-');
Break that down by running
select current_date()
Then cast to string
select cast(current_date() as String);
Then drop the dashes
select replace(cast(current_date() as String), '-')
Then use that in your where clause, but modify as needed to suit your logic.
Note mine worked when I compared an int or a decimal to a string; Spark SQL casts as appropriate.
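Putting that together for your case, a sketch of the where clause for yesterday's rows (table_name is the placeholder from your earlier posts, load_dt_id is your decimal yyyyMMdd column) would be something like:
select count(*) from table_name where load_dt_id = replace(cast(date_sub(current_date(), 1) as String), '-');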
02-13-2020 11:35 AM
Thank you, Tom. I already got it working.
02-13-2020 02:37 PM
Glad to hear that,