Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

yopbibo
by Contributor II
  • 5558 Views
  • 3 replies
  • 4 kudos

Resolved! Column name, starting with a number

Hi, I see it is possible to start a column name with a number, like `123_test`, and store it in a Hive table with a Delta location. In that documentation https://www.stitchdata.com/docs/destinations/databricks-delta/reference#transformations--column-nami...
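
For reference, column names that start with a digit generally need backtick quoting when referenced from SQL. A minimal PySpark sketch with hypothetical data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A column name starting with a digit is accepted by the DataFrame API...
df = spark.createDataFrame([(1,), (2,)], ["123_test"])

# ...but must be backtick-quoted when referenced from SQL.
df.createOrReplaceTempView("t")
spark.sql("SELECT `123_test` FROM t").show()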

Latest Reply
yopbibo
Contributor II
  • 4 kudos

Ha ha, yes, I am trying to find that page in the Databricks documentation again. If you have it, please share.

2 More Replies
lizou
by Contributor II
  • 3655 Views
  • 2 replies
  • 2 kudos

Resolved! How to find the identity column seed value?

How do you find the identity column seed value? A seed value is required when we specifically need to start generating new values from a given number (most likely we need to keep the original key values when data is reloaded from another source, and any new ...

Latest Reply
lizou
Contributor II
  • 2 kudos

Found it, thanks! Of course, it would be nice to have a SQL function available to query the value. Example:
"delta.identity.start": 984888, "delta.identity.highWaterMark": 1004409, "comment": "identity", "delta.identity.step": 1}
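
One way to get at those values programmatically, since they live in the Delta column metadata: a minimal PySpark sketch, assuming a hypothetical table name and that your Delta/runtime version surfaces the delta.identity.* keys on the schema fields:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table name; the delta.identity.* keys mirror the JSON above.
for field in spark.table("my_schema.my_table").schema.fields:
    md = field.metadata
    if "delta.identity.start" in md:
        print(field.name, md["delta.identity.start"], md.get("delta.identity.highWaterMark"))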

1 More Replies
dzlab
by New Contributor
  • 807 Views
  • 0 replies
  • 0 kudos

Determine what is the interval in a timestamp column

OK, so I'm trying to determine whether a timestamp column has a regular interval or not, i.e. whether the difference between each consecutive value is the same across the entire column. I tried something like this:
val timeColumn: String =
val groupByColumn: String...
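
No replies yet, but one way to check this is to compare each value with its predecessor via a window function and count the distinct gaps. A minimal PySpark sketch with hypothetical table and column names:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # hypothetical table with a "ts" timestamp column

# Difference between each value and the previous one, in seconds.
w = Window.orderBy("ts")
diffs = (
    df.select(F.col("ts").cast("long").alias("epoch"))
      .withColumn("gap", F.col("epoch") - F.lag("epoch").over(w))
      .where(F.col("gap").isNotNull())
)

# The column has a regular interval iff every consecutive gap is identical.
is_regular = diffs.select("gap").distinct().count() == 1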

guruv
by New Contributor III
  • 18718 Views
  • 4 replies
  • 5 kudos

Resolved! parquet file to include partitioned column in file

Hi, I have a daily scheduled job which processes the data and writes it as parquet files in a specific folder structure like root_folder/{CountryCode}/parquetfiles, where each day the job will write new data for a country code under that country code's folder. I am...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 5 kudos

Most external consumers will read the partition as a column when they are properly configured (for example Azure Data Factory or Power BI). The only way around it is to duplicate the column under another name (you cannot have the same name, as it will generate conf...
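
A minimal PySpark sketch of that duplication trick, with hypothetical table, column, and path names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("daily_data")  # hypothetical source with a "CountryCode" column

# Partition by a copy of the column so the original value is also written
# inside the parquet files, not just encoded in the directory names.
(
    df.withColumn("country_code_part", F.col("CountryCode"))
      .write.mode("append")
      .partitionBy("country_code_part")
      .parquet("/mnt/root_folder")  # hypothetical output path
)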

3 More Replies
prasadvaze
by Valued Contributor II
  • 20788 Views
  • 7 replies
  • 3 kudos

Resolved! How to make delta table column values case-insensitive?

We have many Delta tables with string columns as the unique key (PK in a traditional relational DB), and we don't want to insert a new row just because the key value differs only in case. It's a lot of code change to use an upper/lower function on every column value compare (in ...

Latest Reply
lizou
Contributor II
  • 3 kudos

Well, the unintended benefit is that I am now using int/bigint surrogate keys for all tables (preferred in a DW). All joins are made on integer data types, and query efficiency is also improved. The string matching using upper() is done only in ETL when com...
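
For anyone wiring up that upper() comparison in ETL, a minimal sketch of a case-insensitive MERGE, with hypothetical table and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Match on the uppercased key so rows differing only in case hit the
# WHEN MATCHED branch instead of inserting a near-duplicate.
spark.sql("""
    MERGE INTO dim_customer AS t
    USING staging_customer AS s
    ON upper(t.customer_key) = upper(s.customer_key)
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")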

6 More Replies
Erik
by Valued Contributor III
  • 5425 Views
  • 6 replies
  • 7 kudos

Databricks query performance when filtering on a column correlated to the partition-column

(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question.) Setting: Delta Lake, Databricks SQL compute used by Power BI. I am wondering about the following scenario: We have a column `timest...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 7 kudos

In the query I would just filter first by date (generated from the timestamp we want to query) and then by the exact timestamp, so it will get the partitioning benefit.
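
A minimal sketch of that pattern, assuming a hypothetical table partitioned by a `date` column derived from `timestamp`:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The predicate on the partition column prunes partitions first; the exact
# timestamp range then narrows rows within the surviving partitions.
df = spark.sql("""
    SELECT *
    FROM events
    WHERE date = DATE'2021-11-05'
      AND timestamp BETWEEN TIMESTAMP'2021-11-05 02:00:00'
                        AND TIMESTAMP'2021-11-05 03:00:00'
""")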

5 More Replies
User16790091296
by Contributor II
  • 7646 Views
  • 1 replies
  • 0 kudos

How to add a new datetime column to a spark dataFrame from existing timestamp column

I have a data frame in Spark that has a column timestamp. I want to add a new column to this data frame that has the DateTime created from this existing timestamp column, in the format “YYYY-MM-DD HH:MM:SS”.

Latest Reply
Srikanth_Gupta_
Databricks Employee
  • 0 kudos

val df = Seq(("2021-11-05 02:46:47.154410"), ("2019-10-05 2:46:47.154410")).toDF("old_column")
display(df)

import org.apache.spark.sql.functions._
val df2 = df.withColumn("new_column", from_unixtime(unix_timestamp(col("old_column"), "yyyy-MM-dd HH:mm:ss....
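
The same idea in Python, as a minimal self-contained sketch (using a cast plus date_format rather than the truncated pattern above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2021-11-05 02:46:47.154410",), ("2019-10-05 2:46:47.154410",)],
    ["old_column"],
)

# Cast the string to a timestamp, then render it at seconds precision
# in the "yyyy-MM-dd HH:mm:ss" shape the question asks for.
df2 = df.withColumn(
    "new_column",
    F.date_format(F.col("old_column").cast("timestamp"), "yyyy-MM-dd HH:mm:ss"),
)
df2.show(truncate=False)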

User16826992666
by Valued Contributor
  • 4097 Views
  • 1 replies
  • 0 kudos

How do I choose which column to partition by?

I am in the process of building my data pipeline, but I am unsure of how to choose which fields in my data I should use for partitioning. What should I be considering when choosing a partitioning strategy?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The important factors when deciding on partition columns are:
  • Even distribution of data.
  • Choose a column that is commonly or widely accessed or queried.
  • Do not create multiple levels of partitioning, as you can end up with a large number of small files.
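
Putting those factors together, a minimal PySpark sketch (hypothetical table and column names) that partitions on a single, evenly distributed, frequently filtered derived date:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # hypothetical source with an "event_ts" timestamp

# One partition level on a commonly filtered, evenly distributed column,
# rather than nesting several levels and producing many small files.
(
    df.withColumn("event_date", F.to_date("event_ts"))
      .write.format("delta")
      .partitionBy("event_date")
      .saveAsTable("events_partitioned")
)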

bhosskie
by New Contributor
  • 15594 Views
  • 9 replies
  • 0 kudos

How to merge two data frames column-wise in Apache Spark

I have the following two data frames, which have just one column each and the exact same number of rows. How do I merge them so that I get a new data frame which has the two columns and all rows from both data frames? For example, df1: +-----+...

Latest Reply
AmolZinjade
New Contributor II
  • 0 kudos

@bhosskie
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlDF1 = spark.sql("select count(*) as Total FROM user_summary")
sqlDF2 = sp...
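
Since DataFrames have no intrinsic row order, another common approach is to attach an explicit index to each frame and join on it. A minimal sketch:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["a"])
df2 = spark.createDataFrame([("x",), ("y",), ("z",)], ["b"])

# row_number over monotonically_increasing_id gives each row a dense index
# that preserves the current ordering of same-sized, unshuffled frames.
def with_index(df):
    w = Window.orderBy(F.monotonically_increasing_id())
    return df.withColumn("_idx", F.row_number().over(w))

merged = with_index(df1).join(with_index(df2), "_idx").drop("_idx")
merged.show()  # columns: a, b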

8 More Replies
kruhly
by New Contributor II
  • 33913 Views
  • 12 replies
  • 0 kudos

Resolved! Is there a better method to join two dataframes and not have a duplicated column?

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straightforward because the real data may have many columns, or the column names may not be known. A simple example below:
llist = [(...

Latest Reply
TejuNC
New Contributor II
  • 0 kudos

This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this: SELECT * FROM a JOIN b ON joinExprs. If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you c...
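
Note that joining on a list of column names (rather than an expression) keeps a single copy of the join column. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "right_val"])

# Passing ["id"] instead of df1.id == df2.id deduplicates the join column.
joined = df1.join(df2, ["id"])
joined.show()  # columns: id, left_val, right_val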

11 More Replies