Data Engineering

Forum Posts

by rishigc • New Contributor

04-25-2019 9:43:45 AM

12063 Views
1 reply
0 kudos

Split a row into multiple rows based on a column value in Spark SQL

Hi, I am trying to split a record in a table into 2 records based on a column value. Please refer to the sample below. The input table displays the 3 types of Product and their price. Notice that for a specific Product (row) only its corresponding col...

Latest Reply

mathan_pillai
Valued Contributor

04-26-2019 3:31:30 AM

0 kudos

Hi @rishigc, you can use something like the below. Note that split takes a regular expression, so the + is written as '[+]' here. SELECT explode(arrays_zip(split(Product, '[+]'), split(Price, '[+]'))) as product_and_price from df or df.withColumn("product_and_price", explode(arrays_zip(split(Product, '[+]'), split(Price, '[+]')))).select( ...
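
To make the split / arrays_zip / explode idea in this reply concrete, here is a plain-Python sketch of the same logic, with no Spark involved. It assumes the Product and Price values are '+'-delimited strings, which is what the reply implies; the function name and the sample values are made up for illustration.

```python
import re


def explode_products(product, price, sep="+"):
    """Split two delimited strings and pair them up, one output row per pair."""
    # "+" is a regex quantifier, so it has to be escaped before it is used as a pattern.
    pattern = re.escape(sep)
    products = re.split(pattern, product)
    prices = re.split(pattern, price)
    # zip() pairs the N-th product with the N-th price and stops at the shorter list.
    return [{"Product": p.strip(), "Price": q.strip()} for p, q in zip(products, prices)]


# Hypothetical input row: two products and two prices packed into one record.
rows = explode_products("A+B", "100+200")
for row in rows:
    print(row)
```

Python's zip() silently drops unmatched trailing values, so it is worth checking how your Spark version treats arrays of unequal length before relying on the same behaviour there.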

by siddhu308 • New Contributor II

04-22-2019 1:36:13 AM

4785 Views
2 replies
0 kudos

column wise sum in PySpark dataframe

I have a dataframe of 18,000,000 rows and 1,322 columns containing '0' and '1' values. I want to find how many '1's are in each column. Below is the dataset: se_00001 se_00007 se_00036 se_00100 se_0010p se_00250

Latest Reply

mathan_pillai
Valued Contributor

04-23-2019 7:41:14 AM

0 kudos

Hi Siddhu, you can use df.select(sum("col1"), sum("col2"), sum("col3")), where sum is the one from pyspark.sql.functions (not the Python builtin) and col1, col2, col3 are the column names for which you would like to find the sum. Please let us know if it answers your question. Thanks
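
With 1,322 columns nobody wants to type the sums by hand, so one option is to generate the expressions. The sketch below only builds Spark SQL expression strings in plain Python; the helper name is made up, and it assumes every column holds nothing but 0/1 values (as numbers or as strings).

```python
def sum_expressions(columns):
    """Return one Spark SQL expression string per column, each summing that column."""
    # CAST covers the case where the 0/1 flags are stored as strings.
    # Backticks quote the column name in Spark SQL.
    return [f"sum(CAST(`{name}` AS INT)) AS `{name}`" for name in columns]


# Column names taken from the question; with a real DataFrame use df.columns instead.
columns = ["se_00001", "se_00007", "se_00036"]
exprs = sum_expressions(columns)
for expr in exprs:
    print(expr)

# With a live DataFrame this would be passed on as: df.selectExpr(*exprs)
```

Because the values are only 0 and 1, the sum of a column is the same as the count of '1's in it.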

1 More Replies

by Pascalvan_Belle • New Contributor

04-16-2019 11:50:04 PM

6315 Views
1 reply
0 kudos

How to create a surrogate key sequence which I can use in SCD cases?

Hi Community, I would like to know if there is an option to create an integer sequence which persists even if the cluster is shut down. My target is to use this integer value as a surrogate key to join different tables or do Slowly changing dimensio...

Latest Reply

girivaratharaja
New Contributor III

04-17-2019 2:43:39 PM

0 kudos

Hi @pascalvanbellen, there is no concept of FK, PK, SK in Spark. But Databricks Delta automatically takes care of SCD type scenarios. https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html#slowly-changing-data-scd-type-2 ...

by srchella • New Contributor

03-04-2019 11:58:17 PM

2130 Views
1 reply
0 kudos

How to take distinct of multiple columns (more than 2 columns) in a PySpark dataframe?

I have 10+ columns and want to take distinct rows, taking multiple columns into consideration. How can I achieve this using PySpark dataframe functions?

Latest Reply

Sandeep
Contributor III

03-28-2019 8:06:05 AM

0 kudos

You can use dropDuplicates: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=distinct#pyspark.sql.DataFrame.dropDuplicates
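
In PySpark the call takes a list of column names, along the lines of df.dropDuplicates(["col_a", "col_b", "col_c"]) (column names hypothetical). What "distinct by a subset of columns" means can be sketched in plain Python without Spark; the function and sample rows below are made up for illustration.

```python
def drop_duplicates(rows, subset):
    """Keep the first row seen for each distinct combination of the subset columns."""
    seen = set()
    kept = []
    for row in rows:
        # The key is built only from the columns that define a duplicate.
        key = tuple(row[column] for column in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows: the first two agree on (a, b) and differ only in c.
rows = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 1, "b": "x", "c": 20},
    {"a": 2, "b": "x", "c": 30},
]
deduped = drop_duplicates(rows, ["a", "b"])
for row in deduped:
    print(row)
```

This sketch keeps the first row it meets. On a distributed dataframe there is no such ordering to lean on, so do not rely on which of the duplicate rows survives.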

by cfregly • Contributor

03-09-2015 5:27:56 PM

10367 Views
15 replies
0 kudos

What is the difference between registerTempTable() and saveAsTable()?

Latest Reply

wildhogg
New Contributor II

03-28-2019 6:39:55 AM

0 kudos

Well, after a little bit of research, I found the post below. Hopefully this will help. "registerTempTable(): registerTempTable() creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's high...

14 More Replies

by DavidWrench • New Contributor II

10-08-2018 10:52:19 AM

12679 Views
4 replies
0 kudos

Displaying HTML Output

I am trying to display the HTML output, or read in an HTML file to display, in a Databricks notebook from pandas-profiling. import pandas as pd import pandas_profiling df = pd.read_csv("/dbfs/FileStore/tables/my_data.csv", header='infer', parse_dates=Tru...

Latest Reply

Bendu_Preez
New Contributor II

03-23-2019 5:06:04 PM

0 kudos

What eventually worked for me was displayHTML(profile.to_html()) for the pandas_profiling and displayHTML(profile.html) for the spark_profiling.

3 More Replies

by AdamArold • New Contributor

03-29-2018 5:05:26 AM

4290 Views
4 replies
0 kudos

How can I integrate DataBricks into PyCharm?

Editing notebooks on DataBricks is rather cumbersome because it lacks a lot of features IDEs like PyCharm have. Another problem is that a DataBricks notebook comes with some local state which is not present on my computer. How can I edit notebooks ...

Latest Reply

SimonD_Morias
New Contributor II

03-21-2019 1:47:41 AM

0 kudos

The documents are out for databricks-connect: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html I've also written up a few limitations I have found, some with workarounds: https://datathirst.net/blog/2019/3/7/databricks-co...

3 More Replies

by microamp • New Contributor II

01-26-2018 2:52:59 AM

8560 Views
12 replies
0 kudos

Azure Data Lake Config Issue: No value for dfs.adls.oauth2.access.token.provider found in conf file.

Hi, I have files hosted on an Azure Data Lake Store which I can connect from Azure Databricks configured as per instructions here. I can read JSON files fine, however, I'm getting the following error when I try to read an Avro file. spark.read.format("c...

Latest Reply

User16301467523
New Contributor II

06-11-2018 3:46:47 PM

0 kudos

Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options. Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sou...

11 More Replies

by PranjalThapar • New Contributor

05-04-2017 12:52:26 PM

5458 Views
4 replies
0 kudos

Splitting Date into Year, Month and Day, with inconsistent delimiters

I am trying to split my Date column, which is a String type right now, into 3 columns: Year, Month and Date. I use (PySpark): split_date=pyspark.sql.functions.split(df['Date'], '-') df= df.withColumn('Year', split_date.getItem(0)) df= df.wit...
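
Since the second argument of pyspark.sql.functions.split is a regular expression, one way to cope with mixed delimiters is a character class instead of a single '-'. The plain-Python sketch below shows the idea on individual strings. The delimiter set (-, / and .) and the year-first order are assumptions, because the post is cut off before it shows the data.

```python
import re

# Assumed delimiters: hyphen, slash and dot. Adjust the class to the real data.
DELIMITERS = re.compile(r"[-/.]")


def split_date(value):
    """Split a year-first date string into (year, month, day) strings."""
    parts = DELIMITERS.split(value.strip())
    if len(parts) != 3:
        raise ValueError(f"expected 3 date parts, got {len(parts)}: {value!r}")
    year, month, day = parts
    return year, month, day


for sample in ["2017-05-04", "2017/05/04", "2017.5.4"]:
    print(sample, "->", split_date(sample))
```

The same character class, '[-/.]', could be passed as the pattern in the split call from the question in place of '-'.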

Latest Reply

youssefassouli
New Contributor II

02-26-2019 3:05:19 AM

0 kudos

Thank you so much, that was helpful.

3 More Replies

by dan11 • New Contributor II

03-04-2016 8:46:20 PM

2337 Views
4 replies
1 kudos

sql delete?

Hello Databricks people, I started working with Databricks today. I have a SQL script which I developed with sqlite3 on a laptop. I want to port the script to Databricks. I started with two SQL statements: select count(prop_id) from prop0; del...

Latest Reply

Bill_Chambers
Contributor II

03-11-2016 9:57:05 AM

1 kudos

Hey Dan, good to hear you're getting started with Databricks. This is not a limitation of Databricks; it's a restriction built into Spark itself. Spark is not a data store, it's a distributed computation framework. Therefore deleting data would be un...

3 More Replies

by shampa • New Contributor

01-19-2019 11:00:31 PM

3982 Views
1 reply
0 kudos

How can we compare two dataframes in Spark Scala to find the differences between the 2 files: which column and which value?

I have two files and I created two dataframes prod1 and prod2 out of them. I need to find the records with column names and values that are not matching in both the dfs. id_sk is the primary key. All the cols are string datatype. dataframe 1 (prod1) id_...

Latest Reply

manojlukhi
New Contributor II

02-04-2019 10:14:48 PM

0 kudos

Use a full outer join in Spark SQL.
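
What a full outer join on the key gives you is every id_sk from either side, with both versions of the row next to each other, so mismatches can be read off column by column. The plain-Python sketch below shows that comparison without Spark. Only id_sk comes from the question; the other column names, the sample rows and the function name are made up.

```python
def diff_by_key(left, right, key="id_sk"):
    """Compare two lists of row dicts by key; return (key, column, left, right) tuples."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    diffs = []
    # The union of the keys is what makes this a full outer comparison.
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l_row, r_row = left_by_key.get(k), right_by_key.get(k)
        if l_row is None or r_row is None:
            diffs.append((k, "<missing row>",
                          "present" if l_row is not None else "absent",
                          "present" if r_row is not None else "absent"))
            continue
        for column, l_value in l_row.items():
            if column != key and l_value != r_row.get(column):
                diffs.append((k, column, l_value, r_row.get(column)))
    return diffs


# Hypothetical data: all values are strings, as stated in the question.
prod1 = [{"id_sk": "1", "name": "a", "price": "10"},
         {"id_sk": "2", "name": "b", "price": "20"}]
prod2 = [{"id_sk": "1", "name": "a", "price": "11"},
         {"id_sk": "3", "name": "c", "price": "30"}]

diffs = diff_by_key(prod1, prod2)
for d in diffs:
    print(d)
```

Rows that exist on one side only are reported separately from value mismatches, which is the case an inner join would silently hide.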

by ArielHerrera • New Contributor II

01-25-2019 10:33:12 AM

13126 Views
2 replies
0 kudos

Resolved! How to create blank target links in markdown to open url link in new tabs?

I am using markdown to include URL links. I am using the below markdown syntax: [link text](http://example.com) The issue is each time I click the linked text it opens the URL in the same tab as the notebook. I want the URL to open in a new ta...

Latest Reply

Anonymous
Not applicable

01-30-2019 10:25:51 AM

0 kudos

Hi @Ariel Herrera, you can just put an HTML anchor tag in a Databricks notebook cell. It will open a new tab when you click it. Please try the example below. It works for me in a Databricks notebook. %md <a href="https://google.com" target="_blank">google ...

1 More Replies

by cfregly • Contributor

05-03-2015 12:28:53 PM

5604 Views
5 replies
0 kudos

How can I view and change the SparkConf settings if the SparkContext (sc) is already provided for me?

Latest Reply

MatthewValenti
New Contributor II

01-13-2019 5:32:03 PM

0 kudos

This is an old post, however, is this still accurate for the latest version of Databricks in 2019? If so, how to approach the following? 1. Connect to many MongoDBs. 2. Connect to MongoDB when connection string information is dynamic (i.e. stored in s...

4 More Replies

by senthilkumar • New Contributor

01-16-2017 6:42:09 AM

13942 Views
1 reply
0 kudos

How does a filter condition work in a Spark dataframe?

I have a table in HBase with 1 billion records. I want to filter the records based on a certain condition (by date). For example: Dataframe.filter(col(date) === todayDate) Filter will be applied after all records from the table will be loaded into me...

Latest Reply

muk1
New Contributor II

12-19-2018 2:11:07 AM

0 kudos

Hello @senthil kumar, to pass external values to the filter (or where) transformations you can use the "lit" function in the following way: Dataframe.filter(col(date) === lit(todayDate)) (=== is the Scala column comparison; in PySpark it would be ==). Don't know if that helps. Be careful with the schema inferred by th...

by DominicRobinson • New Contributor II

12-11-2018 12:13:13 PM

8537 Views
4 replies
0 kudos

Issues with UTF-16 files and unicode characters

Can someone please offer some insight? I've spent days trying to solve this issue. We have the task of loading in hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is an international one and...

Latest Reply

User16817872376
New Contributor III

12-12-2018 2:05:09 PM

0 kudos

You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.
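
The decoding step itself is small. Below is a plain-Python sketch of what such a decoder could look like, working on the raw bytes of one file; the function name and the sample content are made up, and wiring it into a UDF or a file reader is left out.

```python
import codecs


def decode_utf16le_tsv(raw):
    """Decode UTF-16 little-endian bytes and split them into rows of tab-separated fields."""
    # Drop the byte-order mark if the file starts with one.
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]
    text = raw.decode("utf-16-le")
    # Normalise Windows line endings, then split into non-empty lines.
    lines = [line for line in text.replace("\r\n", "\n").split("\n") if line]
    return [line.split("\t") for line in lines]


# Hypothetical file content with non-ASCII characters, encoded the way the question describes.
sample = codecs.BOM_UTF16_LE + "id\tname\r\n1\tZoë\r\n2\t東京\r\n".encode("utf-16-le")
rows = decode_utf16le_tsv(sample)
for row in rows:
    print(row)
```

The decoder has to see the original bytes. If the file has already been read as UTF-8 text, the characters may be damaged before this function gets a chance to run.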

3 More Replies
