Data Engineering

Forum Posts

by tripplehay777 • New Contributor

09-01-2016 12:41:37 AM

10656 Views
1 reply
0 kudos

How can I create a table from a CSV file whose first column contains data in dictionary (JSON-like) format?

I have a CSV file whose first column contains data in dictionary form (key: value). [see below] I tried to create a table by uploading the CSV file directly to Databricks, but the file can't be read. Is there a way for me to flatten or conver...

Latest Reply

MaxStruever
New Contributor II

08-15-2019 12:37:19 PM

0 kudos

This is apparently a known issue. Databricks has its own CSV format handler that can handle this (https://github.com/databricks/spark-csv). SQL API: the CSV data source for Spark can infer data types: CREATE TABLE cars USING com.databricks.spark.csv OP...
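
As a rough illustration of the flattening the question asks about, here is a minimal sketch in plain Python. It is not the spark-csv approach from the reply. It assumes the first column holds a valid JSON object stored as a quoted CSV field; the sample data, the column names, and the `flatten_rows` helper are all hypothetical.

```python
import csv
import io
import json

# Hypothetical sample: the first column is a JSON object stored as a quoted CSV field.
SAMPLE_CSV = '''payload,city
"{""name"": ""Ann"", ""age"": 31}",Oslo
"{""name"": ""Bob"", ""age"": 45}",Lima
'''


def flatten_rows(csv_text, json_column):
    """Parse csv_text and merge the keys of json_column into each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        nested = json.loads(row.pop(json_column))  # decode the dictionary-like cell
        rows.append({**nested, **row})             # nested keys first, then the other columns
    return rows


flat = flatten_rows(SAMPLE_CSV, "payload")
for record in flat:
    print(record)
```

The flattened records could then be written back out as an ordinary CSV before uploading. If the cells are Python-style dictionaries rather than strict JSON, `json.loads` will reject them and a different parser would be needed.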

by tonyp • New Contributor II

02-12-2019 8:29:22 PM

13011 Views
1 reply
1 kudos

How to pass Python variables to a shell script?

How can I pass Python variables to a shell script in a Databricks notebook? Can the Python parameters be passed from the first cmd to the next %sh cmd?

Latest Reply

erikvisser1
New Contributor II

08-14-2019 7:46:45 AM

1 kudos

I found the answer here: https://stackoverflow.com/questions/54662605/how-to-pass-a-python-variables-to-shell-script-in-azure-databricks-notebookbles

Basically:

%python
import os
l = ['A','B','C','D']
os.environ['LIST'] = ' '.join(l)
print(os.getenv('L...
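
The environment-variable idea from this reply can be tried outside a notebook with a small sketch. Here `subprocess` stands in for the `%sh` cell, which is an assumption of the sketch and not a statement about how Databricks runs `%sh`. The `letters` name and the shell loop are illustrative only.

```python
import os
import subprocess

# Join the list into one space-separated string, as in the reply.
letters = ['A', 'B', 'C', 'D']
os.environ['LIST'] = ' '.join(letters)

# Stand-in for a %sh cell: a child shell inherits the parent's environment,
# so $LIST is visible to the script.
result = subprocess.run(
    ['sh', '-c', 'for i in $LIST; do echo "item: $i"; done'],
    capture_output=True,
    text=True,
    check=True,
)
shell_output = result.stdout
print(shell_output)
```

Environment variables only carry strings, so lists and other structures have to be serialized first (here with a space separator) and split again on the shell side.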

by EmilianoParizz1 • New Contributor

05-09-2019 11:24:29 AM

5450 Views
4 replies
0 kudos

How to set the timestamp format when reading CSV

I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11. I'm trying to parse a CSV file with a custom timestamp format but I don't know which datetime pattern format Spark uses. My CSV looks like this: Timestamp, Name, Va...

Latest Reply

wellington72019
New Contributor II

08-12-2019 11:46:52 PM

0 kudos

# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format....

3 More Replies

by martinch • New Contributor II

03-01-2019 7:40:53 AM

8674 Views
4 replies
0 kudos

DROP TABLE IF EXISTS does not work

When I try to run the command spark.sql("DROP TABLE IF EXISTS table_to_drop") and the table does not exist, I get the following error: AnalysisException: "Table or view 'table_to_drop' not found in database 'null';;\nDropTableCommand `table_to_drop...

Latest Reply

StevenWilliams
New Contributor II

07-30-2019 5:46:30 AM

0 kudos

I agree about this being a usability bug. The documentation clearly states that if the optional flag "IF EXISTS" is provided, the statement will do nothing. https://docs.databricks.com/spark/latest/spark-sql/language-manual/drop-table.html Drop Table ...

3 More Replies

by Dee • New Contributor

08-14-2018 10:21:15 PM

7476 Views
2 replies
0 kudos

Resolved! How to Change Schema of a Spark SQL

I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data to Databricks and wrote the commands as follows:

df = sqlContext.sql("SELECT * FROM people_json")
df.printSchema()
from pyspark.sql.types import *
data_schema =...

Latest Reply

bhanu2448
New Contributor II

07-20-2019 10:24:25 AM

0 kudos

http://www.bigdatainterview.com/

1 More Replies

by GuidoPereyra_ • New Contributor II

10-30-2018 10:35:01 AM

6297 Views
2 replies
0 kudos

Databricks Delta - UPDATE error

Hi, we got the following error when we tried to UPDATE a Delta table from concurrent notebooks that all end with an update to the same table: "com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added matching 'true' by a ...

Latest Reply

GuidoPereyra_
New Contributor II

06-21-2019 7:10:13 AM

0 kudos

Hi @matt@direction.consulting, I just found the following doc: https://docs.azuredatabricks.net/delta/isolation-level.html#set-the-isolation-level. In my case, I could fix it by partitioning the table, and I think that is the only way for concurrent updates in t...

1 More Replies

by kali_tummala • New Contributor II

06-06-2019 11:29:09 AM

5925 Views
5 replies
0 kudos

Why is Databricks Spark faster than AWS EMR Spark?

https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html Hi all, just wondering why Databricks Spark is a lot faster on S3 compared with AWS EMR Spark. Both systems are on Spark version 2.4. Does Databricks have ...

Latest Reply

RafiKurlansik
New Contributor III

06-11-2019 6:59:36 PM

0 kudos

I think you can get some pretty good insight into the optimizations on Databricks here: https://docs.databricks.com/delta/delta-on-databricks.html Specifically, check out the sections on caching, z-ordering, and join optimization. There's also a grea...

4 More Replies

by DanielAnderson • New Contributor

02-18-2018 3:57:10 AM

4566 Views
1 reply
0 kudos

"AmazonS3Exception: The bucket is in this region" error

I have read access to an S3 bucket in an AWS account that is not mine. For more than a year I've had a job successfully reading from that bucket using dbutils.fs.mount(...) and sqlContext.read.json(...). Recently the job started failing with the exc...

Latest Reply

Chandan
New Contributor II

06-07-2019 11:59:23 PM

0 kudos

@andersource Looks like the bucket is in us-east-1 but you've configured your AmazonS3 Cloud platform with us-west-2. Can you try configuring the client to use us-east-1? I hope it will work for you. Thank you

by User16301465121 • New Contributor

04-18-2015 8:21:24 AM

8407 Views
3 replies
0 kudos

How can I exit from a Notebook which is used as a job?

How can I quit from a notebook in the middle of an execution based on some condition?

Latest Reply

SamsonXia
New Contributor II

05-28-2019 11:32:20 AM

0 kudos

exit(value: String): void
Calling dbutils.notebook.exit in a job causes the notebook to complete successfully. If you want to cause the job to fail, throw an exception.

2 More Replies

by _not_provid1755 • New Contributor

03-18-2019 4:42:57 PM

4835 Views
3 replies
0 kudos

Write empty dataframe into csv

I'm writing my output (entity) data frame into a CSV file. The statement below works well when the data frame is non-empty: entity.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", "true").save(tempLocation) It's not working wh...

Latest Reply

mrnov
New Contributor II

05-07-2019 7:23:29 AM

0 kudos

The same problem here (similar code and the same behavior with Spark 2.4.0, running with spark-submit on Windows and on Linux): dataset.coalesce(1) .write() .option("charset", "UTF-8") .option("header", "true") .mode(SaveMod...

2 More Replies

by rishigc • New Contributor

04-25-2019 9:43:45 AM

12292 Views
1 reply
0 kudos

Split a row into multiple rows based on a column value in Spark SQL

Hi, I am trying to split a record in a table into 2 records based on a column value. Please refer to the sample below. The input table displays the 3 types of Product and their price. Notice that for a specific Product (row) only its corresponding col...

Latest Reply

mathan_pillai
Valued Contributor

04-26-2019 3:31:30 AM

0 kudos

Hi @rishigc, you can use something like the below. SELECT explode(arrays_zip(split(Product, '[+]'), split(Price, '[+]'))) as product_and_price from df or df.withColumn("product_and_price", explode(arrays_zip(split(Product, '[+]'), split(Price, '[+]')))).select( ...
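
To make the split, zip, and explode steps in this reply easier to follow, here is a plain-Python analogue (not Spark code). The input rows and the `explode_rows` helper are hypothetical, since the sample table in the question is cut off. The separator is written as the character class `[+]` because the split pattern is a regular expression, where a bare `+` is a quantifier and not a literal plus sign.

```python
import re

# Hypothetical input rows: '+'-separated products and prices in matching order.
rows = [
    {"Product": "A+B", "Price": "10+20"},
    {"Product": "C", "Price": "30"},
]


def explode_rows(rows, sep_pattern=r"[+]"):
    """Return one output row per (product, price) pair of each input row."""
    exploded = []
    for row in rows:
        products = re.split(sep_pattern, row["Product"])  # split the Product cell
        prices = re.split(sep_pattern, row["Price"])      # split the Price cell
        # Pair the i-th product with the i-th price; zip stops at the shorter list.
        for product, price in zip(products, prices):
            exploded.append({"Product": product, "Price": price})
    return exploded


result = explode_rows(rows)
for r in result:
    print(r)
```

Each input row yields as many output rows as it has product/price pairs, which is the effect the explode step is meant to achieve.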

by siddhu308 • New Contributor II

04-22-2019 1:36:13 AM

4918 Views
2 replies
0 kudos

column wise sum in PySpark dataframe

I have a dataframe of 18000000 rows and 1322 columns with '0' and '1' values. I want to find how many '1's are in every column. Below is the dataset: se_00001 se_00007 se_00036 se_00100 se_0010p se_00250

Latest Reply

mathan_pillai
Valued Contributor

04-23-2019 7:41:14 AM

0 kudos

Hi Siddhu, you can use df.select(sum("col1"), sum("col2"), sum("col3")), where col1, col2, col3 are the column names for which you would like to find the sum. Please let us know if it answers your question. Thanks

1 More Replies

by Pascalvan_Belle • New Contributor

04-16-2019 11:50:04 PM

6433 Views
1 reply
0 kudos

How to create a surrogate key sequence which I can use in SCD cases?

Hi Community, I would like to know if there is an option to create an integer sequence which persists even if the cluster is shut down. My target is to use this integer value as a surrogate key to join different tables or do Slowly changing dimensio...

Latest Reply

girivaratharaja
New Contributor III

04-17-2019 2:43:39 PM

0 kudos

Hi @pascalvanbellen, there is no concept of FK, PK, SK in Spark, but Databricks Delta automatically takes care of SCD type scenarios. https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html#slowly-changing-data-scd-type-2 ...

by srchella • New Contributor

03-04-2019 11:58:17 PM

2185 Views
1 reply
0 kudos

How to take distinct over multiple columns (more than 2 columns) in a PySpark dataframe?

I have 10+ columns and want to take distinct rows, taking multiple columns into consideration. How can I achieve this using PySpark dataframe functions?

Latest Reply

Sandeep
Contributor III

03-28-2019 8:06:05 AM

0 kudos

You can use dropDuplicates https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=distinct#pyspark.sql.DataFrame.dropDuplicates

by cfregly • Contributor

03-09-2015 5:27:56 PM

10636 Views
15 replies
0 kudos

What is the difference between registerTempTable() and saveAsTable()?

Latest Reply

wildhogg
New Contributor II

03-28-2019 6:39:55 AM

0 kudos

Well, after a little bit of research, I found the post below. Hopefully this will help. "registerTempTable(): registerTempTable() creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's high...

14 More Replies
