Hi,
I need some guidelines for a performance issue with Parquet files:
I am loading a set of Parquet files using: df = sqlContext.parquetFile(folder_path)
My Parquet folder is partitioned on 6 keys.
It was initially ok with a first sample of data...
Having a large number of small files or folders can significantly degrade the performance of loading the data. The best approach is to merge the folders/files so that each file is around 64 MB in size. There are different ways to achieve this: your writ...
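For illustration, a minimal compaction sketch, assuming a Databricks notebook where spark is in scope; the paths and the file count below are placeholders (in practice you would derive num_files from total input bytes / 64 MB, rounded up):

# Sketch only; paths and num_files are hypothetical placeholders.
folder_path = "/mnt/data/events"               # input folder of small files
compacted_path = "/mnt/data/events_compacted"  # output folder

df = spark.read.parquet(folder_path)
num_files = 16  # placeholder target so each output file lands near 64 MB
df.repartition(num_files).write.mode("overwrite").parquet(compacted_path)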
I'm executing the code below, using Python in a notebook, and it appears that the col() function is not being recognized.
I want to know whether the col() function belongs to a specific DataFrame library or Python library. I don't want to use pyspark...
@mudassar45@gmail.com
As the documentation describes, col() returns a generic column that is not yet associated with any DataFrame. Please refer to the code below.
display(peopleDF.select("firstName").filter("firstName = 'An'"))
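For reference, col() lives in pyspark.sql.functions; the snippet above avoids it by passing a string expression to filter, but the explicit form would be:

from pyspark.sql.functions import col

# col("firstName") is a free-standing Column expression; it only binds
# to a concrete DataFrame when used in select/filter on that DataFrame.
display(peopleDF.select(col("firstName")).filter(col("firstName") == "An"))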
My code:
// toDF on an RDD needs the Spark SQL implicits in scope
import spark.implicits._
val name = sc.textFile("/FileStore/tables/employeenames.csv")
case class x(ID: String, Employee_name: String) // note: unused; the lambda parameter x below shadows it
val namePairRDD = name.map(_.split(",")).map(x => (x(0), x(1).trim)).toDF("ID", "Employee_name")
namePairRDD.createOrRe...
Hi, I have the opposite issue. When I run an SQL query through the bulk download as per the standard prc fobasx notebook, the first row of data somehow gets attached to the column headers. When I import the CSV file into R using read_csv, R thinks ...
I have seen the following code:
val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername&password=yourPassword"
val df = sqlContext
.read
.format("jdbc")
.option("url", url)
.option("dbtable", "people")
.load()
But I ...
This doesn't seem to be supported. There is an alternative, but it requires using pyodbc and adding it to your init script. Details can be found here:
https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
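A minimal sketch of that pyodbc route, assuming the ODBC driver is already installed on the cluster (e.g. via an init script, as the linked post describes); every connection detail and the procedure name below are hypothetical placeholders:

import pyodbc

# Sketch only: server, database, credentials, and procedure name are
# all made-up placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server.database.windows.net;"
    "DATABASE=your_db;UID=your_user;PWD=your_password"
)
cursor = conn.cursor()
cursor.execute("EXEC dbo.your_stored_procedure ?", ("some_argument",))
conn.commit()
conn.close()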
I hav...
I have a CSV file whose first column contains data in dictionary form (key: value). [see below]
I tried to create a table by uploading the CSV file directly to Databricks, but the file can't be read. Is there a way for me to flatten or conver...
This is apparently a known issue; Databricks has its own CSV format handler which can handle this:
https://github.com/databricks/spark-csv
SQL API
CSV data source for Spark can infer data types:
CREATE TABLE cars
USING com.databricks.spark.csv
OP...
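If the dictionary column is valid JSON text, another way to flatten it is pyspark.sql.functions.from_json. A sketch only: the file path, the column name dict_col, and the key some_key are all hypothetical.

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import MapType, StringType

# Assumes the first column ("dict_col", a made-up name) holds
# JSON-style {"key": "value"} text; parse it into a map column,
# then pull individual keys out as flat columns.
df = spark.read.option("header", "true").csv("/FileStore/tables/your_file.csv")
parsed = df.withColumn(
    "parsed",
    from_json(col("dict_col"), MapType(StringType(), StringType()))
)
flat = parsed.select(col("parsed")["some_key"].alias("some_key"))  # hypothetical key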
I found the answer here: https://stackoverflow.com/questions/54662605/how-to-pass-a-python-variables-to-shell-script-in-azure-databricks-notebookbles
basically:
%python
import os
l = ['A', 'B', 'C', 'D']
os.environ['LIST'] = ' '.join(l)
print(os.getenv('LIST'))
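A shell cell in the same notebook then inherits the variable, since %sh runs on the same driver:

%sh
# $LIST was exported by the Python cell above.
echo $LIST
for item in $LIST; do echo "item: $item"; done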
I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11. I'm trying to parse a CSV file with a custom timestamp format, but I don't know which datetime pattern format Spark uses. My CSV looks like this:
Timestamp, Name, Va...
# in python: explicitly define the schema, read in CSV data using the schema and a defined timestamp format....
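A sketch of what that comment describes, with an assumed pattern for the Timestamp column; note that Spark 2.4 interprets timestampFormat using java.text.SimpleDateFormat symbols (Spark 3.x switched to java.time DateTimeFormatter patterns):

from pyspark.sql.types import StructType, StructField, TimestampType, StringType

# The pattern string and file path are assumptions for illustration.
schema = StructType([
    StructField("Timestamp", TimestampType(), True),
    StructField("Name", StringType(), True),
])
df = (spark.read
      .option("header", "true")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")  # assumed pattern
      .schema(schema)
      .csv("/FileStore/tables/your_file.csv"))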
When I try to run the command
spark.sql("DROP TABLE IF EXISTS table_to_drop")
and the table does not exist, I get the following error:
AnalysisException: "Table or view 'table_to_drop' not found in database 'null';;\nDropTableCommand `table_to_drop...
I agree that this is a usability bug. The documentation clearly states that if the optional "IF EXISTS" flag is provided, the statement does nothing when the table does not exist: https://docs.databricks.com/spark/latest/spark-sql/language-manual/drop-table.html Drop Table ...
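Until that's fixed, one workaround suggested by the "database 'null'" in the error is to qualify the table with a database (or select one first) so the lookup never resolves against a null database; "default" below is an assumption:

# Workaround sketch; the database name "default" is assumed.
spark.sql("DROP TABLE IF EXISTS default.table_to_drop")
# or set the current database first:
spark.sql("USE default")
spark.sql("DROP TABLE IF EXISTS table_to_drop")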
I am new to Spark and just started an online pyspark tutorial. I uploaded the JSON data in Databricks and wrote the commands as follows:
df = sqlContext.sql("SELECT * FROM people_json")
df.printSchema()
from pyspark.sql.types import *
data_schema =...
Hi,
We got the following error when we tried to UPDATE a Delta table from concurrent notebooks that all end with an update to the same table.
"
com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added matching 'true' by a ...
Hi @matt@direction.consulting
I just found the following doc: https://docs.azuredatabricks.net/delta/isolation-level.html#set-the-isolation-level.
In my case, I fixed it by partitioning the table, and I think that is the only way to allow concurrent updates in t...
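A sketch of that partitioning approach (the table path and the "date" column are hypothetical): partition the Delta table on the key each notebook owns, and make every UPDATE's predicate explicit on that key so the concurrent transactions touch disjoint files.

# Sketch only: sample data, path, and column names are made up.
df = spark.createDataFrame([("2019-01-01", "pending")], ["date", "status"])
(df.write
   .format("delta")
   .partitionBy("date")
   .save("/delta/events"))

spark.sql("""
  UPDATE delta.`/delta/events`
  SET status = 'done'
  WHERE date = '2019-01-01'  -- each notebook updates a different date
""")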
https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html
Hi All,
just wondering why Databricks Spark is a lot faster on S3 compared with AWS EMR Spark, when both systems are on Spark version 2.4. Does Databricks have ...
I think you can get some pretty good insight into the optimizations on Databricks here: https://docs.databricks.com/delta/delta-on-databricks.html
Specifically, check out the sections on caching, z-ordering, and join optimization. There's also a grea...
I have read access to an S3 bucket in an AWS account that is not mine. For more than a year I've had a job successfully reading from that bucket using dbutils.fs.mount(...) and sqlContext.read.json(...). Recently the job started failing with the exc...
@andersource
Looks like the bucket is in us-east-1 but you've configured your Amazon S3 client for us-west-2. Can you try configuring the client to use us-east-1?
I hope it will work for you. Thank you
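One way to try that (a sketch; the bucket and mount names are placeholders) is to point the S3A client at the bucket's actual region endpoint when mounting:

# Sketch only: bucket and mount point are hypothetical.
dbutils.fs.mount(
    source="s3a://your-bucket",
    mount_point="/mnt/your-bucket",
    extra_configs={"fs.s3a.endpoint": "s3.us-east-1.amazonaws.com"}
)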
exit(value: String): void
Calling dbutils.notebook.exit in a job causes the notebook to complete successfully. If you want to cause the job to fail, throw an exception.
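A minimal illustration of the two outcomes:

# Ends this notebook run and reports success to the calling job:
dbutils.notebook.exit("some result value")

# Alternatively, to make the job register a failure, raise instead:
raise Exception("failing the job on purpose")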
I'm writing my output (entity) data frame into csv file. Below statement works well when the data frame is non-empty.
entity.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", "true").save(tempLocation)
It's not working wh...
The same problem here (similar code and the same behavior with Spark 2.4.0, running with spark-submit on Windows and on Linux):
dataset.coalesce(1)
.write()
.option("charset", "UTF-8")
.option("header", "true")
.mode(SaveMod...
Hi,
I am trying to split a record in a table into 2 records based on a column value. Please refer to the sample below. The input table displays the 3 types of Product and their price. Notice that for a specific Product (row) only its corresponding col...
Hi @rishigc
You can use something like the code below. Note that split() takes a regular expression, so the literal '+' has to be escaped:
SELECT explode(arrays_zip(split(Product, '\\+'), split(Price, '\\+'))) AS product_and_price FROM df
or
df.withColumn("product_and_price", explode(arrays_zip(split(col("Product"), "\\+"), split(col("Price"), "\\+")))).select(
...
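For completeness, a self-contained PySpark version of the same pattern, with made-up sample data ('+'-delimited products and prices):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, arrays_zip, explode, col

spark = SparkSession.builder.getOrCreate()

# Made-up sample row whose Product/Price hold '+'-delimited lists.
df = spark.createDataFrame([("A+B", "10+20")], ["Product", "Price"])

# split() takes a regex, hence the escaped '+'. arrays_zip pairs the two
# arrays element-wise, and explode turns each pair into its own row.
zipped = df.withColumn(
    "product_and_price",
    explode(arrays_zip(split(col("Product"), "\\+"), split(col("Price"), "\\+"))),
)
result = zipped.select(
    col("product_and_price")["0"].alias("Product"),
    col("product_and_price")["1"].alias("Price"),
)
result.show()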