Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Yogi
by New Contributor III
  • 15602 Views
  • 15 replies
  • 0 kudos

Resolved! Can we pass Databricks output to an Azure Function body?

Hi, can anyone help me with Databricks and Azure Functions? I'm trying to pass Databricks JSON output to an Azure Function body in an ADF job. Is it possible? If yes, how? If not, what is the alternative for doing the same?

Latest Reply
AbhishekNarain_
New Contributor III
  • 0 kudos

You can now pass values back to ADF from a notebook, @Yogi. There is a size limit, though: if you are passing a dataset larger than 2MB, rather write it to storage and consume it directly with Azure Functions. You can pass the file path/ refe...
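
A minimal sketch of that pattern, assuming a Python cell at the end of the ADF-triggered notebook (the payload keys are hypothetical):

    import json

    # dbutils is available in Databricks notebooks; exit() returns this string
    # to the calling ADF Notebook activity (keep it well under the ~2MB limit).
    result = {"status": "ok", "row_count": 42}  # hypothetical payload
    dbutils.notebook.exit(json.dumps(result))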

14 More Replies
sobhan
by New Contributor II
  • 10006 Views
  • 3 replies
  • 0 kudos

How can I write a pandas DataFrame to Avro?

I am trying to write a pandas DataFrame to Avro format as below, but I get the following error: AttributeError: 'DataFrame' object has no attribute 'write'. I have tried several options as below: df_2018_pd.write.format("com.databricks.spark.avr...
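
pandas DataFrames have no .write attribute; that API belongs to Spark DataFrames. A minimal sketch of one workaround, converting to Spark first (df_2018_pd and the output path are assumptions):

    # pandas DataFrames have no .write; convert to a Spark DataFrame first.
    spark_df = spark.createDataFrame(df_2018_pd)

    # Spark 2.4+ has a built-in "avro" source; older runtimes need
    # .format("com.databricks.spark.avro") from the spark-avro package.
    spark_df.write.format("avro").save("/mnt/output/df_2018_avro")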

Latest Reply
Brayden_Cook
New Contributor II
  • 0 kudos

Very complicated question. I think you can get your answer on online sites. There are many online providers, like management writing services, whose experts provide online help for every type of research paper. I got a lot of assistance from them. No...

2 More Replies
AdityaDeshpande
by New Contributor II
  • 6175 Views
  • 2 replies
  • 0 kudos

How to maintain a primary key column in a Databricks Delta multi-cluster environment

I am trying to replicate the SQL-DB-like feature of maintaining primary keys in the Databricks Delta approach, where the data is written to blob storage such as ADLS2 or AWS S3. I want an auto-incremented primary key feature using Databricks Del...

Latest Reply
girivaratharaja
New Contributor III
  • 0 kudos

Hi @Aditya Deshpande. There is no PK locking mechanism in Delta. You can use the row_number() function on the df, save using Delta, and do a distinct() before the write.
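
A minimal sketch of that row_number() approach, assuming the asker's df (the table path and ordering column are hypothetical):

    from pyspark.sql import Window
    from pyspark.sql.functions import row_number

    # Deduplicate, then assign a sequential surrogate key; an unpartitioned
    # window funnels the numbering through one task, so this suits modest sizes.
    w = Window.orderBy("natural_key")  # hypothetical ordering column
    keyed = df.distinct().withColumn("pk", row_number().over(w))
    keyed.write.format("delta").mode("overwrite").save("/mnt/delta/my_table")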

1 More Replies
xxMathieuxxZara
by New Contributor
  • 8042 Views
  • 6 replies
  • 0 kudos

Parquet file merging or other optimisation tips

Hi, I need some guidelines for a performance issue with Parquet files. I am loading a set of Parquet files using df = sqlContext.parquetFile(folder_path). My Parquet folder has 6 subdivision keys. It was initially OK with a first sample of data...

Latest Reply
User16301467532
New Contributor II
  • 0 kudos

Having a large number of small files or folders can significantly deteriorate the performance of loading the data. The best way is to keep the folders/files merged so that each file is around 64MB in size. There are different ways to achieve this: your writ...
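
A sketch of one compaction approach along those lines (the paths and partition count are hypothetical and would need tuning so that total size / file count lands near the ~64MB target):

    folder_path = "/mnt/data/events"              # hypothetical source folder
    compacted_path = "/mnt/data/events_compacted" # hypothetical target folder

    # Rewrite the dataset as fewer, larger files.
    df = sqlContext.read.parquet(folder_path)
    df.repartition(200).write.mode("overwrite").parquet(compacted_path)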

5 More Replies
Maser_AZ
by New Contributor II
  • 18654 Views
  • 1 reply
  • 0 kudos

NameError: name 'col' is not defined

I'm executing the code below, using Python in a notebook, and it appears that the col() function is not recognized. I want to know whether the col() function belongs to any specific DataFrame library or Python library. I don't want to use pyspark...

Latest Reply
MOHAN_KUMARL_N
New Contributor II
  • 0 kudos

@mudassar45@gmail.com, as the documentation describes, col() returns a generic Column that is not yet associated with any DataFrame. Please refer to the code below. display(peopleDF.select("firstName").filter("firstName = 'An'"))
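
For the NameError itself, col() must be imported from pyspark.sql.functions before use; a minimal sketch, assuming the thread's peopleDF:

    # col() is not a Python builtin; it comes from pyspark.sql.functions.
    from pyspark.sql.functions import col

    # The generic Column binds to peopleDF only inside select()/filter().
    display(peopleDF.select(col("firstName")).filter(col("firstName") == "An"))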

AnilKumar
by New Contributor II
  • 12371 Views
  • 4 replies
  • 0 kudos

How to solve column header issues in a Spark SQL DataFrame

My code: val name = sc.textFile("/FileStore/tables/employeenames.csv") case class x(ID:String,Employee_name:String) val namePairRDD = name.map(_.split(",")).map(x => (x(0), x(1).trim.toString)).toDF("ID", "Employee_name") namePairRDD.createOrRe...

Latest Reply
evan_matthews1
New Contributor II
  • 0 kudos

Hi, I have the opposite issue. When I run an SQL query through the bulk download as per the standard prc fobasx notebook, the first row of data somehow gets attached to the column headers. When I import the csv file into R using read_csv, R thinks ...
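
In both directions, the usual fix is to tell the CSV reader explicitly whether the first line is a header, rather than splitting lines by hand; a sketch using the path from the question above:

    # header=True consumes the first line as column names instead of data;
    # inferSchema=True asks Spark to guess column types from the values.
    df = spark.read.csv("/FileStore/tables/employeenames.csv",
                        header=True, inferSchema=True)
    df.createOrReplaceTempView("employees")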

3 More Replies
mashaye
by New Contributor
  • 26893 Views
  • 6 replies
  • 2 kudos

How can I call a stored procedure in Spark SQL?

I have seen the following code: val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword" val df = sqlContext.read.format("jdbc").option("url", url).option("dbtable", "people").load() But I ...

Latest Reply
j500sut
New Contributor III
  • 2 kudos

This doesn't seem to be supported. There is an alternative, but it requires using pyodbc and adding it to your init script. Details can be found here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark I hav...
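
A sketch of that pyodbc route, assuming the ODBC driver was installed via an init script (server, database, credentials, and procedure name are all hypothetical):

    import pyodbc

    # autocommit=True lets the procedure's writes take effect without an
    # explicit commit on the connection.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;"
        "DATABASE=mydb;UID=myuser;PWD=mypassword",
        autocommit=True,
    )
    conn.cursor().execute("EXEC dbo.my_stored_procedure")
    conn.close()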

5 More Replies
tripplehay777
by New Contributor
  • 18978 Views
  • 1 reply
  • 0 kudos

How can I create a table from a CSV file whose first column contains data in dictionary (JSON-like) format?

I have a CSV file with the first column containing data in dictionary form (key: value). [see below] I tried to create a table by uploading the CSV file directly to Databricks, but the file can't be read. Is there a way for me to flatten or conver...

Latest Reply
MaxStruever
New Contributor II
  • 0 kudos

This is apparently a known issue; Databricks has its own CSV format handler which can handle this: https://github.com/databricks/spark-csv. SQL API: the CSV data source for Spark can infer data types. CREATE TABLE cars USING com.databricks.spark.csv OP...
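
If the spark-csv handler doesn't fit, another option is to read the column as a plain string and decode it with from_json; a sketch with hypothetical path, column, and key names:

    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType

    # Hypothetical schema for the dictionary-like first column.
    dict_schema = StructType([
        StructField("key", StringType()),
        StructField("value", StringType()),
    ])

    raw = spark.read.csv("/FileStore/tables/my_file.csv", header=True)
    parsed = raw.withColumn("parsed", from_json(col("dict_col"), dict_schema))
    parsed.select("parsed.key", "parsed.value").show()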

tonyp
by New Contributor II
  • 17935 Views
  • 1 reply
  • 1 kudos

How to pass Python variables to a shell script?

How do I pass Python variables to a shell script in a Databricks notebook? Can Python parameters be passed from the first cmd to the next %sh cmd?

Latest Reply
erikvisser1
New Contributor II
  • 1 kudos

I found the answer here: https://stackoverflow.com/questions/54662605/how-to-pass-a-python-variables-to-shell-script-in-azure-databricks-notebookbles Basically: %python import os; l = ['A','B','C','D']; os.environ['LIST'] = ' '.join(l); print(os.getenv('L...
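
A cleaned-up sketch of that pattern (two notebook cells):

    # Cell 1 (%python): export the list through an environment variable.
    import os
    l = ['A', 'B', 'C', 'D']
    os.environ['LIST'] = ' '.join(l)

    # Cell 2 (%sh): the shell cell runs on the driver and sees the variable:
    #   echo $LIST        # prints: A B C D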

EmilianoParizz1
by New Contributor
  • 11304 Views
  • 4 replies
  • 0 kudos

How to set the timestamp format when reading CSV

I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11. I'm trying to parse a CSV file with a custom timestamp format, but I don't know which datetime pattern format Spark uses. My CSV looks like this: Timestamp, Name, Va...

Latest Reply
wellington72019
New Contributor II
  • 0 kudos

# In Python: explicitly define the schema, then read in the CSV data using the schema and a defined timestamp format....
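
A sketch of that approach for Spark 2.4, matching the "Timestamp, Name, Value" layout above (the path and the actual pattern string are assumptions):

    from pyspark.sql.types import (StructType, StructField,
                                   TimestampType, StringType, DoubleType)

    schema = StructType([
        StructField("Timestamp", TimestampType()),
        StructField("Name", StringType()),
        StructField("Value", DoubleType()),
    ])

    # On Spark 2.4, timestampFormat takes java.text.SimpleDateFormat patterns.
    df = (spark.read
          .option("header", True)
          .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")  # hypothetical
          .schema(schema)
          .csv("/FileStore/tables/readings.csv"))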

3 More Replies
martinch
by New Contributor II
  • 22397 Views
  • 4 replies
  • 0 kudos

DROP TABLE IF EXISTS does not work

When I try to run the command spark.sql("DROP TABLE IF EXISTS table_to_drop") and the table does not exist, I get the following error: AnalysisException: "Table or view 'table_to_drop' not found in database 'null';;\nDropTableCommand `table_to_drop...

Latest Reply
StevenWilliams
New Contributor II
  • 0 kudos

I agree about this being a usability bug. The documentation clearly states that if the optional "IF EXISTS" flag is provided, the statement will do nothing. https://docs.databricks.com/spark/latest/spark-sql/language-manual/drop-table.html Drop Table ...
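
Until that is fixed, one workaround worth trying (an assumption, not confirmed in the thread) is to qualify the table with its database so the lookup no longer lands in database 'null':

    # Qualifying the table name sidesteps the broken current-database lookup.
    spark.sql("DROP TABLE IF EXISTS default.table_to_drop")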

3 More Replies
Dee
by New Contributor
  • 12534 Views
  • 2 replies
  • 0 kudos

Resolved! How to Change Schema of a Spark SQL

I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data in Databricks and wrote the commands as follows: df = sqlContext.sql("SELECT * FROM people_json") df.printSchema() from pyspark.sql.types import * data_schema =...
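
The usual pattern for overriding an inferred schema is to define a StructType and pass it to the reader; a minimal sketch with hypothetical field names and path:

    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType)

    # Hypothetical fields; adjust to the people_json layout.
    data_schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])

    # Re-read the JSON with the explicit schema instead of the inferred one.
    df = spark.read.schema(data_schema).json("/FileStore/tables/people.json")
    df.printSchema()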

Latest Reply
bhanu2448
New Contributor II
  • 0 kudos

http://www.bigdatainterview.com/

1 More Replies
GuidoPereyra_
by New Contributor II
  • 7965 Views
  • 2 replies
  • 0 kudos

Databricks Delta - UPDATE error

Hi, we got the following error when we tried to UPDATE a Delta table, running concurrent notebooks that all end with an update to the same table: "com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added matching 'true' by a ...

Latest Reply
GuidoPereyra_
New Contributor II
  • 0 kudos

Hi @matt@direction.consulting, I just found the following doc: https://docs.azuredatabricks.net/delta/isolation-level.html#set-the-isolation-level. In my case, I fixed it by partitioning the table, and I think that is the only way for concurrent updates in t...
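
A sketch of that fix: partition the table on the column that separates the concurrent writers, and confine each UPDATE's predicate to its own partition (all names and paths are hypothetical):

    # Write the table partitioned by the column each notebook owns.
    df.write.format("delta").partitionBy("region").save("/mnt/delta/events")

    # Each notebook updates only its own partition, so the concurrent
    # transactions touch disjoint files and stop conflicting.
    spark.sql("UPDATE delta.`/mnt/delta/events` SET status = 'done' "
              "WHERE region = 'emea'")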

1 More Replies
kali_tummala
by New Contributor II
  • 10982 Views
  • 5 replies
  • 0 kudos

Why is Databricks Spark faster than AWS EMR Spark?

https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html Hi all, just wondering why Databricks Spark is a lot faster on S3 compared with AWS EMR Spark. Both systems are on Spark version 2.4. Does Databricks have ...

Latest Reply
RafiKurlansik
Databricks Employee
  • 0 kudos

I think you can get some pretty good insight into the optimizations on Databricks here: https://docs.databricks.com/delta/delta-on-databricks.html Specifically, check out the sections on caching, Z-ordering, and join optimization. There's also a grea...
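
For example, Z-ordering and the IO cache are exposed directly on Databricks Delta; a sketch with a hypothetical table and column:

    # Compact the Delta table and co-locate rows by a frequently filtered column.
    spark.sql("OPTIMIZE events ZORDER BY (eventTime)")

    # The Delta IO cache keeps hot data on the workers' local SSDs.
    spark.conf.set("spark.databricks.io.cache.enabled", "true")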

4 More Replies
DanielAnderson
by New Contributor
  • 7036 Views
  • 1 reply
  • 0 kudos

"AmazonS3Exception: The bucket is in this region" error

I have read access to an S3 bucket in an AWS account that is not mine. For more than a year I've had a job successfully reading from that bucket using dbutils.fs.mount(...) and sqlContext.read.json(...). Recently the job started failing with the exc...

Latest Reply
Chandan
New Contributor II
  • 0 kudos

@andersource Looks like the bucket is in us-east-1, but you've configured your Amazon S3 cloud platform with us-west-2. Can you try configuring the client to use us-east-1? I hope it works for you. Thank you.
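
If the mount must stay, one workaround to try (an assumption, not confirmed in the thread) is pointing the S3A client at the bucket's actual region before reading (the bucket name is hypothetical):

    # Point the S3A filesystem at the bucket's home region (us-east-1).
    sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint",
                                      "s3.us-east-1.amazonaws.com")

    df = sqlContext.read.json("s3a://my-shared-bucket/path/")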

