Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

desai_n_3
by New Contributor II
  • 17126 Views
  • 6 replies
  • 0 kudos

Cannot Convert Column to Bool Error - When Converting dataframe column which is in string to date type in python

Hi All, I am trying to convert a dataframe column from string to date type, format yyyy-MM-dd. I have written a SQL query and stored the result in a dataframe. df3 = sqlContext.sql(sqlString2) df3.withColumn(df3['CalDay'],pd.to_datetime(df...

Latest Reply
JoshuaJames
New Contributor II
  • 0 kudos

Registered to post this, so forgive the formatting nightmare. This is a Python Databricks function that converts from string to datetime or date, utilising coalesce: from pyspark.sql.functions import coalesce, to_date def to_dat...

5 More Replies
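The error in the question most likely arises because pd.to_datetime is a pandas function: handing it a Spark Column makes pandas test the column's truth value, which Spark refuses with "Cannot convert column into bool". In PySpark the usual fix is withColumn('CalDay', to_date(col('CalDay'), 'yyyy-MM-dd')). A small plain-Python sketch of the same parse behaviour (function name and sample values are illustrative):

```python
from datetime import datetime

def parse_day(s):
    """Plain-Python analogue of PySpark's to_date(col, 'yyyy-MM-dd'):
    returns a date, or None (Spark would give NULL) if the string doesn't parse."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None

print(parse_day("2019-03-01"))  # 2019-03-01
```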
dbansal
by New Contributor
  • 15872 Views
  • 1 reply
  • 0 kudos

How can I add jars ("spark.jars") to pyspark notebook?

I want to add a few custom jars to the spark conf. Typically they would be submitted along with the spark-submit command but in Databricks notebook, the spark session is already initialized. So, I want to set the jars in "spark.jars" property in the...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @dbansal, install the libraries/jars while initialising the cluster. Please go through the documentation on the same below: https://docs.databricks.com/libraries.html#upload-a-jar-python-egg-or-python-wheel

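Since the notebook's Spark session is already initialized, setting "spark.jars" at runtime comes too late; the jar has to be on the cluster before the JVM starts. One way, besides the library upload the reply links to, is the cluster's Spark config (the dbfs: path below is a hypothetical example):

```
spark.jars dbfs:/FileStore/jars/my-custom-library.jar
```

Attaching the jar as a cluster library achieves the same effect.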
asher
by New Contributor II
  • 9826 Views
  • 1 reply
  • 0 kudos

List all files in a Blob Container

I am trying to find a way to list all files, and related file sizes, in all folders and all sub folders. I guess these are called blobs, in the Databricks world. Anyway, I can easily list all files, and related file sizes, in one single folder, but ...

Latest Reply
asher
New Contributor II
  • 0 kudos

from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='your_acct_name', account_key='your_acct_key')
mylist = []
generator = block_blob_service.list_blobs('rawdata')
for blob in generator:
    mylist.append(...

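The list_blobs call above already recurses through every virtual folder, so one pass over the generator collects all files with their sizes. A sketch of that collection step, with a tiny stub standing in for the Azure SDK objects (real blobs expose .name and .properties.content_length; names and sizes below are invented):

```python
class _Props:
    def __init__(self, size):
        self.content_length = size

class _Blob:
    """Stub with the same attributes an Azure blob item exposes."""
    def __init__(self, name, size):
        self.name, self.properties = name, _Props(size)

def collect_blobs(generator):
    # list_blobs walks all virtual sub-folders, so one pass sees every file
    return [(b.name, b.properties.content_length) for b in generator]

listing = [_Blob("raw/2020/01/data.csv", 1024), _Blob("raw/2020/02/data.csv", 2048)]
print(collect_blobs(listing))
```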
ammobear
by New Contributor III
  • 82515 Views
  • 11 replies
  • 6 kudos

Resolved! How do I get the current cluster id?

I am adding Application Insights telemetry to my Databricks jobs and would like to include the cluster ID of the job run. How can I access the cluster id at run time? The requirement is that my job can programmatically retrieve the cluster id to in...

Latest Reply
EricBellet
New Contributor III
  • 6 kudos

I fixed it; it should be "'$DB_CLUSTER_ID'"

10 More Replies
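A minimal sketch of reading the id at run time, assuming the DB_CLUSTER_ID environment variable that the accepted answer relies on (Databricks also exposes it in the Spark conf as spark.databricks.clusterUsageTags.clusterId; outside Databricks the fallback value is returned):

```python
import os

def current_cluster_id(default="unknown"):
    # Databricks sets DB_CLUSTER_ID in the job's environment; the Spark-side
    # equivalent is spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    return os.environ.get("DB_CLUSTER_ID", default)

print(current_cluster_id())
```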
LaurentThiebaud
by New Contributor
  • 6893 Views
  • 1 reply
  • 0 kudos

Sort within a groupBy with dataframe

Using Spark DataFrame, e.g. myDf .filter(col("timestamp").gt(15000)) .groupBy("groupingKey") .agg(collect_list("aDoubleValue")) I want the collect_list to return the result, but ordered according to "timestamp". i.e. I want the GroupBy results...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Laurent Thiebaud, please use the below format to sort within a groupBy:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))

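Note that sort_array orders the collected values themselves; to order by a different column such as "timestamp", the usual trick is to collect (timestamp, value) pairs, sort them, then drop the key. A plain-Python sketch of that idea (data invented for illustration):

```python
from collections import defaultdict

# Rows of (groupingKey, timestamp, aDoubleValue)
rows = [("a", 3, 1.0), ("a", 1, 2.0), ("b", 2, 3.0)]

grouped = defaultdict(list)
for key, ts, val in rows:
    grouped[key].append((ts, val))  # keep the sort key with the value

# Sort each group's pairs by timestamp, then strip the timestamp off
result = {k: [v for _, v in sorted(pairs)] for k, pairs in grouped.items()}
print(result)  # {'a': [2.0, 1.0], 'b': [3.0]}
```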
RohiniMathur
by New Contributor II
  • 19142 Views
  • 1 reply
  • 0 kudos

Resolved! Length Value of a column in pyspark

Hello, I am using pyspark 2.12. After creating a dataframe, can we measure the length value for each row? For example, I am measuring the length of a value in column 2. Input file |TYCO|1303| |EMC |120989| |VOLVO|102329| |BMW|130157| |FORD|004| Output ...

Latest Reply
lee
Contributor
  • 0 kudos

You can use the length function for this:
from pyspark.sql.functions import length
mock_data = [('TYCO', '1303'),('EMC', '120989'), ('VOLVO', '102329'),('BMW', '130157'),('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])
df2 = d...

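A plain-Python sketch of what the length() call computes per row, using the mock data from the reply:

```python
# Each tuple is (col1, col2); append len(col2) just as length(col2) would
mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'),
             ('BMW', '130157'), ('FORD', '004')]
with_len = [(c1, c2, len(c2)) for c1, c2 in mock_data]
print(with_len)
```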
RohiniMathur
by New Contributor II
  • 24411 Views
  • 4 replies
  • 0 kudos

Removing non-ascii and special character in pyspark

I am running Spark 2.4.4 with Python 2.7, and my IDE is PyCharm. The input file (.csv) contains encoded values in some columns, as below. File data looks COL1,COL2,COL3,COL4 CM, 503004, (d$όνυ$F|'.h*Λ!ψμ=(.ξ; ,.ʽ|!3-2-704 The output i am trying ...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Rohini Mathur, use the below code on the column containing non-ascii and special characters: df['column_name'].str.encode('ascii', 'ignore').str.decode('ascii')

3 More Replies
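The same encode/decode round-trip works on a single Python string, which is handy for checking the behaviour before applying it to a whole column:

```python
def strip_non_ascii(s):
    # Same idea as pandas' .str.encode('ascii', 'ignore').str.decode('ascii'):
    # bytes that can't be represented in ASCII are silently dropped
    return s.encode('ascii', 'ignore').decode('ascii')

print(strip_non_ascii("CM, 503004, (d$όνυ$F"))  # non-ASCII characters removed
```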
akj2784
by New Contributor II
  • 9526 Views
  • 5 replies
  • 0 kudos

How to create a dataframe with the files from S3 bucket

I have connected my S3 bucket from databricks, using the following command:
import urllib
import urllib.parse
ACCESS_KEY = "Test"
SECRET_KEY = "Test"
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "Test"
MOUNT_NAME = "...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @akj2784, please go through the Databricks documentation on working with files in S3: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-s3-buckets-with-dbfs

4 More Replies
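One common pitfall in the snippet from the question: a secret key containing '/' or '+' must be URL-encoded before it is embedded in the s3a URI, which is exactly what the urllib.parse.quote call is for. A sketch of assembling the mount source (the credentials and bucket below are fake placeholders):

```python
import urllib.parse

ACCESS_KEY = "AKIAEXAMPLE"
SECRET_KEY = "abc/def+ghi"  # '/' and '+' would break the URI unencoded
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, safe="")
AWS_BUCKET_NAME = "my-bucket"

source = f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@{AWS_BUCKET_NAME}"
print(source)
# In Databricks you would then mount and read:
#   dbutils.fs.mount(source, "/mnt/my-mount")
#   df = spark.read.csv("/mnt/my-mount/path/to/file.csv", header=True)
```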
vaio
by New Contributor II
  • 8397 Views
  • 6 replies
  • 0 kudos

Convert String to Timestamp

I have a dataset with one column of string type ('2014/12/31 18:00:36'). How can I convert it to timestamp type with PySpark?

Latest Reply
gideon
New Contributor II
  • 0 kudos

hope you dont mind if i ask you to elaborate further for a shaper understanding? see my basketball court layout at https://www.recreationtipsy.com/basketball-court/

5 More Replies
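The format string is the key detail; a quick plain-Python check of it (in PySpark the equivalent is to_timestamp(col('ts'), 'yyyy/MM/dd HH:mm:ss')):

```python
from datetime import datetime

# Parse the example value from the question with the matching format string
ts = datetime.strptime("2014/12/31 18:00:36", "%Y/%m/%d %H:%M:%S")
print(ts.isoformat())  # 2014-12-31T18:00:36
```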
Raymond_Hu
by New Contributor
  • 16152 Views
  • 1 reply
  • 0 kudos

ConnectException error

I'm using PySpark on Databricks and trying to pivot a 27753444 x 3 matrix. If I do it in a Spark DataFrame: df = df.groupBy("A").pivot("B").avg("C") it takes forever (I canceled it after 2 hours). If I convert it to a pandas dataframe and then pivo...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Raymond_Hu, this means that the driver crashed because of an OOM (out of memory) exception, and after that it's not able to establish a new connection with the driver. Please try the below options: try increasing driver-side memory and then retry. You ca...

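One mitigation worth noting alongside the memory advice: pivot() without an explicit value list first runs an extra job to discover every distinct value of "B", and a high-cardinality "B" yields one output column per value, which can overwhelm the driver. Passing the values explicitly, df.groupBy("A").pivot("B", values).avg("C"), skips the discovery pass and bounds the column count. A plain-Python sketch of the reshape itself (one value per cell; a real pivot would average duplicates; data invented):

```python
rows = [("a1", "x", 1.0), ("a1", "y", 3.0), ("a2", "x", 5.0)]  # (A, B, C)
values = ["x", "y"]  # explicit pivot values, as in pivot("B", values)

table = {}
for a, b, c in rows:
    # one output row per "A", one column per declared value of "B"
    table.setdefault(a, {v: None for v in values})[b] = c
print(table)
```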
msj50
by New Contributor III
  • 14133 Views
  • 10 replies
  • 1 kudos

Spark Running Really slow - help required

My company urgently needs help; we are having severe performance problems with Spark and will have to switch to a different solution if we don't get to the bottom of it. We are on 1.3.1, using Spark SQL, ORC files with partitions, and caching in me...

Latest Reply
Marco
New Contributor II
  • 1 kudos

In my project, the following solutions were launched one by one to improve performance: to store middle-level results, use a memory cache instead of HDFS (like Ignite Cache); only use Spark for complicated data aggregation; for simple results, just do it on d...

9 More Replies
Yogi
by New Contributor III
  • 15657 Views
  • 15 replies
  • 0 kudos

Resolved! Can we pass Databricks output to Azure function body?

Hi, can anyone help me with Databricks and Azure Functions? I'm trying to pass Databricks JSON output to an Azure Function body in an ADF job; is it possible? If yes, how? If no, what other alternative is there to do the same?

Latest Reply
AbhishekNarain_
New Contributor III
  • 0 kudos

You can now pass values back to ADF from a notebook. @Yogi There is a size limit, though: if you are passing a dataset larger than 2 MB, rather write it to storage and consume it directly with Azure Functions. You can pass the file path/ refe...

14 More Replies
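A sketch of the notebook side, staying under the ~2 MB limit mentioned above. The dbutils call only exists inside Databricks, so it is shown as a comment, and the payload fields are invented for illustration:

```python
import json

# Build the JSON payload the notebook hands back to ADF; keep it small,
# or write the data to storage and return only its path instead
payload = json.dumps({"status": "succeeded", "rows_written": 1250})

# In a Databricks notebook you would end with:
#   dbutils.notebook.exit(payload)
# ADF then reads it at @activity('NotebookActivity').output.runOutput
print(payload)
```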
sobhan
by New Contributor II
  • 10024 Views
  • 3 replies
  • 0 kudos

How can I write Pandas dataframe into avro

I am trying to write Pandas core dataframe into avro format as below. But I get the following error: AttributeError: 'DataFrame' object has no attribute 'write' I have tried several options as below: df_2018_pd.write.format("com.databricks.spark.avr...

Latest Reply
Brayden_Cook
New Contributor II
  • 0 kudos

Very complicated question. I think you can get your answer on online sites. There are many online providers like managements writing solutions whose experts provide online help for every type of research paper. I got a lot of assistance from them. No...

2 More Replies
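The AttributeError in the question comes from .write being a Spark DataFrame API that a pandas DataFrame simply doesn't have. A sketch of verifying that and of getting the data into a writable shape (the Spark route in the comment assumes the spark-avro package is available on the cluster; the sample data is invented):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# .write exists on Spark DataFrames, not pandas ones - hence the error
assert not hasattr(df, "write")

# Hand the same data to Spark before writing Avro, e.g. (Spark side):
#   spark.createDataFrame(df).write.format("avro").save("/tmp/out")
# or serialize the records yourself with a library such as fastavro
records = df.to_dict(orient="records")
print(records)
```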
AdityaDeshpande
by New Contributor II
  • 6192 Views
  • 2 replies
  • 0 kudos

How to maintain Primary Key Column in Databricks Delta Multi Cluster environment

I am trying to replicate the SQL-DB-like feature of maintaining primary keys in the Databricks Delta approach, where the data is written to blob storage such as ADLS2 or AWS S3. I want an auto-incremented primary key feature using Databricks Del...

Latest Reply
girivaratharaja
New Contributor III
  • 0 kudos

Hi @Aditya Deshpande, there is no locking mechanism for PKs in Delta. You can use the row_number() function on the df and save using Delta, and do a distinct() before the write.

1 More Replies
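A plain-Python sketch of the suggested approach, distinct() followed by a row_number()-style surrogate key (in Spark the key would come from row_number() over a Window, applied after df.distinct() and before the Delta write; data invented):

```python
rows = [("alice",), ("bob",), ("alice",)]

# distinct() analogue: drop duplicates, sort for a deterministic ordering
deduped = sorted(set(rows))

# row_number() analogue: assign a dense 1-based surrogate key
keyed = [(i + 1, *r) for i, r in enumerate(deduped)]
print(keyed)  # [(1, 'alice'), (2, 'bob')]
```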
xxMathieuxxZara
by New Contributor
  • 8067 Views
  • 6 replies
  • 0 kudos

Parquet file merging or other optimisation tips

Hi, I need some guidelines for a performance issue with Parquet files. I am loading a set of parquet files using df = sqlContext.parquetFile( folder_path ). My parquet folder has six subdivision keys. It was initially ok with a first sample of data...

Latest Reply
User16301467532
New Contributor II
  • 0 kudos

Having a large number of small files or folders can significantly deteriorate the performance of loading the data. The best way is to keep the folders/files merged so that each file is around 64 MB in size. There are different ways to achieve this: your writ...

5 More Replies
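A back-of-envelope helper for that ~64 MB target: compute how many partitions to ask for, then df.repartition(n).write.parquet(path) on the Spark side (the helper and its name are illustrative):

```python
TARGET_BYTES = 64 * 1024 * 1024  # aim for ~64 MB per output file

def target_partitions(total_bytes):
    # ceiling division: at least one partition, enough that each lands near 64 MB
    return max(1, -(-total_bytes // TARGET_BYTES))

print(target_partitions(10 * 1024**3))  # 10 GB of data -> 160
```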
