Data Engineering
Forum Posts

Muthu145
by New Contributor
  • 9946 Views
  • 3 replies
  • 0 kudos

KNN classifier on Spark

Hi Team, can you please help me implement a KNN classifier in PySpark using a distributed architecture and process the dataset? I also want to validate the KNN model with the testing dataset. I tried to use scikit-learn but the program is runn...

Latest Reply
SouravSaha
New Contributor II
  • 0 kudos

Hey, how about using the NEC Frovedis (https://github.com/frovedis/frovedis) framework for the same? Refer: https://github.com/frovedis/frovedis/blob/master/src/foreign_if/python/examples/unsupervised_knn_demo.py It works on a distributed framework (...

2 More Replies
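For a purely Spark-based alternative to the Frovedis suggestion above, approximate KNN can be sketched with Spark ML's locality-sensitive hashing. This is a minimal, hedged sketch, not the Frovedis approach: the toy data, column names, and majority-vote step are illustrative only, and BucketedRandomProjectionLSH gives approximate neighbours.

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy training set; real features would come from a feature pipeline.
train = spark.createDataFrame(
    [(0, Vectors.dense([0.0, 1.0]), "a"),
     (1, Vectors.dense([1.0, 1.0]), "a"),
     (2, Vectors.dense([9.0, 9.0]), "b")],
    ["id", "features", "label"])
test_point = Vectors.dense([0.5, 1.1])

# Fit an LSH model and query the k nearest neighbours of one test point.
lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = lsh.fit(train)
neighbours = model.approxNearestNeighbors(train, test_point, numNearestNeighbors=2)

# A majority vote over the neighbours' labels gives the KNN-style prediction.
prediction = (neighbours.groupBy("label").count()
              .orderBy("count", ascending=False).first()["label"])
print(prediction)

Each test point is queried separately here; for a whole test set, approxSimilarityJoin is the usual distributed route, so treat this as a starting point rather than a drop-in replacement for scikit-learn's KNeighborsClassifier.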
zhaoxuan210
by New Contributor
  • 18412 Views
  • 1 replies
  • 0 kudos

How can I read all the files in a folder on S3 into several pandas dataframes?

import pandas as pd
import glob

path = "s3://somewhere/"  # use your path
all_files = glob.glob(path + "/*.csv")
print(all_files)

li = []
for filename in all_files:
    dfi = pd.read_csv(filename, names=['acct_id', 'SOR_ID'], dtype={'acct_id': str,...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @zhaoxuan210, please go through the answer below: https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3
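The glob module only matches local paths, so all_files in the snippet above stays empty for an s3:// prefix. A minimal sketch along the lines of the linked Stack Overflow answer, assuming boto3 credentials are configured; the bucket name and key prefix are placeholders:

import boto3
import pandas as pd
from io import BytesIO

s3 = boto3.client("s3")
bucket = "somewhere"   # placeholder bucket name
prefix = ""            # placeholder key prefix

# List every CSV key under the prefix and read each one into its own DataFrame.
frames = []
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if key.endswith(".csv"):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        frames.append(pd.read_csv(BytesIO(body),
                                  names=["acct_id", "SOR_ID"],
                                  dtype={"acct_id": str}))

# Optionally stack them into one DataFrame.
combined = pd.concat(frames, ignore_index=True)

Alternatively, with the s3fs package installed, pd.read_csv can read s3:// URLs directly, one file at a time.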

Van-DuyetLe
by New Contributor
  • 24190 Views
  • 5 replies
  • 1 kudos

What's the difference between Interactive Clusters and Job Cluster?

I am new to Databricks. I would like to know what the difference is between Interactive Clusters and Job Clusters. There is no official documentation on this yet.

4 More Replies
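For what it is worth (the distinction is not spelled out in the replies shown here): an interactive, or all-purpose, cluster is created manually, stays up until terminated, and is shared by notebooks and jobs that reference it, while a job cluster is declared inside the job definition and is created and torn down automatically for each run. A hedged sketch of the two Jobs API payload shapes, with placeholder identifiers and node types:

# Job that reuses an already-running interactive (all-purpose) cluster.
job_on_interactive_cluster = {
    "name": "example-job-interactive",
    "existing_cluster_id": "1234-567890-abcde123",   # placeholder cluster id
    "notebook_task": {"notebook_path": "/Users/someone@example.com/my_notebook"},
}

# Job that gets its own job cluster, created for the run and terminated afterwards.
job_on_job_cluster = {
    "name": "example-job-cluster",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",   # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder node type
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Users/someone@example.com/my_notebook"},
}

Either dictionary would be sent to the Jobs API create endpoint; job clusters are generally the choice for scheduled production runs, interactive clusters for development.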
User16301467532
by New Contributor II
  • 17175 Views
  • 9 replies
  • 1 kudos

How can I change the parquet compression algorithm from gzip to something else?

Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

Latest Reply
ZhenZeng
New Contributor II
  • 1 kudos

spark.sql("set spark.sql.parquet.compression.codec=gzip");

8 More Replies
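Note that the reply above sets the codec back to gzip; to move off gzip, the same property can point at snappy (or lz4 on recent Spark versions), either session-wide or per write. A minimal sketch, with placeholder output paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).toDF("id")   # stand-in DataFrame

# Session-wide default for all Parquet writes from this session.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.mode("overwrite").parquet("/tmp/example_snappy_parquet")

# Or per write, overriding the session setting for this output only.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/example_snappy_parquet_2")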
Venkata_Krishna
by New Contributor
  • 7848 Views
  • 1 replies
  • 0 kudos

convert string dataframe column MM/dd/yyyy hh:mm:ss AM/PM to timestamp MM-dd-yyyy hh:mm:ss

How to convert the string 6/3/2019 5:06:00 AM to a timestamp in 24-hour format MM-dd-yyyy HH:mm:ss in Python Spark.

Latest Reply
lee
Contributor
  • 0 kudos

You would use a combination of the functions: pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') (documentation) and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') (documentation). from pysp...
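A minimal runnable sketch of that idea, plus the shorter to_timestamp/date_format pair; the column name ts_str is illustrative, and the result is rendered back as a string because a timestamp column has no display format of its own:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("6/3/2019 5:06:00 AM",)], ["ts_str"])

# Parse the 12-hour AM/PM string into a real timestamp column.
parsed = df.withColumn("ts", F.to_timestamp("ts_str", "M/d/yyyy h:mm:ss a"))

# Render it back as a 24-hour MM-dd-yyyy HH:mm:ss string.
result = parsed.withColumn("ts_24h", F.date_format("ts", "MM-dd-yyyy HH:mm:ss"))
result.show(truncate=False)

# Equivalent with the functions named in the reply:
# F.from_unixtime(F.unix_timestamp("ts_str", "M/d/yyyy h:mm:ss a"), "MM-dd-yyyy HH:mm:ss")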

MithuWagh
by New Contributor
  • 5307 Views
  • 1 replies
  • 0 kudos

How to deal with a column name containing a .(dot) in a PySpark DataFrame?

We are streaming data from a Kafka source with JSON, but in some columns we are getting a .(dot) in the column names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")
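A small self-contained sketch of both options, selecting the dotted name with backticks and renaming it once so the dot is never treated as a struct accessor again (the column name col0.1 mirrors the reply and is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "x")], ["id", "col0.1"])   # column name containing a dot

# Backticks let you reference the dotted name directly.
df.select("`col0.1`").show()

# Renaming removes the problem for all downstream code.
clean = df.withColumnRenamed("col0.1", "col0_1")
clean.select("col0_1").show()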

KrisMusial
by New Contributor
  • 5313 Views
  • 2 replies
  • 0 kudos

Resolved! Saving to parquet with SaveMode.Overwrite throws exception

Hello, I'm trying to save a DataFrame in Parquet with SaveMode.Overwrite, with no success. I minimized the code and reproduced the issue with the following two cells:
> case class MyClass(val fld1: Integer, val fld2: Integer)
>
> val lst1 = sc.paralle...

Latest Reply
Guru421421
New Contributor II
  • 0 kudos

results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("

1 More Replies
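For reference, the overwrite itself is a one-liner; one common cause of exceptions like the one in this thread is overwriting the very path the DataFrame is still lazily reading from. A minimal sketch with placeholder paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4)], ["fld1", "fld2"])

# Plain overwrite to a path works as expected.
df.write.mode("overwrite").parquet("/tmp/myclass_parquet")

# Reading a path and overwriting that same path in one lineage may fail with
# "Cannot overwrite a path that is also being read from"; write elsewhere or
# materialize the data first.
read_back = spark.read.parquet("/tmp/myclass_parquet")
read_back.write.mode("overwrite").parquet("/tmp/myclass_parquet_copy")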
NandhaKumar
by New Contributor II
  • 3604 Views
  • 3 replies
  • 0 kudos

How to specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in --py-files are present in dbfs:.

I have created a Databricks workspace in Azure and a cluster for Python 3. I am creating a job using spark-submit parameters. How do I specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in ...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Nandha Kumar, please go through the docs below to pass Python files to a job: https://docs.databricks.com/dev-tools/api/latest/jobs.html#sparkpythontask

2 More Replies
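A hedged sketch of what the linked Jobs API docs describe, with placeholder DBFS paths: a spark_submit_task accepts --py-files exactly as the spark-submit CLI does (comma-separated, no spaces), while a spark_python_task runs the main file directly and picks up extra modules as job libraries.

# Placeholder paths; all files live in DBFS as in the question.
main_file = "dbfs:/scripts/main.py"
deps = "dbfs:/scripts/helpers.py,dbfs:/scripts/utils.py"

# Option 1: spark_submit_task mirrors the CLI, so --py-files carries the list.
spark_submit_job = {
    "name": "pyfiles-example",
    "new_cluster": {"spark_version": "7.3.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2},
    "spark_submit_task": {"parameters": ["--py-files", deps, main_file]},
}

# Option 2: spark_python_task plus libraries attached to the job (egg/whl).
spark_python_job = {
    "name": "pyfiles-example-2",
    "new_cluster": {"spark_version": "7.3.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2},
    "spark_python_task": {"python_file": main_file},
    "libraries": [{"whl": "dbfs:/scripts/mypackage-0.1-py3-none-any.whl"}],
}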
cfregly
by Contributor
  • 2765 Views
  • 4 replies
  • 0 kudos
Latest Reply
GeethGovindSrin
New Contributor II
  • 0 kudos

@cfregly: For DataFrames, you can use the following code to use groupBy without aggregations: Df.groupBy(Df["column_name"]).agg({})

3 More Replies
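A self-contained sketch of that trick; with an empty aggregation dict the result should keep only the grouping column, one row per distinct value, much like distinct() on that column (column names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["column_name", "value"])

# groupBy with an empty agg dict returns one row per distinct group key.
df.groupBy(df["column_name"]).agg({}).show()

# Equivalent result via distinct on the same column.
df.select("column_name").distinct().show()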
tourist_on_road
by New Contributor
  • 4702 Views
  • 1 replies
  • 0 kudos

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.
from io import StringIO
import array
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(featur...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @tourist_on_road, please go through the Spark docs below: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
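A minimal sketch of the binaryRecords route; the record layout assumed here (a 10-byte ASCII id followed by 1,024 little-endian 32-bit floats, which adds up to the 4106-byte record length in the question) is purely illustrative and should be checked against the dataset's documentation:

import struct
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

RECORD_LEN = 4106                      # bytes per record, as in the question
ID_LEN = 10                            # assumed: 10-byte ASCII identifier
N_FLOATS = (RECORD_LEN - ID_LEN) // 4  # assumed: remainder is 32-bit floats

def parse_record(raw):
    # raw is a bytes object of exactly RECORD_LEN bytes.
    item_id = raw[:ID_LEN].decode("ascii", errors="ignore").strip()
    feats = struct.unpack("<%df" % N_FLOATS, raw[ID_LEN:])
    return item_id, list(feats)

records = sc.binaryRecords("s3://bucket/image_features.b", RECORD_LEN)
parsed = records.map(parse_record)
print(parsed.take(1))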

naveenreddy1
by New Contributor II
  • 16658 Views
  • 3 replies
  • 0 kudos

Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace

We are using a Databricks 3-node cluster with 32 GB of memory. It works fine, but sometimes it throws the error: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.

Latest Reply
RodrigoDe_Freit
New Contributor II
  • 0 kudos

If your job fails, follow this. According to https://docs.databricks.com/jobs.html#jar-job-tips: "Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run will be canceled and ma...

2 More Replies
MikeK_
by New Contributor II
  • 13160 Views
  • 1 replies
  • 0 kudos

Resolved! SQL variables in a notebook

Hi, in a SQL notebook, using this link: https://docs.databricks.com/spark/latest/spark-sql/language-manual/set.html I managed to figure out how to set values and how to get the value.
SET my_val=10; // saves the value 10 for key my_val
SET my_val; // dis...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Mike K., you can do this with widgets and getArgument. Here's a small example of what that might look like: https://community.databricks.com/s/feed/0D53f00001HKHZfCAP
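A minimal sketch of the widget approach from a Python cell; dbutils and getArgument exist only inside Databricks notebooks, and the widget name my_val mirrors the question:

# Create a text widget with a default value, then read it back.
dbutils.widgets.text("my_val", "10")
my_val = dbutils.widgets.get("my_val")
print(my_val)

# In a SQL cell of the same notebook the widget can be referenced like this:
#   SELECT * FROM some_table WHERE some_col = getArgument("my_val")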

kruhly
by New Contributor II
  • 27467 Views
  • 12 replies
  • 0 kudos

Resolved! Is there a better method to join two dataframes and not have a duplicated column?

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straightforward because the real data may have many columns or the column names may not be known. A simple example below: llist = [(...

Latest Reply
TejuNC
New Contributor II
  • 0 kudos

This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this: SELECT * FROM a JOIN b ON joinExprs. If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate you c...

11 More Replies
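A minimal sketch of the two patterns from the reply: joining on the column name (so Spark keeps a single copy of the key) versus joining on an expression and dropping one side afterwards. Column names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "bob"), (2, "alice")], ["id", "name"])
right = spark.createDataFrame([(1, 100), (2, 200)], ["id", "score"])

# Joining on the column name (or a list of names) keeps only one "id" column.
left.join(right, on="id", how="inner").show()

# Joining on an expression keeps both "id" columns; drop one explicitly.
left.join(right, left["id"] == right["id"], "inner").drop(right["id"]).show()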
Pierrek20
by New Contributor
  • 13014 Views
  • 2 replies
  • 0 kudos

How to loop over spark dataframe with scala ?

Hello! I'm a rookie with Spark Scala; here is my problem (thanks in advance for your help). My input dataframe looks like this:
index bucket time  ap station rssi
0     1      00:00 1  1       -84.0
1     1      00:00 1  3       -67.0
2     1      00:00 1  4       -82.0
3     1      00:00 1  2       -68.0
4     1      00:00...

Latest Reply
Eve
New Contributor III
  • 0 kudos

Looping is not always necessary, I always use this foreach method, something like the following: aps.collect().foreach(row => <do something>)

1 More Replies