- 9946 Views
- 3 replies
- 0 kudos
Hi Team ,
Can you please help me implement a KNN classifier in PySpark using a distributed architecture for processing the dataset?
I also want to validate the KNN model against a test dataset.
I tried to use scikit learn but the program is runn...
Latest Reply
Hey, how about using the NEC Frovedis (https://github.com/frovedis/frovedis) framework for the same?
Refer: https://github.com/frovedis/frovedis/blob/master/src/foreign_if/python/examples/unsupervised_knn_demo.py
It works on a distributed framework (...
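If Frovedis is not an option, a brute-force KNN can also be sketched natively in PySpark with a crossJoin between the test and training DataFrames. Everything below (the DataFrame names, the feature columns x1/x2, and k) is an assumption, not from the thread:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# train/test are assumed DataFrames with columns: id, x1, x2, label (label only on train)
k = 5  # assumed number of neighbours

# distance of every test row to every training row (brute force, fully distributed)
pairs = (test.alias("t")
         .crossJoin(train.alias("r"))
         .withColumn("dist",
                     F.sqrt(F.pow(F.col("t.x1") - F.col("r.x1"), 2) +
                            F.pow(F.col("t.x2") - F.col("r.x2"), 2))))

# keep only the k nearest training rows per test row
by_test = Window.partitionBy(F.col("t.id")).orderBy(F.col("dist"))
neighbours = (pairs
              .withColumn("rank", F.row_number().over(by_test))
              .filter(F.col("rank") <= k))

# majority vote over the neighbours' labels gives the prediction
votes = neighbours.groupBy(F.col("t.id").alias("id"),
                           F.col("r.label").alias("prediction")).count()
by_votes = Window.partitionBy("id").orderBy(F.desc("count"))
predictions = (votes
               .withColumn("rn", F.row_number().over(by_votes))
               .filter(F.col("rn") == 1)
               .drop("rn", "count"))

Validation can then be done by joining predictions back to the held-out test labels and averaging the matches.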
2 More Replies
- 18412 Views
- 1 replies
- 0 kudos
import pandas as pd
import glob
path = "s3://somewhere/" # use your path
all_files = glob.glob(path + "/*.csv")
print(all_files)
li = []
for filename in all_files:
    dfi = pd.read_csv(filename, names=['acct_id', 'SOR_ID'], dtype={'acct_id':str,...
Latest Reply
Hi @zhaoxuan210, please go through the answer below: https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3
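On Databricks, an alternative to listing objects with boto3 is to let Spark expand the wildcard on S3 itself, since glob.glob only works on a local filesystem. A minimal sketch, reusing the path and column names from the question (everything else is assumed):

# Spark reads all matching CSV files from S3 directly, no local glob needed
df = (spark.read
      .option("header", "false")
      .schema("acct_id STRING, SOR_ID STRING")
      .csv("s3://somewhere/*.csv"))

# convert to pandas only if the combined result fits in driver memory
pdf = df.toPandas()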
- 24190 Views
- 5 replies
- 1 kudos
I am new to Databricks. I would like to know the difference between Interactive Clusters and Job Clusters. There is no official documentation on this yet.
4 More Replies
- 17175 Views
- 9 replies
- 1 kudos
Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.
Latest Reply
spark.sql("set spark.sql.parquet.compression.codec=gzip");
8 More Replies
- 7848 Views
- 1 replies
- 0 kudos
How do I convert the string 6/3/2019 5:06:00 AM to a timestamp in 24-hour format MM-dd-yyyy HH:mm:ss in PySpark?
Latest Reply
You would use a combination of the functions pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') (documentation) and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') (documentation).
from pysp...
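A minimal sketch of that approach for the exact format in the question (the column name ts_str is an assumption):

from pyspark.sql import functions as F

df2 = df.withColumn(
    "ts_24h",
    F.from_unixtime(
        F.unix_timestamp("ts_str", "M/d/yyyy h:mm:ss a"),  # parse "6/3/2019 5:06:00 AM"
        "MM-dd-yyyy HH:mm:ss"                              # reformat as a 24-hour string
    )
)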
- 5307 Views
- 1 replies
- 0 kudos
We are streaming JSON data from a Kafka source, but some column names contain a .(dot). Streaming JSON data:
df1 = df.selectExpr("CAST(value AS STRING)")
{"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...
Latest Reply
Hi @Mithu Wagh, you can use backticks to enclose the column name:
df.select("`col0.1`")
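A minimal sketch of both options, selecting with backticks and renaming the dot away (the column name col0.1 comes from the reply; everything else is an assumption):

# select a column whose name contains a literal dot by quoting it with backticks
df.select("`col0.1`").show()

# or rename it once so the rest of the pipeline does not need backticks
df2 = df.withColumnRenamed("col0.1", "col0_1")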
- 5313 Views
- 2 replies
- 0 kudos
Hello, I'm trying to save a DataFrame as parquet with SaveMode.Overwrite, with no success.
I minimized the code and reproduced the issue with the following two cells:
> case class MyClass(val fld1: Integer, val fld2: Integer)
>
> val lst1 = sc.paralle...
Latest Reply
results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("
1 More Replies
- 3604 Views
- 3 replies
- 0 kudos
I have created a Databricks workspace in Azure and a cluster for Python 3. I am creating a job using spark-submit parameters. How do I specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in ...
Latest Reply
Hi @Nandha Kumar, please go through the docs below to pass Python files as a job: https://docs.databricks.com/dev-tools/api/latest/jobs.html#sparkpythontask
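For reference, a sketch of what such a Jobs API payload might look like, expressed as a Python dict. The paths, cluster id, and parameter values are all assumptions, so check them against the linked docs; extra Python dependencies are attached as libraries rather than via --py-files:

job_settings = {
    "name": "my-python-job",
    "existing_cluster_id": "1234-567890-abcde123",        # assumed cluster id
    "spark_python_task": {
        "python_file": "dbfs:/jobs/main.py",              # entry-point script
        "parameters": ["--date", "2020-01-01"],
    },
    # additional Python files packaged and attached as libraries
    "libraries": [
        {"whl": "dbfs:/jobs/deps/helpers-0.1-py3-none-any.whl"},
        {"egg": "dbfs:/jobs/deps/more_helpers.egg"},
    ],
}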
2 More Replies
- 4702 Views
- 1 replies
- 0 kudos
I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.
from io import StringIO
import array
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(featur...
Latest Reply
Hi @tourist_on_road, please go through the Spark docs below: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
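A minimal sketch of parsing those fixed-length records with binaryRecords, assuming each 4106-byte record is a 10-byte product id followed by 1024 4-byte floats (the record length and path come from the question; the layout is an assumption):

import array

records = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(record):
    asin = record[:10].decode("utf-8")   # 10-byte product id
    feats = array.array("f")             # 4-byte floats
    feats.frombytes(record[10:])         # remaining 4096 bytes -> 1024 floats
    return asin, list(feats)

features = records.map(mapper)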
- 16658 Views
- 3 replies
- 0 kudos
We are using a Databricks 3-node cluster with 32 GB of memory. It works fine, but sometimes it throws the error: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.
Latest Reply
If your job fails, follow this. According to https://docs.databricks.com/jobs.html#jar-job-tips:
"Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run will be canceled and ma...
2 More Replies
by MikeK_ • New Contributor II
- 13160 Views
- 1 replies
- 0 kudos
Hi,
In a SQL notebook, using this link: https://docs.databricks.com/spark/latest/spark-sql/language-manual/set.html I managed to figure out how to set values and how to get the value.
SET my_val=10; -- saves the value 10 for key my_val
SET my_val; -- dis...
Latest Reply
Hi @Mike K.., you can do this with widgets and getArgument. Here's a small example of what that might look like: https://community.databricks.com/s/feed/0D53f00001HKHZfCAP
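A minimal sketch of that widget approach from a Python cell (the widget name and default value are assumptions); the value can then be read back with dbutils.widgets.get, or with getArgument in SQL cells:

# create a text widget that behaves like a settable notebook parameter
dbutils.widgets.text("my_val", "10")

# read the current value back (always returned as a string)
current = dbutils.widgets.get("my_val")
print(current)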
by kruhly • New Contributor II
- 27467 Views
- 12 replies
- 0 kudos
I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straightforward because the real data may have many columns or the column names may not be known. A simple example below:
llist = [(...
Latest Reply
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this:
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate you c...
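A minimal PySpark sketch of both options (the DataFrame and column names are assumptions): joining on the column name keeps a single copy of the key, while an expression join followed by drop removes the duplicate explicitly:

# joining on the column name (not an expression) yields only one "id" column
joined = a.join(b, on="id", how="inner")

# with an expression join both copies survive, so drop the one from b afterwards
joined2 = a.join(b, a["id"] == b["id"], "inner").drop(b["id"])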
11 More Replies
- 13014 Views
- 2 replies
- 0 kudos
Hello! I'm a rookie at Spark Scala; here is my problem. Thanks in advance for your help.
My input dataframe looks like this:
index  bucket  time   ap  station  rssi
0      1       00:00  1   1        -84.0
1      1       00:00  1   3        -67.0
2      1       00:00  1   4        -82.0
3      1       00:00  1   2        -68.0
4      1       00:00  ...
Latest Reply
Looping is not always necessary; I always use this foreach method, something like the following:
aps.collect().foreach(row => <do something>)
1 More Replies