Data Engineering

Forum Posts

RohiniMathur
by New Contributor II
  • 11142 Views
  • 1 reply
  • 0 kudos

Resolved! Length value of a column in PySpark

Hello, I am using PySpark 2.12. After creating a DataFrame, can we measure the length of the value in each row? For example, I am measuring the length of the value in column 2. Input file: |TYCO|1303| |EMC |120989| |VOLVO|102329| |BMW|130157| |FORD|004| Output ...

Latest Reply
lee
Contributor
  • 0 kudos

You can use the length function for this:

from pyspark.sql.functions import length
mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'), ('BMW', '130157'), ('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])
df2 = d...
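The reply above is cut off; a complete, runnable version of the same approach is sketched below, assuming the truncated line used withColumn, which is one common way to finish it (the column name col2_length is illustrative):

from pyspark.sql.functions import length

# spark is the session provided automatically in Databricks notebooks
mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'),
             ('BMW', '130157'), ('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])

# length() counts characters, so '004' yields 3 and '120989' yields 6
df2 = df.withColumn('col2_length', length(df['col2']))
df2.show()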

RohiniMathur
by New Contributor II
  • 17730 Views
  • 4 replies
  • 0 kudos

Removing non-ASCII and special characters in PySpark

I am running Spark 2.4.4 with Python 2.7, and my IDE is PyCharm. The input file (.csv) contains encoded values in some columns, like those shown below. The file data looks like: COL1,COL2,COL3,COL4 CM, 503004, (d$όνυ$F|'.h*Λ!ψμ=(.ξ; ,.ʽ|!3-2-704 The output I am trying ...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Rohini Mathur, use the code below on the column containing non-ASCII and special characters:

df['column_name'].str.encode('ascii', 'ignore').str.decode('ascii')
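Note that the reply above uses pandas Series string methods, which work on a pandas DataFrame but not on a Spark one. A hedged PySpark sketch of the same idea uses regexp_replace to strip everything outside the printable ASCII range (the column name COL4 comes from the question's sample header; the file path is a placeholder):

from pyspark.sql.functions import regexp_replace, col

df = spark.read.csv('/path/to/input.csv', header=True)  # path is illustrative

# Drop every character outside printable ASCII; widen or narrow the
# character class depending on which "special characters" should go.
df_clean = df.withColumn('COL4', regexp_replace(col('COL4'), r'[^\x20-\x7E]', ''))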

3 More Replies
akj2784
by New Contributor II
  • 5967 Views
  • 5 replies
  • 0 kudos

How to create a DataFrame from the files in an S3 bucket

I have connected my S3 bucket from Databricks, using the following command: import urllib import urllib.parse ACCESS_KEY = "Test" SECRET_KEY = "Test" ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "") AWS_BUCKET_NAME = "Test" MOUNT_NAME = "...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @akj2784, please go through the Databricks documentation on working with files in S3: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-s3-buckets-with-dbfs
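For later readers, a minimal sketch of the mount-and-read pattern the linked page describes (the bucket, mount name, and file name are placeholders; dbutils is available in Databricks notebooks, and instance profiles or secret scopes are preferable to hard-coded keys):

import urllib.parse

ACCESS_KEY = "..."   # placeholder
SECRET_KEY = "..."   # placeholder
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")

dbutils.fs.mount(
    "s3a://%s:%s@my-bucket" % (ACCESS_KEY, ENCODED_SECRET_KEY),  # bucket is illustrative
    "/mnt/my-mount",
)

# Once mounted, the bucket's files can be read like any DBFS path:
df = spark.read.csv("/mnt/my-mount/data.csv", header=True)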

4 More Replies
MudassarA
by New Contributor II
  • 13491 Views
  • 1 reply
  • 0 kudos

NameError: name 'col' is not defined

I'm executing the code below using Python in a notebook, and it appears that the col() function is not being recognized. I want to know whether the col() function belongs to any specific DataFrame library or Python library. I don't want to use pyspark...

Latest Reply
MOHAN_KUMARL_N
New Contributor II
  • 0 kudos

@mudassar45@gmail.com As the documentation describes, col() returns a generic column that is not yet associated with a DataFrame. Please refer to the code below: display(peopleDF.select("firstName").filter("firstName = 'An'"))
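The usual cause of this NameError is simply a missing import: col() lives in pyspark.sql.functions. A minimal sketch (peopleDF is the DataFrame from the reply above):

from pyspark.sql.functions import col

# col() builds a Column expression from a name; it is only resolved against
# a concrete DataFrame when used inside select/filter/etc.
display(peopleDF.select(col("firstName")).filter(col("firstName") == "An"))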

Dee
by New Contributor
  • 7378 Views
  • 2 replies
  • 0 kudos

Resolved! How to change the schema of a Spark SQL DataFrame

I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data in Databricks and wrote the commands as follows: df = sqlContext.sql("SELECT * FROM people_json") df.printSchema() from pyspark.sql.types import * data_schema =...

Latest Reply
bhanu2448
New Contributor II
  • 0 kudos

http://www.bigdatainterview.com/
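Since the reply above is only a link, here is a hedged sketch of the usual answer to the question asked: define an explicit schema with pyspark.sql.types and apply it when reading the JSON, instead of keeping the inferred one (the field names and file path are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_schema = StructType([
    StructField("name", StringType(), True),   # field names are illustrative
    StructField("age", IntegerType(), True),
])

# Re-read the JSON with the explicit schema rather than the inferred one
df = spark.read.schema(data_schema).json("/path/to/people.json")
df.printSchema()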

1 More Replies
kali_tummala
by New Contributor II
  • 5844 Views
  • 5 replies
  • 0 kudos

Why is Databricks Spark faster than AWS EMR Spark?

https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html Hi all, just wondering why Databricks Spark is a lot faster on S3 compared with AWS EMR Spark, when both systems are on Spark version 2.4. Does Databricks have ...

Latest Reply
RafiKurlansik
New Contributor III
  • 0 kudos

I think you can get some pretty good insight into the optimizations on Databricks here: https://docs.databricks.com/delta/delta-on-databricks.html Specifically, check out the sections on caching, z-ordering, and join optimization. There's also a grea...
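As a concrete illustration of one optimization mentioned in the reply, Delta tables on Databricks can be compacted and co-located by a frequently filtered column. The table and column names below are placeholders, and OPTIMIZE/ZORDER is Databricks-specific SQL:

# Compact small files and cluster the data by eventDate
spark.sql("OPTIMIZE events ZORDER BY (eventDate)")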

4 More Replies
Pascalvan_Belle
by New Contributor
  • 6374 Views
  • 1 reply
  • 0 kudos

How to create a surrogate key sequence which I can use in SCD cases?

Hi Community, I would like to know if there is an option to create an integer sequence which persists even if the cluster is shut down. My target is to use this integer value as a surrogate key to join different tables or do slowly changing dimensio...

Latest Reply
girivaratharaja
New Contributor III
  • 0 kudos

Hi @pascalvanbellen, there is no concept of FK, PK, or SK in Spark, but Databricks Delta automatically takes care of SCD-type scenarios: https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html#slowly-changing-data-scd-type-2 ...
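One hedged sketch of the surrogate-key part of the question: re-read the current maximum key from the persisted Delta table, so the sequence survives cluster restarts, and offset new ids from it. The table and column names are illustrative, and new_rows is assumed to be the DataFrame of incoming records:

from pyspark.sql.functions import monotonically_increasing_id, lit

dim = spark.table("dim_customer")
max_key = dim.agg({"sk": "max"}).collect()[0][0] or 0

# monotonically_increasing_id() is unique but not consecutive, so the
# resulting keys are sparse; that is usually acceptable for surrogate keys.
new_rows_with_sk = new_rows.withColumn(
    "sk", monotonically_increasing_id() + lit(max_key + 1)
)
new_rows_with_sk.write.format("delta").mode("append").saveAsTable("dim_customer")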

PranjalThapar
by New Contributor
  • 5529 Views
  • 4 replies
  • 0 kudos

Splitting Date into Year, Month and Day, with inconsistent delimiters

I am trying to split my Date column, which is currently a string type, into 3 columns: Year, Month, and Day. I use (PySpark): split_date = pyspark.sql.functions.split(df['Date'], '-') df = df.withColumn('Year', split_date.getItem(0)) df = df.wit...

Latest Reply
youssefassouli
New Contributor II
  • 0 kudos

thank you so much, that was helpful
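For later readers, a hedged sketch of the "inconsistent delimiters" part of the original question: split on a character class instead of a single literal, so '-', '/', and '.' all work (the exact delimiter set is an assumption; extend the class as needed):

from pyspark.sql.functions import split, col

split_date = split(col('Date'), '[-/.]')  # any of -, /, . as the separator

df = (df.withColumn('Year', split_date.getItem(0))
        .withColumn('Month', split_date.getItem(1))
        .withColumn('Day', split_date.getItem(2)))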

3 More Replies
senthilkumar
by New Contributor
  • 14118 Views
  • 1 reply
  • 0 kudos

How does a filter condition work in a Spark DataFrame?

I have a table in HBase with 1 billion records. I want to filter the records based on a certain condition (by date). For example: Dataframe.filter(col(date) === todayDate). Will the filter be applied after all records from the table are loaded into me...

Latest Reply
muk1
New Contributor II
  • 0 kudos

Hello @senthil kumar, to pass external values to the filter (or where) transformations you can use the "lit" function in the following way: Dataframe.filter(col(date) == lit(todayDate)). Don't know if that helps. Be careful with the schema inferred by th...
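A runnable version of that suggestion, plus a way to check where the filter actually runs (the column name and date value are illustrative, and whether the predicate is pushed down to HBase depends on the connector):

import datetime
from pyspark.sql.functions import col, lit

today = datetime.date.today().isoformat()
filtered = df.filter(col("date") == lit(today))

# explain() prints the physical plan; a PushedFilters entry means the source
# applies the predicate itself instead of loading everything first.
filtered.explain()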

Tamara
by New Contributor III
  • 8814 Views
  • 8 replies
  • 1 kudos

Resolved! Can I connect to an MS SQL Server table from a Databricks account?

I'd like to access a table on an MS SQL Server (Microsoft). Is it possible from Databricks? To my understanding, the syntax is something like this (in a SQL Notebook): CREATE TEMPORARY TABLE jdbcTable USING org.apache.spark.sql.jdbc OPTIONS ( url...

Latest Reply
JohnSmith091
New Contributor II
  • 1 kudos

Thanks for the trick that you have shared with us. I am really amazed to use this informational post. If you are facing MacBook error like MacBook Pro won't turn on black screen then click the link.
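Setting the reply above aside, the question itself is answered by Spark's JDBC reader. A hedged Python sketch (host, database, table, and credentials are placeholders, and the Microsoft SQL Server JDBC driver must be available on the cluster):

jdbc_url = "jdbc:sqlserver://myhost:1433;database=mydb"  # placeholder

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.my_table")   # placeholder table
      .option("user", "username")          # use a secret scope in practice
      .option("password", "password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())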

7 More Replies
SatheesshChinnu
by New Contributor III
  • 8149 Views
  • 4 replies
  • 0 kudos

Resolved! Error: TransportResponseHandler: Still have 1 requests outstanding when connection closed, occurring only on large datasets

I am getting the error below only with a large dataset (i.e. 15 TB compressed); if my dataset is small (1 TB) I am not getting this error. It looks like it fails at the shuffle stage. The approximate number of mappers is 150,000. Spark config: spark.sql.warehouse.dir hdfs:...

Latest Reply
parikshitbhoyar
New Contributor II
  • 0 kudos

@Satheessh Chinnusamy, how did you solve the above issue?
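The thread does not record the fix, but errors like this on very large shuffles are often mitigated by raising network and shuffle-retry settings. A hedged sketch with illustrative starting values, not a confirmed fix for this thread (these are core settings, so they belong in the cluster's Spark config or the session builder, not in runtime spark.conf.set calls):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.network.timeout", "600s")        # default is 120s
         .config("spark.shuffle.io.maxRetries", "10")
         .config("spark.shuffle.io.retryWait", "30s")
         .getOrCreate())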

3 More Replies
WajdiFATHALLAH
by New Contributor
  • 8893 Views
  • 4 replies
  • 0 kudos

Writing a large Parquet file (500 million rows / 1000 columns) to S3 takes too much time

Hello community, first let me introduce my use case: I receive around 500 million rows daily, like so: ID | Categories 1 | cat1, cat2, cat3, ..., catn 2 | cat1, catx, caty, ..., anothercategory Input data: 50 compressed CSV files, each file is 250 MB ...

Latest Reply
EliasHaydar
New Contributor II
  • 0 kudos

So you are basically creating an inverted index?
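On the slow-write half of the question, one common lever is to control the number and size of output files before writing. A hedged sketch (the partition count and path are illustrative; aim for output files of roughly 100 MB to 1 GB each):

# Repartition so each task writes one reasonably sized file, then write
# Parquet with snappy compression (Spark's default Parquet codec).
(df.repartition(400)
   .write.mode("overwrite")
   .option("compression", "snappy")
   .parquet("s3a://my-bucket/output/"))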

3 More Replies
z160896
by New Contributor II
  • 6410 Views
  • 2 replies
  • 0 kudos

Why is Spark very slow with a large number of DataFrame columns?

Scala Spark app: I have a dataset of 130 x 14,000. I read from a Parquet file with SparkSession, then use it for a Spark ML random forest model (using a pipeline). It takes 7 hours to complete, while reading the Parquet file takes about 1 minute. If I implemen...

Latest Reply
EliasHaydar
New Contributor II
  • 0 kudos

I've already answered a similar question on StackOverflow, so I'll repeat what I said there. The following may not solve your problem completely, but it should give you some pointers to start. The first problem that you are facing is the disproportio...
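One standard pointer for wide data in Spark ML, in case the pipeline is not already doing it: carry the 14,000 features as a single vector column rather than as 14,000 individual columns, which keeps the query plan small. A hedged sketch (the label column name is an assumption):

from pyspark.ml.feature import VectorAssembler

feature_cols = [c for c in df.columns if c != "label"]  # assumes numeric features

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df).select("label", "features")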

1 More Replies
Mahesha999
by New Contributor II
  • 3602 Views
  • 3 replies
  • 0 kudos

Resolving NoClassDefFoundError: org/apache/spark/Logging exception

I was trying out the hbase-spark connector. To start with, I am trying out this code. My pom dependencies are: <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId> <version...

Latest Reply
User16301467518
New Contributor II
  • 0 kudos

The alpha of hbase-spark you're using depends on Spark 1.6 -- see hbase-spark/pom.xml:L33 -- so you'll probably have to stick with 1.6 if you want to use that published jar. For reasons I don't understand, hbase-spark was removed in the last couple o...

2 More Replies
kkarthik
by New Contributor
  • 3386 Views
  • 1 reply
  • 0 kudos

I want to split a DataFrame by one-week date ranges, with each week's data in a different column.

DF:
Q    Date (yyyy-mm-dd)
q1   2017-10-01
q2   2017-10-03
q1   2017-10-09
q3   2017-10-06
q2   2017-10-01
q1   2017-10-13
Q1   2017-10-02
Q3   2017-10-21
Q4   2017-10-17
Q5   2017-10-20
Q4   2017-10-31
Q2   2017-10-27
Q5   2017-10-01
Dataframe: ...

Latest Reply
User16857281974
Contributor
  • 0 kudos

It should just be a matter of applying the correct set of transformations: you can start by adding the week-of-year to each record with pyspark.sql.functions.weekofyear(..) and name it something like weekOfYear. See https://spark.apache.or...
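Completing that outline as a hedged sketch: add the week-of-year, then pivot so each week becomes its own column. Collecting the Q values per week is one assumption about the desired output, and Date is assumed to be castable to a date type:

from pyspark.sql.functions import weekofyear, col, collect_list

df2 = df.withColumn("weekOfYear", weekofyear(col("Date")))

# One column per week number, each holding the list of Q values in that week
result = df2.groupBy().pivot("weekOfYear").agg(collect_list("Q"))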
