Data Engineering

Forum Posts

WajdiFATHALLAH
by New Contributor
  • 8789 Views
  • 4 replies
  • 0 kudos

Writing a large parquet file (500 million rows / 1000 columns) to S3 takes too much time

Hello community, first let me introduce my use case: I daily receive 500 million rows, like so:
ID | Categories
1 | cat1, cat2, cat3, ..., catn
2 | cat1, catx, caty, ..., anothercategory
Input data: 50 compressed CSV files, each file 250 MB ...

Latest Reply
EliasHaydar
New Contributor II
  • 0 kudos

So you are basically creating an inverted index?

3 More Replies
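A minimal sketch of the usual first steps for a slow parquet write at this scale, offered only as a starting point; the bucket paths, CSV options, and partition count below are placeholders to tune, not details from the original thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the daily batch of compressed CSV files (path, header and separator are assumptions).
    df = spark.read.csv("s3a://my-bucket/input/*.csv.gz", header=True, sep="|")

    # Control the number of output files: too few starves the writers, too many
    # produces a flood of small S3 objects. A few hundred is a common starting point.
    (df.repartition(400)
       .write
       .mode("overwrite")
       .option("compression", "snappy")
       .parquet("s3a://my-bucket/output/categories/"))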
z160896
by New Contributor II
  • 6358 Views
  • 2 replies
  • 0 kudos

Why is Spark very slow with a large number of dataframe columns?

Scala Spark app: I have a dataset of 130x14000. I read it from a parquet file with SparkSession, then use it for a Spark ML Random Forest model (using a pipeline). It takes 7 hours to complete, while reading the parquet file takes about 1 minute. If I implemen...

Latest Reply
EliasHaydar
New Contributor II
  • 0 kudos

I've already answered a similar question on StackOverflow, so I'll repeat what I said there. The following may not solve your problem completely, but it should give you some pointers to start. The first problem that you are facing is the disproportio...

1 More Replies
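The reply above points at the imbalance between the column count (14000) and the row count (130). As a rough, hedged sketch of the usual pipeline shape for a dataset that wide, collapsing the raw columns into a single vector column up front so that later stages plan over one column rather than thousands; the column and label names are assumptions for illustration:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    # All columns except the label become features (names are hypothetical).
    feature_cols = [c for c in df.columns if c != "label"]

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    model = Pipeline(stages=[assembler, rf]).fit(df)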
vin007
by New Contributor
  • 5023 Views
  • 1 replies
  • 0 kudos

How to store a pyspark dataframe in S3 bucket.

I have a PySpark dataframe df containing 4 columns. How can I write this dataframe to an S3 bucket? I'm using PyCharm to execute the code. What packages need to be installed?

Latest Reply
AndrewSears
New Contributor III
  • 0 kudos

You shouldn't need any packages. You can mount the S3 bucket to the Databricks cluster: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3 or this: http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark...

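A minimal sketch of the mount-then-write pattern from the linked Databricks docs; the bucket name, mount point, and credentials below are placeholders, and in practice the keys should come from a secret store rather than the notebook:

    # Mount the bucket once per workspace (placeholder credentials and names).
    access_key = "<AWS_ACCESS_KEY>"
    secret_key = "<AWS_SECRET_KEY>".replace("/", "%2F")   # URL-encode any slashes in the key
    dbutils.fs.mount("s3a://%s:%s@my-bucket" % (access_key, secret_key), "/mnt/my-bucket")

    # Any DataFrame writer then works against the mount point.
    df.write.mode("overwrite").parquet("/mnt/my-bucket/output/df_parquet")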
SiddarthaPaturu
by New Contributor II
  • 22993 Views
  • 8 replies
  • 0 kudos

Resolved! Comparing two dataframes

How can we compare two data frames using PySpark? I need to validate my output against another dataset.

Latest Reply
sbharti
New Contributor II
  • 0 kudos

I think the best bet in such a case is to take an inner join (equivalent to an intersection) with a condition on those columns which necessarily need to have the same value in both dataframes. For example, let df1 and df2 be two dataframes; df1 has co...

7 More Replies
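A small PySpark sketch of that approach, together with subtract() for spotting rows that differ; the key column names are assumptions:

    # Rows whose key columns carry the same values in both dataframes (the "intersection").
    matched = df1.join(df2, on=["id", "value"], how="inner")

    # Rows present in one dataframe but not the other; empty results on both
    # sides mean the two outputs agree.
    only_in_df1 = df1.subtract(df2)
    only_in_df2 = df2.subtract(df1)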
mlm
by New Contributor
  • 7859 Views
  • 5 replies
  • 0 kudos

How to prevent spark-csv from adding quotes to JSON string in dataframe

I have a SQL dataframe with a column that has a JSON string in it (e.g. {"key":"value"}). When I use spark-csv to save the dataframe it changes the field values to be "{""key"":""value""}". Is there a way to turn that off?

Latest Reply
AshleyPan
New Contributor II
  • 0 kudos

Do the quote or escape options only work with write, not read? Our source files contain double quotes. We'd like to add a backslash (escape) in front of each double quote before converting the values from our dataframes to JSON outputs.

4 More Replies
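For reference, both quote and escape are accepted by Spark's CSV reader as well as its writer. A hedged sketch of the writer options that usually control the quoting behaviour described above (worth verifying on your Spark/spark-csv version), with a placeholder output path:

    # escapeQuotes=false keeps values containing quotes from being wrapped and doubled;
    # alternatively, escape="\\" switches the inner quotes to backslash escaping.
    (df.write
       .option("header", "true")
       .option("escapeQuotes", "false")
       .csv("/tmp/json_column_out"))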
bkr
by New Contributor
  • 5642 Views
  • 1 replies
  • 0 kudos

How to move files of the same extension in the Databricks file system?

I am facing a file-not-found exception when I am trying to move a file with * in DBFS. Both the source and destination directories are in DBFS. I have the source file named "test_sample.csv" available in a DBFS directory and I am using the command li...

Latest Reply
ricardo_portill
New Contributor III
  • 0 kudos

@bkr, you can reference the file name using dbutils and then pass this to the move command. Here's an example of this in Scala:
val fileNm = dbutils.fs.ls("/usr/krishna/sample").map(_.name).filter(r => r.startsWith("test"))(0)
val fileLoc = "dbfs:/...

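The same idea in Python, as a sketch: DBFS commands don't expand the * wildcard, so list the directory, filter the names, and move each match. The target directory below is an assumption.

    # Keep only the files whose name matches the pattern, then move them one by one.
    matches = [f for f in dbutils.fs.ls("/usr/krishna/sample") if f.name.startswith("test")]
    for f in matches:
        dbutils.fs.mv(f.path, "dbfs:/usr/krishna/archive/" + f.name)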
User16826991422
by Contributor
  • 5305 Views
  • 5 replies
  • 0 kudos

Resolved! How do I get a cartesian product of a huge dataset?

A cartesian product is a common operation to get the cross product of two tables. For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer - product combinations. Cartesian pr...

Latest Reply
Forum_Admin
Contributor
  • 0 kudos

Hi buddies, it is a great written piece, entirely defined; continue the good work constantly.

4 More Replies
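For reference, a minimal sketch of an explicit cartesian product in PySpark (Spark 2.1+); customers and catalog are placeholder dataframes, and with huge inputs the real constraint is that the result has rows(customers) × rows(catalog) rows:

    # Explicit cross join; prefer this over a join with no condition.
    combos = customers.crossJoin(catalog)

    # On Spark 2.x, an implicit cartesian product (a join without any condition)
    # also requires this setting, otherwise the query is rejected.
    spark.conf.set("spark.sql.crossJoin.enabled", "true")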
vanshikagupta
by New Contributor II
  • 6655 Views
  • 2 replies
  • 0 kudos

conversion of code from scala to python

Does Databricks Community Edition provide the Databricks ML visualization for PySpark, just the same as provided in this link for Scala? https://docs.azuredatabricks.net/_static/notebooks/decision-trees.html Also, please help me to convert this lin...

Latest Reply
miklos
Contributor
  • 0 kudos

Yes, CE supports it. It isn't supported in python yet.

1 More Replies
Mahesha999
by New Contributor II
  • 3569 Views
  • 3 replies
  • 0 kudos

Resolving NoClassDefFoundError: org/apache/spark/Logging exception

I was trying out the hbase-spark connector. To start with, I am trying out this code. My pom dependencies are:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version...

Latest Reply
User16301467518
New Contributor II
  • 0 kudos

The alpha of hbase-spark you're using depends on Spark 1.6 -- see hbase-spark/pom.xml:L33 -- so you'll probably have to stick with 1.6 if you want to use that published jar. For reasons I don't understand hbase-spark was removed in the last couple o...

2 More Replies
semihcandoken
by New Contributor
  • 13613 Views
  • 4 replies
  • 0 kudos

How to convert column type from str to date in sparksql when the format is not yyyy-mm-dd?

I imported a large CSV file into Databricks as a table. I am able to run SQL queries on it in a Databricks notebook. In my table, I have a column that contains date information in the mm/dd/yyyy format: 12/29/2015, 12/30/2015, etc. Databricks impo...

Latest Reply
ShubhamGupta187
New Contributor II
  • 0 kudos

@josephpconley would it be safe to cast a column that contains null values?

3 More Replies
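A short sketch of the usual conversion (Spark 2.2+), which also speaks to the null question above: rows that are null or fail to parse come back as null rather than raising an error. The column names below are assumptions.

    from pyspark.sql.functions import to_date, col

    # Parse the mm/dd/yyyy strings into a real date column.
    df2 = df.withColumn("event_date", to_date(col("date_str"), "MM/dd/yyyy"))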
Young_TackPark
by New Contributor
  • 12502 Views
  • 2 replies
  • 0 kudos

upload local files into DBFS

I am using Databricks Notebook Community Edition (2.36) and want to upload a local file into DBFS. Is there any simple Hadoop command like "hadoop fs -put ..."? Any help would be appreciated.

Latest Reply
sushrutt_12
New Contributor II
  • 0 kudos

Python 2.7:
import urllib
urllib.urlretrieve("https://github.com/sushrutt12/DataSets/blob/master/final_chris.zip", "/tmp/chris_data.zip")
dbutils.fs.mv("file:/tmp/chris_data.zip", "dbfs:/data/chris_data.zip")
Python 3.x:
import urllib.request
urllib.request.urlretrieve("https://github.com/sushrutt12/DataSets/blob/master/final_chris.zip", "/tmp/chris_data.zip")
dbutils.fs.mv("file:/tmp/chris_data.zip", "dbfs:/data/chris_data.zip")

1 More Replies
ArvindShyamsund
by New Contributor II
  • 5778 Views
  • 12 replies
  • 0 kudos

Resolved! Custom line separator

I see that https://github.com/apache/spark/pull/18581 will enable defining custom Line Separators for many sources, including CSV. Apart from waiting on this PR to make it into the main Databricks runtime, is there any other alternative to support d...

Latest Reply
DanielTomes
New Contributor II
  • 0 kudos

You can use newAPIHadoopFile. Scala:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.s...

11 More Replies
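The PySpark equivalent of that reply, as a sketch; the input path and record delimiter below are assumptions:

    # Read records split on a custom delimiter via the Hadoop input format.
    conf = {"textinputformat.record.delimiter": "|\n"}
    rdd = sc.newAPIHadoopFile(
        "/path/to/data.csv",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf)
    lines = rdd.map(lambda kv: kv[1])   # drop the byte-offset key, keep the text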
max522over
by New Contributor II
  • 12458 Views
  • 3 replies
  • 0 kudos

Resolved! I've set the partition mode to nonstrict in hive but spark is not seeing it

I've got a table I want to add some data to, and it's partitioned. I want to use dynamic partitioning, but I get this error: org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off ...

Latest Reply
max522over
New Contributor II
  • 0 kudos

I got it working. This was exactly what I needed. Thank you @Peyman Mohajerian

2 More Replies
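What typically resolves this is setting the property on the Spark session itself, not only in hive-site.xml, so the Spark SQL session actually sees it. A sketch, with a placeholder table name:

    # Enable dynamic partitioning for this Spark session before the insert.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    df.write.insertInto("my_db.my_partitioned_table")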
kkarthik
by New Contributor
  • 3368 Views
  • 1 replies
  • 0 kudos

I want to split a dataframe over a date range of 1 week, with each week's data in a different column.

DF:
Q  | Date (yyyy-mm-dd)
q1 | 2017-10-01
q2 | 2017-10-03
q1 | 2017-10-09
q3 | 2017-10-06
q2 | 2017-10-01
q1 | 2017-10-13
Q1 | 2017-10-02
Q3 | 2017-10-21
Q4 | 2017-10-17
Q5 | 2017-10-20
Q4 | 2017-10-31
Q2 | 2017-10-27
Q5 | 2017-10-01
Dataframe: ...

Latest Reply
User16857281974
Contributor
  • 0 kudos

It should just be a matter of applying the correct set of transformations. You can start by adding the week-of-year to each record with the command pyspark.sql.functions.weekofyear(..) and name it something like weekOfYear. See https://spark.apache.or...

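Continuing that reply as a rough sketch: tag each record with its week of the year, then pivot so each week becomes its own column. Collecting the dates per week is an assumption about the desired output.

    from pyspark.sql.functions import weekofyear, col, collect_list

    with_week = df.withColumn("weekOfYear", weekofyear(col("Date")))

    # One row per Q value, one column per week of the year.
    by_week = (with_week.groupBy("Q")
                        .pivot("weekOfYear")
                        .agg(collect_list("Date")))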
XinZodl
by New Contributor III
  • 9457 Views
  • 3 replies
  • 1 kudos

Resolved! How to parse a file with newline character, escaped with \ and not quoted

Hi! I am facing an issue when reading and parsing a CSV file. Some records have a newline symbol, "escaped" by a \, and the record is not quoted. The file might look like this: Line1field1;Line1field2.1 \ Line1field2.2;Line1field3; Line2FIeld1;...

Latest Reply
XinZodl
New Contributor III
  • 1 kudos

The solution is "sparkContext.wholeTextFiles".

2 More Replies
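A sketch of how that solution is typically applied to the file layout shown in the question; the input path is a placeholder, the ; separator follows the example above, and the exact escape sequence being replaced may need adjusting to the real data:

    # Read each file as a single string, undo the backslash-escaped newlines,
    # then split into records and fields.
    raw = sc.wholeTextFiles("/path/to/input")                 # (filename, content) pairs
    records = (raw.flatMap(lambda kv: kv[1].replace("\\\n", " ").splitlines())
                  .map(lambda line: line.split(";")))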