Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

z160896
by New Contributor II
  • 7847 Views
  • 2 replies
  • 0 kudos

why spark very slow with large number of dataframe columns

Scala Spark app: I have a 130x14000 dataset. I read it from a Parquet file with SparkSession, then use it to train a Spark ML Random Forest model (using a pipeline). It takes 7 hours to complete! Reading the Parquet file takes only about 1 minute. If I implemen...

Latest Reply
EliasHaydar
New Contributor II
  • 0 kudos

I've already answered a similar question on StackOverflow, so I'll repeat what I said there. The following may not solve your problem completely, but it should give you some pointers to start with. The first problem that you are facing is the disproportio...

1 More Replies
vin007
by New Contributor
  • 7049 Views
  • 1 replies
  • 0 kudos

How to store a pyspark dataframe in S3 bucket.

I have a PySpark DataFrame df containing 4 columns. How can I write this DataFrame to an S3 bucket? I'm using PyCharm to execute the code. Which packages need to be installed?

Latest Reply
AndrewSears
New Contributor III
  • 0 kudos

You shouldn't need any packages. You can mount an S3 bucket to a Databricks cluster: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3 or this http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark...

SiddarthaPaturu
by New Contributor II
  • 28167 Views
  • 8 replies
  • 1 kudos

Resolved! Comparing two dataframes

How can we compare two DataFrames using PySpark? I need to validate my output against another dataset.

Latest Reply
sbharti
New Contributor II
  • 1 kudos

I think the best bet in such a case is to take an inner join (equivalent to an intersection), with a condition on those columns that must have the same value in both DataFrames. For example, let df1 and df2 be two DataFrames. df1 has co...

7 More Replies
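
The inner-join idea from the latest reply in this thread can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: rows are dicts, `id` and `amount` are hypothetical column names, and `compare_on_key` is a made-up helper, not a PySpark API.

```python
def compare_on_key(rows1, rows2, key, value_columns):
    """Join two row lists on `key` and report where they disagree.

    Returns (mismatches, only_in_1, only_in_2) as sorted lists of keys.
    Assumes `key` is unique within each input.
    """
    index1 = {row[key]: row for row in rows1}
    index2 = {row[key]: row for row in rows2}

    # Inner join: keys present on both sides.
    common = index1.keys() & index2.keys()
    mismatches = sorted(
        k for k in common
        if any(index1[k][c] != index2[k][c] for c in value_columns)
    )

    # Keys that the inner join drops, reported separately.
    only_in_1 = sorted(index1.keys() - index2.keys())
    only_in_2 = sorted(index2.keys() - index1.keys())
    return mismatches, only_in_1, only_in_2


# Hypothetical sample data.
expected = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
actual = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]

mismatches, only_expected, only_actual = compare_on_key(expected, actual, "id", ["amount"])
print(mismatches, only_expected, only_actual)
```

Note that an inner join alone hides rows that exist on only one side, which is why the sketch reports them separately.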
mlm
by New Contributor
  • 11340 Views
  • 5 replies
  • 0 kudos

How to prevent spark-csv from adding quotes to JSON string in dataframe

I have a SQL DataFrame with a column that has a JSON string in it (e.g. {"key":"value"}). When I use spark-csv to save the DataFrame, it changes the field values to "{""key"":""value""}". Is there a way to turn that off?

Latest Reply
AshleyPan
New Contributor II
  • 0 kudos

Do the quote or escape options only work with "write" and not "read"? Our source files contain double quotes. We'd like to add a backslash (escape) in front of each double quote before converting the values from our DataFrames to JSON outputs.

4 More Replies
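
The doubled quotes described in this thread are ordinary CSV escaping: a field that contains the quote character gets wrapped in quotes, and each inner quote is doubled. The sketch below reproduces that with Python's standard `csv` module (not spark-csv) and shows one way around it, assuming a tab delimiter that never occurs in the data. The sample row is made up.

```python
import csv
import io

row = ["1", '{"key":"value"}']

# Default behaviour: the field contains the quote character, so it is
# wrapped in quotes and every inner quote is doubled.
default_out = io.StringIO()
csv.writer(default_out, lineterminator="\n").writerow(row)

# Quoting switched off entirely (no quote character at all) and a tab
# as delimiter: the JSON string is written unchanged. This only works
# if no field contains a tab or a newline.
raw_out = io.StringIO()
csv.writer(
    raw_out,
    delimiter="\t",
    quoting=csv.QUOTE_NONE,
    quotechar=None,
    lineterminator="\n",
).writerow(row)

print(default_out.getvalue(), end="")
print(raw_out.getvalue(), end="")
```

Whether the equivalent spark-csv settings behave the same way is an assumption to check against the quote and escape options mentioned in the latest reply.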
bkr
by New Contributor
  • 6164 Views
  • 1 replies
  • 0 kudos

How to move files of same extension in databricks files system?

I am getting a file-not-found exception when I try to move a file using * in DBFS. Both the source and destination directories are in DBFS. I have a source file named "test_sample.csv" available in a DBFS directory and I am using the command li...

Latest Reply
ricardo_portill
New Contributor III
  • 0 kudos

@bkr, you can reference the file name using dbutils and then pass this to the move command. Here's an example for this in Scala: val fileNm = dbutils.fs.ls("/usr/krishna/sample").map(_.name).filter(r => r.startsWith("test"))(0) val fileLoc = "dbfs:/...
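
The same list-filter-move pattern, sketched in Python against a local temporary directory rather than DBFS. The local filesystem stands in for DBFS here; on Databricks, `dbutils.fs.ls` and `dbutils.fs.mv` from the reply above would take the place of `iterdir` and `shutil.move`. `move_by_suffix` and every file name except `test_sample.csv` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def move_by_suffix(src_dir, dst_dir, suffix):
    """Move every file in src_dir whose name ends with suffix into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    # List first, then filter by name, instead of passing "*" to the move call.
    for path in sorted(src.iterdir()):
        if path.is_file() and path.name.endswith(suffix):
            shutil.move(str(path), str(dst / path.name))
            moved.append(path.name)
    return moved


# Demo in a throwaway directory.
root = Path(tempfile.mkdtemp())
(root / "in").mkdir()
for name in ["test_sample.csv", "other.csv", "notes.txt"]:
    (root / "in" / name).write_text("x")

moved = move_by_suffix(root / "in", root / "out", ".csv")
print(moved)
```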

rlgarris
by Databricks Employee
  • 7306 Views
  • 5 replies
  • 0 kudos

Resolved! How do I get a cartesian product of a huge dataset?

A cartesian product is a common operation to get the cross product of two tables. For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer - product combinations. Cartesian pr...

Latest Reply
Forum_Admin
Contributor
  • 0 kudos

Hi buddies, it is great written piece entirely defined, continue the good work constantly.

4 More Replies
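
To make the customer-product example from this thread concrete, here is a tiny cross product in plain Python. The names are made up, and `itertools.product` is only a stand-in for a distributed cross join. The point is how the output size grows: every customer is paired with every product.

```python
from itertools import product

customers = ["alice", "bob", "carol"]  # hypothetical customers
catalog = ["widget", "gadget"]         # hypothetical products

# Every (customer, product) combination: len(customers) * len(catalog) pairs.
pairs = list(product(customers, catalog))

print(len(pairs))
print(pairs[:3])
```

With 3 customers and 2 products this gives 6 pairs. The same multiplication applied to millions of rows on each side is what makes the operation expensive on a huge dataset.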
vanshikagupta
by New Contributor II
  • 7425 Views
  • 2 replies
  • 0 kudos

conversion of code from scala to python

Does Databricks Community Edition provide the Databricks ML visualization for PySpark, the same as the one provided for Scala in this link? https://docs.azuredatabricks.net/_static/notebooks/decision-trees.html Also, please help me to convert this lin...

Latest Reply
miklos
Contributor
  • 0 kudos

Yes, CE supports it. It isn't supported in Python yet.

1 More Replies
Mahesha999
by New Contributor II
  • 4748 Views
  • 3 replies
  • 0 kudos

Resolving NoClassDefFoundError: org/apache/spark/Logging exception

I was trying out hbase-spark connector. To start with, I am trying out this code. My pom dependencies are: <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId> <version...

Latest Reply
User16301467518
New Contributor II
  • 0 kudos

The alpha of hbase-spark you're using depends on Spark 1.6 -- see hbase-spark/pom.xml:L33 -- so you'll probably have to stick with 1.6 if you want to use that published jar. For reasons I don't understand hbase-spark was removed in the last couple o...

2 More Replies
semihcandoken
by New Contributor
  • 15952 Views
  • 4 replies
  • 0 kudos

How to convert column type from str to date in sparksql when the format is not yyyy-mm-dd?

I imported a large CSV file into Databricks as a table. I am able to run SQL queries on it in a Databricks notebook. In my table, I have a column that contains date information in the mm/dd/yyyy format: 12/29/2015, 12/30/2015, etc. Databricks impo...

Latest Reply
ShubhamGupta187
New Contributor II
  • 0 kudos

@josephpconley would it be safe to cast a column that contains null values?

3 More Replies
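
On the null question raised in the latest reply, here is a plain-Python sketch of the mm/dd/yyyy conversion with the null handling made explicit. It does not use Spark SQL; `parse_us_date` is a hypothetical helper, and treating blank strings like nulls is an assumption of this sketch.

```python
from datetime import date, datetime


def parse_us_date(text):
    """Parse 'mm/dd/yyyy' into a date; None or blank input yields None."""
    if text is None or not text.strip():
        return None
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


print(parse_us_date("12/29/2015"))
print(parse_us_date(None))
```

A helper like this could be wrapped in a UDF, but a built-in Spark SQL date function with an explicit format pattern is likely the better fit where the runtime provides one.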
Young_TackPark
by New Contributor
  • 17327 Views
  • 2 replies
  • 0 kudos

upload local files into DBFS

I am using Databricks Notebook Community Edition (2.36) and want to upload a local file into DBFS. Are there any simple Hadoop commands like "hadoop fs -put ..."? Any help would be appreciated.

Latest Reply
sushrutt_12
New Contributor II
  • 0 kudos

Python 2.7: import urllib urllib.urlretrieve("https://github.com/sushrutt12/DataSets/blob/master/final_chris.zip","/tmp/chris_data.zip") dbutils.fs.mv("file:/tmp/chris_data.zip", "dbfs:/data/chris_data.zip") Python 3.x: import urllib.request url...

1 More Replies
ArvindShyamsund
by New Contributor II
  • 9008 Views
  • 12 replies
  • 0 kudos

Resolved! Custom line separator

I see that https://github.com/apache/spark/pull/18581 will enable defining custom Line Separators for many sources, including CSV. Apart from waiting on this PR to make it into the main Databricks runtime, is there any other alternative to support d...

Latest Reply
DanielTomes
New Contributor II
  • 0 kudos

You can use newAPIHadoopFile SCALA import org.apache.hadoop.io.LongWritable import org.apache.hadoop.io.Text import org.apache.hadoop.conf.Configuration import org.apache.hadoop.mapreduce.lib.input.TextInputFormat val conf = new Configuration conf.s...

11 More Replies
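
For readers who want to see what splitting on a custom record separator involves, here is a plain-Python sketch. It is not the Hadoop-based approach from the reply above, only an illustration of the idea: read the input in chunks and cut records wherever the separator appears, even if it straddles a chunk boundary. `iter_records` and the `|~|` separator are made up for this example.

```python
import io


def iter_records(stream, separator, chunk_size=4096):
    """Yield records from a text stream that are delimited by `separator`."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(separator)
        # The last piece may be an incomplete record (or half a separator),
        # so keep it in the buffer until more data arrives.
        buffer = parts.pop()
        yield from parts
    if buffer:
        yield buffer


# A tiny chunk size forces the separator to span chunk boundaries.
records = list(iter_records(io.StringIO("a|~|b|~|c"), "|~|", chunk_size=2))
print(records)
```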
max522over
by New Contributor II
  • 15038 Views
  • 3 replies
  • 0 kudos

Resolved! I've set the partition mode to nonstrict in hive but spark is not seeing it

I've got a table I want to add some data to, and it's partitioned. I want to use dynamic partitioning, but I get this error: org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off ...

Latest Reply
max522over
New Contributor II
  • 0 kudos

I got it working. This was exactly what I needed. Thank you, @Peyman Mohajerian.

2 More Replies
kkarthik
by New Contributor
  • 4265 Views
  • 1 replies
  • 0 kudos

I want to split a dataframe with date range 1 week, with each week data in different column.

DF:
Q   Date (yyyy-mm-dd)
q1  2017-10-01
q2  2017-10-03
q1  2017-10-09
q3  2017-10-06
q2  2017-10-01
q1  2017-10-13
Q1  2017-10-02
Q3  2017-10-21
Q4  2017-10-17
Q5  2017-10-20
Q4  2017-10-31
Q2  2017-10-27
Q5  2017-10-01
Dataframe: ...

Latest Reply
User16857281974
Contributor
  • 0 kudos

It should just be a matter of applying the correct set of transformations: You can start by adding the week-of-year to each record with the command pyspark.sql.functions.weekofyear(..) and naming it something like weekOfYear. See https://spark.apache.or...
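
The first step of that recipe, tagging each record with its week number and grouping on it, looks like this in plain Python on the first few rows from the question. Python's ISO week number is used as a stand-in; whether it matches `weekofyear` on every edge case is an assumption, and the grouping into a dict is only illustrative.

```python
from collections import defaultdict
from datetime import date

# First six rows of the sample data from the question.
rows = [
    ("q1", "2017-10-01"),
    ("q2", "2017-10-03"),
    ("q1", "2017-10-09"),
    ("q3", "2017-10-06"),
    ("q2", "2017-10-01"),
    ("q1", "2017-10-13"),
]

# Group the Q values by ISO week number (weeks start on Monday).
by_week = defaultdict(list)
for q, day in rows:
    week_of_year = date.fromisoformat(day).isocalendar()[1]
    by_week[week_of_year].append(q)

for week in sorted(by_week):
    print(week, by_week[week])
```

2017-10-01 is a Sunday, so it lands in the week before 2017-10-03. Turning each week into its own column would then be a pivot on the week number.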

SethuSrinivasan
by New Contributor II
  • 27093 Views
  • 0 replies
  • 1 kudos

Requesting support for "SELECT TOP n from Table"

In a notebook, it looks like I can rely on the "LIMIT" keyword if I need to select the top N rows. It would be nice if you could support "TOP" as well. The current approach to select 10 rows: select * from table1 LIMIT 10. Requesting TOP support: SELECT TOP 10 *...

