- 8789 Views
- 4 replies
- 0 kudos
Hello community, first let me introduce my use case: I receive about 500 million rows daily, like so:
ID | Categories
1 | cat1, cat2, cat3, ..., catn
2 | cat1, catx, caty, ..., anothercategory
Input data: 50 compressed CSV files, each 250 MB ...
Latest Reply
So you are basically creating an inverted index?
3 More Replies
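A minimal PySpark sketch of that inverted-index idea, assuming the input is already a dataframe df with the ID and Categories columns from the question (Categories being a comma-separated string):
from pyspark.sql import functions as F
# one row per (ID, category): split the comma-separated string, then explode
exploded = df.withColumn("category", F.explode(F.split(F.col("Categories"), ",\\s*")))
# invert: for each category, collect every ID that carries it
inverted = exploded.groupBy("category").agg(F.collect_list("ID").alias("ids"))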
- 6358 Views
- 2 replies
- 0 kudos
Scala Spark app: I have a dataset of 130 x 14000. I read it from a Parquet file with SparkSession, then use it for a Spark ML Random Forest model (using a pipeline). It takes 7 hours to complete! Reading the Parquet file takes about 1 minute. If I implemen...
Latest Reply
I've already answered a similar question on StackOverflow, so I'll repeat what I said there.
The following may not solve your problem completely, but it should give you some pointers to start.
The first problem that you are facing is the disproportio...
1 More Replies
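For the 130 x 14000 disproportion the reply hints at, one hedged option (my assumption, not necessarily the answerer's full suggestion) is to shrink the feature space before the forest, e.g. with PCA inside the pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
from pyspark.ml.classification import RandomForestClassifier

# project the 14000 columns down to a small number of components; k=50 is an arbitrary guess
pca = PCA(k=50, inputCol="features", outputCol="pcaFeatures")
rf = RandomForestClassifier(featuresCol="pcaFeatures", labelCol="label")
model = Pipeline(stages=[pca, rf]).fit(trainingData)  # trainingData is hypothetical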
- 5023 Views
- 1 replies
- 0 kudos
I have a PySpark dataframe df containing 4 columns. How can I write this dataframe to an S3 bucket?
I'm using PyCharm to execute the code. What packages are required to be installed?
Latest Reply
You shouldn't need any packages. You can mount the S3 bucket to the Databricks cluster.
https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3
or this
http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark...
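Assuming the mount approach from the first link, a sketch (bucket name, mount point, and output path are hypothetical; prefer IAM roles or secrets over inline keys):
# mount the bucket once, then write through the mount point
dbutils.fs.mount(source="s3a://my-bucket", mount_point="/mnt/my-bucket")
df.write.mode("overwrite").csv("/mnt/my-bucket/output/")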
- 22993 Views
- 8 replies
- 0 kudos
How can we compare two dataframes using PySpark?
I need to validate my output against another dataset.
Latest Reply
I think the best bet in such a case is to take an inner join (equivalent to an intersection), putting a condition on those columns which must have the same value in both dataframes. For example,
let df1 and df2 be two dataframes. df1 has co...
7 More Replies
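A short PySpark sketch along those lines, plus set-difference checks (the key column id is hypothetical; assumes both frames share a schema):
# rows present in one dataframe but not the other
only_in_df1 = df1.subtract(df2)
only_in_df2 = df2.subtract(df1)
# rows that agree on the key columns (the inner join from the reply)
matching = df1.join(df2, on=["id"], how="inner")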
- 7859 Views
- 5 replies
- 0 kudos
I have a SQL dataframe with a column that has a JSON string in it (e.g. {"key":"value"}). When I use spark-csv to save the dataframe, it changes the field values to be "{""key"":""value""}". Is there a way to turn that off?
Latest Reply
Do the quote or escape options only work with "write" instead of "read"? Our source files contain double quotes. We'd like to add a backslash (escape) in front of each double quote before converting the values from our dataframes to JSON outputs.
4 More Replies
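For the write side, one hedged possibility is the CSV writer's quote/escape options (Spark's DataFrameWriter does expose these for CSV; whether it gives exactly the output you want depends on your downstream parser):
# one possibility: emit \" instead of the doubled "" inside quoted fields
(df.write
    .option("quote", "\"")
    .option("escape", "\\")
    .csv("/tmp/out"))  # output path is hypothetical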
- 5642 Views
- 1 replies
- 0 kudos
I am facing a file-not-found exception when I try to move a file with * in DBFS. Both the source and destination directories are in DBFS. I have the source file named "test_sample.csv" available in a DBFS directory, and I am using the command li...
Latest Reply
@bkr, you can reference the file name using dbutils and then pass this to the move command. Here's an example for this in Scala:
val fileNm = dbutils.fs.ls("/usr/krishna/sample").map(_.name).filter(r => r.startsWith("test"))(0)
val fileLoc = "dbfs:/...
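A Python equivalent of that Scala snippet, under the same assumptions about paths (the destination directory is hypothetical):
# list the directory, keep the first file whose name starts with "test", then move it
name = [f.name for f in dbutils.fs.ls("/usr/krishna/sample") if f.name.startswith("test")][0]
dbutils.fs.mv("/usr/krishna/sample/" + name, "/usr/krishna/dest/" + name)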
- 5305 Views
- 5 replies
- 0 kudos
A cartesian product is a common operation to get the cross product of two tables.
For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer - product combinations.
Cartesian pr...
Latest Reply
Hi buddies, this is a great, clearly explained piece of writing; keep up the good work.
4 More Replies
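In DataFrame terms this is just crossJoin; a sketch, where customers and products are hypothetical dataframes (explicit crossJoin needs no special config, while joins with no condition may require spark.sql.crossJoin.enabled on older Spark 2.x):
# every customer paired with every product
pairs = customers.crossJoin(products)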
- 6655 Views
- 2 replies
- 0 kudos
Does Databricks Community Edition provide the Databricks ML visualizations for PySpark, the same as provided in this link for Scala? https://docs.azuredatabricks.net/_static/notebooks/decision-trees.html
Also, please help me to convert this lin...
Latest Reply
Yes, CE supports it, but it isn't supported in Python yet.
1 More Replies
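For the conversion part, a rough PySpark skeleton of a decision-tree pipeline like the linked Scala notebook (all column and variable names here are assumptions; the built-in model visualization itself stays Scala-only per the reply):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# index the string label, assemble raw columns into a feature vector, then fit a tree
indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
model = Pipeline(stages=[indexer, assembler, dt]).fit(trainingData)  # trainingData is hypothetical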
- 3569 Views
- 3 replies
- 0 kudos
I was trying out the hbase-spark connector. To start with, I am trying out this code. My pom dependencies are:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version...
Latest Reply
The alpha of hbase-spark you're using depends on Spark 1.6 -- see hbase-spark/pom.xml:L33 -- so you'll probably have to stick with 1.6 if you want to use that published jar.
For reasons I don't understand, hbase-spark was removed in the last couple o...
2 More Replies
- 13613 Views
- 4 replies
- 0 kudos
I imported a large CSV file into Databricks as a table.
I am able to run SQL queries on it in a Databricks notebook.
In my table, I have a column that contains date information in the mm/dd/yyyy format:
12/29/2015
12/30/2015 etc...
Databricks impo...
Latest Reply
@josephpconley would it be safe to cast a column that contains null values?
3 More Replies
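On the null question: casting a null simply yields null, so it is safe. A hedged PySpark sketch for parsing the mm/dd/yyyy strings (the column name date_str is an assumption):
from pyspark.sql import functions as F
# null inputs simply come out as null dates, so the cast is safe
df2 = df.withColumn("date_parsed", F.to_date(F.col("date_str"), "MM/dd/yyyy"))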
- 12502 Views
- 2 replies
- 0 kudos
I am using Databricks Notebook Community Edition (2.36) and want to upload a local file into DBFS. Are there any simple Hadoop commands like "hadoop fs -put ..."? Any help would be appreciated.
Latest Reply
Python 2.7:
import urllib
urllib.urlretrieve("https://github.com/sushrutt12/DataSets/blob/master/final_chris.zip", "/tmp/chris_data.zip")
dbutils.fs.mv("file:/tmp/chris_data.zip", "dbfs:/data/chris_data.zip")
Python 3.x:
import urllib.request
url...
1 More Replies
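The truncated Python 3.x variant presumably continues along these lines (a sketch, reusing the URL and paths from the reply):
import urllib.request
urllib.request.urlretrieve("https://github.com/sushrutt12/DataSets/blob/master/final_chris.zip", "/tmp/chris_data.zip")
dbutils.fs.mv("file:/tmp/chris_data.zip", "dbfs:/data/chris_data.zip")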
- 5778 Views
- 12 replies
- 0 kudos
I see that https://github.com/apache/spark/pull/18581 will enable defining custom Line Separators for many sources, including CSV. Apart from waiting on this PR to make it into the main Databricks runtime, is there any other alternative to support d...
Latest Reply
You can use newAPIHadoopFile.
Scala:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.s...
11 More Replies
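A PySpark equivalent of that newAPIHadoopFile approach, with the record delimiter passed through the Hadoop conf (the delimiter and path are hypothetical):
rdd = sc.newAPIHadoopFile(
    "/path/to/data.csv",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\r\n"})
lines = rdd.map(lambda kv: kv[1])  # drop the byte-offset key, keep the record text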
- 12458 Views
- 3 replies
- 0 kudos
I've got a table I want to add some data to and it's partitioned. I want to use dynamic partitioning, but I get this error:
org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off ...
Latest Reply
I got it working. This was exactly what I needed. Thank you @Peyman Mohajerian
2 More Replies
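For reference, the setting the error message alludes to is the standard Hive one; a sketch (the table name is hypothetical):
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.write.mode("append").insertInto("my_partitioned_table")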
- 3368 Views
- 1 replies
- 0 kudos
DF:
Q    Date (yyyy-mm-dd)
q1   2017-10-01
q2   2017-10-03
q1   2017-10-09
q3   2017-10-06
q2   2017-10-01
q1   2017-10-13
Q1   2017-10-02
Q3   2017-10-21
Q4   2017-10-17
Q5   2017-10-20
Q4   2017-10-31
Q2   2017-10-27
Q5   2017-10-01
Dataframe:
...
Latest Reply
It should just be a matter of applying the correct set of transformations: You can start by adding the week-of-year to each record with the function pyspark.sql.functions.weekofyear(..) and name it something like weekOfYear. See https://spark.apache.or...
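A sketch of those transformations (the column names Q and Date are taken from the question):
from pyspark.sql import functions as F
# tag each row with its ISO week, then count per (Q, week)
weekly = (df
    .withColumn("weekOfYear", F.weekofyear(F.col("Date")))
    .groupBy("Q", "weekOfYear")
    .count())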
- 9457 Views
- 3 replies
- 1 kudos
Hi!
I am facing an issue when reading and parsing a CSV file. Some records have a newline symbol "escaped" by a \, and the record itself is not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2Field1;...
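No accepted answer is shown here, but one hedged workaround is to undo the backslash-escaped newlines before handing the text to the CSV parser (the paths are assumptions, the separator is the ; from the sample, and wholeTextFiles loads each file fully into memory):
# read whole files, glue escaped continuation lines back together, re-split into records
raw = sc.wholeTextFiles("/path/to/files/*.csv").values()
fixed = raw.flatMap(lambda txt: txt.replace("\\\n", " ").split("\n"))
df = spark.read.csv(fixed, sep=";")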