- 11142 Views
- 1 replies
- 0 kudos
Hello,
I am using PySpark 2.12.
After creating a DataFrame, can we measure the length of the value in each row?
For example: I am measuring the length of a value in column 2.
Input file
|TYCO|1303|
|EMC |120989|
|VOLVO|102329|
|BMW|130157|
|FORD|004|
Output ...
Latest Reply
You can use the length function for this:
from pyspark.sql.functions import length
mock_data = [('TYCO', '1303'),('EMC', '120989'), ('VOLVO', '102329'),('BMW', '130157'),('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])
df2 = d...
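The snippet above is cut off; a minimal complete sketch of the same approach (the column name col2_length is illustrative):
from pyspark.sql.functions import length, col
mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'), ('BMW', '130157'), ('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])
# add a column holding the character length of col2 for each row
df2 = df.withColumn('col2_length', length(col('col2')))
df2.show()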
- 17730 Views
- 4 replies
- 0 kudos
I am running Spark 2.4.4 with Python 2.7, and my IDE is PyCharm.
The input file (.csv) contains encoded values in some columns, like those given below.
The file data looks like:
COL1,COL2,COL3,COL4
CM, 503004, (d$όνυ$F|'.h*Λ!ψμ=(.ξ; ,.ʽ|!3-2-704
The output I am trying ...
Latest Reply
Hi @Rohini Mathur, use the code below on the column containing non-ASCII and special characters:
df['column_name'].str.encode('ascii', 'ignore').str.decode('ascii')
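That snippet uses the pandas string API; a minimal sketch of how it might look end to end (the column name COL3 and the sample value are taken from the question):
import pandas as pd
df = pd.DataFrame({'COL3': ["(d$όνυ$F|'.h*Λ!ψμ=(.ξ;"]})
# encode to ASCII, dropping non-ASCII characters, then decode back to str
df['COL3'] = df['COL3'].str.encode('ascii', 'ignore').str.decode('ascii')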
3 More Replies
- 5967 Views
- 5 replies
- 0 kudos
I have connected my S3 bucket from Databricks using the following command:
import urllib
import urllib.parse
ACCESS_KEY = "Test"
SECRET_KEY = "Test"
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "Test"
MOUNT_NAME = "...
Latest Reply
Hi @akj2784, please go through the Databricks documentation on working with files in S3: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-s3-buckets-with-dbfs
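For reference, the mount pattern from that page looks roughly like this (dbutils and display are Databricks notebook built-ins; MOUNT_NAME and the placeholder credentials are illustrative):
import urllib.parse
ACCESS_KEY = "Test"
SECRET_KEY = "Test"
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "Test"
MOUNT_NAME = "my-bucket"
# mount the bucket under /mnt/<MOUNT_NAME> so it is visible via DBFS
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))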
4 More Replies
- 13491 Views
- 1 replies
- 0 kudos
I am executing the code below using Python in a notebook, and it appears that the col() function is not being recognized.
I want to know if the col() function belongs to any specific DataFrame library or Python library. I don't want to use pyspark...
Latest Reply
@mudassar45@gmail.com, as the documentation describes, col() returns a generic Column that is not yet associated with a DataFrame. Please refer to the code below.
display(peopleDF.select("firstName").filter("firstName = 'An'"))
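The usual cause of col() not being recognized is a missing import; a minimal sketch reusing the peopleDF from the question:
from pyspark.sql.functions import col
# col() lives in pyspark.sql.functions and must be imported explicitly
display(peopleDF.select(col("firstName")).filter(col("firstName") == "An"))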
- 7378 Views
- 2 replies
- 0 kudos
I am new to Spark and just started an online PySpark tutorial. I uploaded the JSON data to Databricks and wrote the commands as follows:
df = sqlContext.sql("SELECT * FROM people_json")
df.printSchema()
from pyspark.sql.types import *
data_schema =...
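The snippet cuts off at the schema definition; a minimal sketch of the usual pattern (the field names and file path are assumptions based on the common people.json tutorial):
from pyspark.sql.types import StructType, StructField, StringType, LongType
data_schema = [StructField('age', LongType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
# apply the explicit schema instead of relying on inference
df = spark.read.json('/FileStore/tables/people.json', schema=final_struc)
df.printSchema()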
- 5844 Views
- 5 replies
- 0 kudos
https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html
Hi all,
just wondering why Databricks Spark is a lot faster on S3 compared with AWS EMR Spark when both systems are on Spark version 2.4. Does Databricks have ...
Latest Reply
I think you can get some pretty good insight into the optimizations on Databricks here: https://docs.databricks.com/delta/delta-on-databricks.html
Specifically, check out the sections on caching, z-ordering, and join optimization. There's also a grea...
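The reply mentions Z-ordering; as a minimal sketch, on a Delta table it is a one-line SQL command (the table name events and column eventType are hypothetical):
# co-locate related records in the same files to speed up filtered reads
spark.sql("OPTIMIZE events ZORDER BY (eventType)")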
4 More Replies
- 6374 Views
- 1 replies
- 0 kudos
Hi Community,
I would like to know if there is an option to create an integer sequence which persists even if the cluster is shut down. My goal is to use this integer value as a surrogate key to join different tables or do Slowly Changing Dimensio...
Latest Reply
Hi @pascalvanbellen, there is no concept of FK, PK, or SK in Spark, but Databricks Delta automatically takes care of SCD-type scenarios: https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html#slowly-changing-data-scd-type-2
...
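If a non-consecutive but unique integer is acceptable as a surrogate key, one hedged sketch offsets monotonically_increasing_id() by the current maximum key (the table and column names are hypothetical):
from pyspark.sql.functions import monotonically_increasing_id, lit
# highest key already assigned in the dimension table (0 if empty)
max_key = existing_dim.agg({'surrogate_key': 'max'}).collect()[0][0] or 0
# IDs are unique and increasing, but not consecutive
new_with_keys = new_rows.withColumn('surrogate_key', monotonically_increasing_id() + lit(max_key + 1))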
- 5529 Views
- 4 replies
- 0 kudos
I am trying to split my Date column, which is currently a string type, into 3 columns: Year, Month, and Date. I use (PySpark):
split_date = pyspark.sql.functions.split(df['Date'], '-')
df = df.withColumn('Year', split_date.getItem(0))
df = df.wit...
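The snippet is truncated; a complete sketch of the same approach, assuming a yyyy-mm-dd string:
from pyspark.sql.functions import split, col
split_date = split(col('Date'), '-')
df = df.withColumn('Year', split_date.getItem(0))
df = df.withColumn('Month', split_date.getItem(1))
df = df.withColumn('Day', split_date.getItem(2))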
- 14118 Views
- 1 replies
- 0 kudos
I have a table in HBase with 1 billion records. I want to filter the records based on a certain condition (by date).
For example:
Dataframe.filter(col(date) === todayDate)
The filter will be applied after all records from the table are loaded into me...
Latest Reply
Hello @senthil kumar, to pass external values to the filter (or where) transformations you can use the "lit" function in the following way: Dataframe.filter(col(date) == lit(todayDate)). Don't know if that helps. Be careful with the schema inferred by th...
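A minimal runnable sketch of that advice (df and its date column are assumed):
from pyspark.sql.functions import col, lit
import datetime
todayDate = datetime.date.today().isoformat()
# lit() wraps the Python value as a Spark literal column
filtered = df.filter(col('date') == lit(todayDate))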
by Tamara • New Contributor III
- 8814 Views
- 8 replies
- 1 kudos
I'd like to access a table on an MS SQL Server (Microsoft). Is it possible from Databricks?
To my understanding, the syntax is something like this (in a SQL Notebook):
CREATE TEMPORARY TABLE jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS ( url...
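A hedged sketch of the equivalent DataFrame read in PySpark (host, database, table, and credentials are placeholders):
# JDBC URL format for Microsoft SQL Server; requires the SQL Server JDBC driver on the cluster
jdbc_url = 'jdbc:sqlserver://<host>:1433;database=<db>'
df = (spark.read.format('jdbc')
      .option('url', jdbc_url)
      .option('dbtable', 'dbo.my_table')
      .option('user', '<user>')
      .option('password', '<password>')
      .load())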
7 More Replies
- 8149 Views
- 4 replies
- 0 kudos
I am getting the below error only with a large dataset (i.e., 15 TB compressed). If my dataset is small (1 TB), I do not get this error.
It looks like it fails at the shuffle stage. The approximate number of mappers is 150,000.
Spark config: spark.sql.warehouse.dir hdfs:...
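Not a confirmed fix, but a common first step for shuffle-stage failures on inputs this large is raising shuffle parallelism; the property is a standard Spark setting, the value is illustrative:
# more, smaller shuffle partitions reduce per-task memory pressure
spark.conf.set('spark.sql.shuffle.partitions', '4000')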
Latest Reply
@Satheessh Chinnusamy, how did you solve the above issue?
3 More Replies
- 8893 Views
- 4 replies
- 0 kudos
Hello community, first let me introduce my use case: I receive 500 million rows daily, like so:
ID | Categories
1 | cat1, cat2, cat3, ..., catn
2 | cat1, catx, caty, ..., anothercategory
Input data: 50 compressed CSV files, each file is 250 MB ...
Latest Reply
So you are basically creating an inverted index?
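An inverted index here would map each category to the IDs that contain it; a minimal sketch on mock rows shaped like the example:
from pyspark.sql.functions import split, explode, col, collect_set
rows = [(1, 'cat1, cat2, cat3'), (2, 'cat1, catx, caty')]
df = spark.createDataFrame(rows, ['ID', 'Categories'])
# one row per (ID, category), then group IDs under each category
inverted = (df.withColumn('category', explode(split(col('Categories'), ',\\s*')))
              .groupBy('category')
              .agg(collect_set('ID').alias('ids')))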
3 More Replies
- 6410 Views
- 2 replies
- 0 kudos
Scala Spark app: I have a dataset of 130 x 14000. I read it from a Parquet file with SparkSession, then use it for a Spark ML Random Forest model (using a pipeline). It takes 7 hours to complete! Reading the Parquet file takes about 1 minute. If I implemen...
Latest Reply
I've already answered a similar question on Stack Overflow, so I'll repeat what I said there.
The following may not solve your problem completely but it should give you some pointer to start.
The first problem that you are facing is the disproportio...
1 More Replies
- 3602 Views
- 3 replies
- 0 kudos
I was trying out hbase-spark connector. To start with, I am trying out this code. My pom dependencies are:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version...
Latest Reply
The alpha of hbase-spark you're using depends on Spark 1.6 -- see hbase-spark/pom.xml:L33 -- so you'll probably have to stick with 1.6 if you want to use that published jar.
For reasons I don't understand, hbase-spark was removed in the last couple o...
2 More Replies
- 3386 Views
- 1 replies
- 0 kudos
DF
Q Date(yyyy-mm-dd)
q1 2017-10-01
q2 2017-10-03
q1 2017-10-09
q3 2017-10-06
q2 2017-10-01
q1 2017-10-13
Q1 2017-10-02
Q3 2017-10-21
Q4 2017-10-17
Q5 2017-10-20
Q4 2017-10-31
Q2 2017-10-27
Q5 2017-10-01
Dataframe:
...
Latest Reply
It should just be a matter of applying the correct set of transformations: you can start by adding the week-of-year to each record with the command pyspark.sql.functions.weekofyear(..) and name it something like weekOfYear. See https://spark.apache.or...
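A minimal sketch of that first step, assuming df matches the example (a Q column and a yyyy-mm-dd Date string, with mixed-case Q values normalized via upper):
from pyspark.sql.functions import weekofyear, to_date, col, upper
weekly = (df.withColumn('Date', to_date(col('Date'), 'yyyy-MM-dd'))
            .withColumn('weekOfYear', weekofyear(col('Date')))
            .groupBy(upper(col('Q')).alias('Q'), 'weekOfYear')
            .count())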