Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

bhaumikg
by New Contributor II
  • 14530 Views
  • 7 replies
  • 2 kudos

Databricks throws the error "SQL DW failed to execute the JDBC query produced by the connector." when pushing a column with a string length greater than 255

I am using Databricks to transform the data and then pushing it into the data lake. The data is pushed in if the length of the string field is 255 characters or less, but it throws the following error if it is longer. "SQL DW failed to execute the JDB...

Latest Reply
bhaumikg
New Contributor II
  • 2 kudos

As suggested by ZAIvR, use append mode and provide maxlength while pushing the data. Overwrite may not work for this unless the Databricks team has fixed the issue.

6 More Replies
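The fix above (append mode plus a wider string length) can be sketched as a write configuration. This is a hedged sketch, not a confirmed recipe: it assumes a SparkSession, a DataFrame `df`, and the Databricks SQL DW connector, whose `maxStrLength` option widens the default 255-character string columns; `jdbc_url` and the staging directory are placeholders.

```python
# Hypothetical sketch: df, jdbc_url, and the tempDir value are assumptions.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", jdbc_url)              # placeholder JDBC connection string
   .option("tempDir", "wasbs://...")     # staging directory (elided)
   .option("dbTable", "target_table")
   .option("maxStrLength", 4000)         # widen string columns beyond 255
   .mode("append")                       # append, not overwrite, per the reply
   .save())
```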
RaghuMundru
by New Contributor III
  • 32070 Views
  • 15 replies
  • 0 kudos

Resolved! I am running a simple count and I am getting an error

Here is the error that I am getting when I run the following query statement=sqlContext.sql("SELECT count(*) FROM ARDATA_2015_09_01").show() ---------------------------------------------------------------------------Py4JJavaError Traceback (most rec...

14 More Replies
SohelKhan
by New Contributor II
  • 10860 Views
  • 3 replies
  • 0 kudos

PySpark DataFrame: Select all but one or a set of columns

In SQL SELECT, some implementations let you write select -col_A to select all columns except col_A. I tried it in Spark 1.6.0 as follows: for a dataframe df with three columns col_A, col_B, col_C: df.select('col_B, 'col_C') # it works df....

Latest Reply
NavitaJain
New Contributor II
  • 0 kudos

@sk777, @zjffdu, @Lejla Metohajrova if your columns are time-series ordered OR you want to maintain their original order... use cols = [c for c in df.columns if c != 'col_A'] df[cols]

2 More Replies
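The list-comprehension trick in the reply above is plain Python; a minimal sketch with a hypothetical column list standing in for df.columns:

```python
# Hypothetical column names standing in for df.columns.
columns = ["col_A", "col_B", "col_C"]

# Keep everything except col_A; with a real DataFrame this feeds
# df.select(*keep) (newer Spark versions also offer df.drop("col_A")).
keep = [c for c in columns if c != "col_A"]
print(keep)  # ['col_B', 'col_C']
```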
Venkata_Krishna
by New Contributor
  • 8653 Views
  • 1 replies
  • 0 kudos

Convert string dataframe column MM/dd/yyyy hh:mm:ss AM/PM to timestamp MM-dd-yyyy HH:mm:ss

How do I convert the string 6/3/2019 5:06:00 AM to a timestamp in the 24-hour format MM-dd-yyyy HH:mm:ss in Python Spark?

Latest Reply
lee
Contributor
  • 0 kudos

You would use a combination of the functions: pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') (documentation) and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') (documentation)from pysp...

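The format-pattern logic can be checked in plain Python with datetime; the corresponding Spark patterns would be "M/d/yyyy h:mm:ss a" for parsing and "MM-dd-yyyy HH:mm:ss" for output (note HH, not hh, for 24-hour time). A sketch:

```python
from datetime import datetime

s = "6/3/2019 5:06:00 AM"
dt = datetime.strptime(s, "%m/%d/%Y %I:%M:%S %p")  # 12-hour parse with AM/PM
out = dt.strftime("%m-%d-%Y %H:%M:%S")             # 24-hour output
print(out)  # 06-03-2019 05:06:00
```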
LaurentThiebaud
by New Contributor
  • 5482 Views
  • 1 replies
  • 0 kudos

Sort within a groupBy with dataframe

Using Spark DataFrame, e.g. myDf .filter(col("timestamp").gt(15000)) .groupBy("groupingKey") .agg(collect_list("aDoubleValue")) I want the collect_list to return the result, but ordered according to "timestamp". i.e. I want the GroupBy results...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Laurent Thiebaud, please use the below format to sort within a groupBy: import org.apache.spark.sql.functions._ df.groupBy("columnA").agg(sort_array(collect_list("columnB")))

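For intuition, the sort-before-collect idea can be shown in plain Python with hypothetical (groupingKey, timestamp, aDoubleValue) rows; in Spark, ordering by a separate timestamp typically means collecting (timestamp, value) pairs and sorting those, since sort_array orders by the collected value itself.

```python
from itertools import groupby

# Hypothetical (groupingKey, timestamp, aDoubleValue) rows.
rows = [("k1", 300, 2.0), ("k1", 100, 1.0), ("k2", 200, 5.0)]

# Sort by key then timestamp, then collect values per key in that order.
rows.sort(key=lambda r: (r[0], r[1]))
collected = {k: [v for _, _, v in grp]
             for k, grp in groupby(rows, key=lambda r: r[0])}
print(collected)  # {'k1': [1.0, 2.0], 'k2': [5.0]}
```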
shampa
by New Contributor
  • 4731 Views
  • 1 replies
  • 0 kudos

How can we compare two dataframes in Spark Scala to find which columns and values differ between the two files?

I have two files and I created two dataframes, prod1 and prod2, out of them. I need to find the records with column names and values that do not match in both dfs. id_sk is the primary key. All the cols are string datatype. dataframe 1 (prod1) id_...

Latest Reply
manojlukhi
New Contributor II
  • 0 kudos

Use a full outer join in Spark SQL.

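The full-outer-join approach boils down to: align rows by the primary key, then compare column by column. A plain-Python sketch with hypothetical data (rows keyed by id_sk, all values strings, as in the question):

```python
# Hypothetical rows keyed by id_sk.
prod1 = {1: {"name": "a", "price": "10"}, 2: {"name": "b", "price": "20"}}
prod2 = {1: {"name": "a", "price": "15"}, 3: {"name": "c", "price": "30"}}

diffs = []
for key in sorted(set(prod1) | set(prod2)):      # full outer: union of keys
    r1, r2 = prod1.get(key), prod2.get(key)
    if r1 is None or r2 is None:
        diffs.append((key, "missing row"))
    else:
        diffs.extend((key, col) for col in r1 if r1[col] != r2[col])
print(diffs)  # [(1, 'price'), (2, 'missing row'), (3, 'missing row')]
```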
supriya
by New Contributor II
  • 10611 Views
  • 12 replies
  • 0 kudos

How to append new column values in a dataframe on behalf of unique IDs

I need to create a new column with data in a dataframe. Example: val test = sqlContext.createDataFrame(Seq( (4L, "spark i j k"), (5L, "l m n"), (6L, "mapreduce spark"), (7L, "apache hadoop"), (11L, "a b c d e spark"), (12L, "b d"), (13L, "spark f g h"), ...

Latest Reply
raela
New Contributor III
  • 0 kudos

@supriya you will have to do a join. import org.apache.spark.sql.functions._ val joined = test.join(tuples, col("id") === col("tupleid"), "inner").select("id", "text", "average")

11 More Replies
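The inner-join pattern in the reply, sketched in plain Python with hypothetical ids and averages (the real thing is the Scala join on id === tupleid shown above):

```python
# (id, text) rows and a lookup of tupleid -> average, mirroring the join.
test_rows = [(4, "spark i j k"), (5, "l m n"), (6, "mapreduce spark")]
averages = {4: 0.5, 5: 0.8}   # hypothetical values; id 6 has no match

# Inner join on id == tupleid, selecting (id, text, average).
joined = [(i, t, averages[i]) for i, t in test_rows if i in averages]
print(joined)  # [(4, 'spark i j k', 0.5), (5, 'l m n', 0.8)]
```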
SohelKhan
by New Contributor II
  • 14554 Views
  • 5 replies
  • 0 kudos

Resolved! Pyspark DataFrame: Converting one column from string to float/double

Pyspark 1.6: DataFrame: Converting one column from string to float/double I have two columns in a dataframe both of which are loaded as string. DF = rawdata.select('house name', 'price') I want to convert DF.price to float. DF = rawdata.select('hous...

Latest Reply
AidanCondron
New Contributor II
  • 0 kudos

Slightly simpler: df_num = df.select(df.employment.cast("float"), df.education.cast("float"), df.health.cast("float")) This works with multiple columns, three shown here.

4 More Replies
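The cast itself is the whole trick; here is the plain-Python equivalent of casting string columns to float, with hypothetical data (in Spark it would be DF.select(DF["house name"], DF.price.cast("float"))):

```python
# Hypothetical (name, price-as-string) rows.
rows = [("house1", "12.5"), ("house2", "20")]

# Cast the string price column to float, keeping the name column as-is.
cast_rows = [(name, float(price)) for name, price in rows]
print(cast_rows)  # [('house1', 12.5), ('house2', 20.0)]
```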
Jean-FrancoisRa
by New Contributor
  • 3463 Views
  • 2 replies
  • 0 kudos

Resolved! Select dataframe columns from a sequence of string

Is there a simple way to select columns from a dataframe with a sequence of strings? Something like val colNames = Seq("c1", "c2") df.select(colNames)

Latest Reply
vEdwardpc
New Contributor II
  • 0 kudos

Thanks. I needed to modify the final lines. val df_new = df.select(column_names_col:_*) df_new.show() Edward

1 More Replies
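The Scala :_* in the accepted fix expands a sequence into varargs; PySpark's df.select(*col_names) plays the same role. A tiny stand-in showing the unpacking (the select here is a hypothetical function, not the real DataFrame method):

```python
def select(*cols):
    # Stand-in for DataFrame.select's varargs signature.
    return list(cols)

col_names = ["c1", "c2"]
selected = select(*col_names)   # same idea as df.select(colNames:_*) in Scala
print(selected)  # ['c1', 'c2']
```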
dheeraj
by New Contributor II
  • 4978 Views
  • 3 replies
  • 0 kudos

How to calculate Percentile of column in a DataFrame in spark?

I am trying to calculate the percentile of a column in a DataFrame. I can't find any percentile_approx function among Spark's aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select per...

Latest Reply
amandaphy
New Contributor II
  • 0 kudos

You can try using: df.registerTempTable("tmp_tbl") val newDF = sql(/* do something with tmp_tbl */) // and continue using newDF

2 More Replies
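For intuition, an exact percentile by linear interpolation (the quantity that percentile_approx approximates) can be sketched in plain Python:

```python
def percentile(values, p):
    """Exact p-th percentile (0 <= p <= 1) by linear interpolation."""
    values = sorted(values)
    k = (len(values) - 1) * p          # fractional rank
    lo = int(k)
    hi = min(lo + 1, len(values) - 1)
    return values[lo] + (values[hi] - values[lo]) * (k - lo)

print(percentile([1, 2, 3, 4], 0.25))  # 1.75
```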
longcao
by New Contributor III
  • 11993 Views
  • 5 replies
  • 0 kudos

Resolved! Writing DataFrame to PostgreSQL via JDBC extremely slow (Spark 1.6.1)

Hi there, I'm just getting started with Spark and I've got a moderately sized DataFrame created from collating CSVs in S3 (88 columns, 860k rows) that seems to be taking an unreasonable amount of time to insert (using SaveMode.Append) into Postgres. I...

Latest Reply
longcao
New Contributor III
  • 0 kudos

In case anyone was curious how I worked around this, I ended up dropping down to Postgres JDBC and using CopyManager to COPY rows in directly from Spark: https://gist.github.com/longcao/bb61f1798ccbbfa4a0d7b76e49982f84

4 More Replies
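The linked gist's idea is to format rows as CSV and stream them through Postgres COPY instead of issuing per-row INSERTs. A minimal Python sketch of the formatting step; the copy_expert call is left as a comment because it needs a live psycopg2 connection, which is an assumption here:

```python
import csv
import io

rows = [(1, "a"), (2, "b")]          # hypothetical partition of rows
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
buf.seek(0)

# With psycopg2 (assumed):
# cur.copy_expert("COPY t (id, name) FROM STDIN WITH CSV", buf)
print(buf.getvalue())  # "1,a\n2,b\n"
```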
UmeshKacha
by New Contributor II
  • 8654 Views
  • 3 replies
  • 0 kudos

How to avoid empty/null keys in DataFrame groupby?

Hi, I have a Spark job which does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. Now my job shuffles huge amounts of data and slows things down because of the shuffling and groupBy. One r...

Latest Reply
silvio
New Contributor II
  • 0 kudos

Hi Umesh, if you want to completely ignore the null/empty values then you could simply filter before you do the groupBy; but are you wanting to keep those values? If you want to keep the null values and avoid the skew, you could try splitting the DataF...

2 More Replies
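The filter-first suggestion, sketched in plain Python with hypothetical key/value pairs; in Spark it would be a df.filter(col("key").isNotNull()) (and a non-empty check) before the groupBy:

```python
# Hypothetical (key, value) pairs with null and empty grouping keys.
data = [("a", 1), (None, 2), ("b", 3), ("", 4)]

# Drop null/empty grouping keys before grouping to cut shuffle volume.
filtered = [(k, v) for k, v in data if k]
print(filtered)  # [('a', 1), ('b', 3)]
```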
MarcLimotte
by New Contributor II
  • 22432 Views
  • 12 replies
  • 0 kudos

Why do I get 'java.io.IOException: File already exists' for saveAsTable with Overwrite mode?

I have a fairly small, simple DataFrame, month: month.schema org.apache.spark.sql.types.StructType = StructType(StructField(month,DateType,true), StructField(real_month,TimestampType,true), StructField(month_millis,LongType,true)) The month Dataframe i...

Latest Reply
ReKa
New Contributor III
  • 0 kudos

Your schema is tight, but make sure that the conversion to it does not throw an exception. Try memory-optimized nodes; you may be fine. My problem was parsing a lot of data from sequence files containing 10K XML files and saving them as a table...

11 More Replies
washim
by New Contributor III
  • 10051 Views
  • 1 replies
  • 0 kudos
Latest Reply
washim
New Contributor III
  • 0 kudos

Got it. Use: from pyspark.mllib.stat import Statistics features = dataset.map(lambda row: row[0:]) corr_mat = Statistics.corr(features, method="pearson")

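What Statistics.corr computes for each pair of columns is plain Pearson correlation; a small pure-Python version for intuition:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0 (perfectly linear)
```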