Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

LaurentThiebaud
by New Contributor
  • 6891 Views
  • 1 replies
  • 0 kudos

Sort within a groupBy with dataframe

Using a Spark DataFrame, e.g. myDf.filter(col("timestamp").gt(15000)).groupBy("groupingKey").agg(collect_list("aDoubleValue")), I want collect_list to return the result ordered according to "timestamp", i.e. I want the GroupBy results...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Laurent Thiebaud, please use the format below to sort within a groupBy: import org.apache.spark.sql.functions._ df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
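For the specific ask of ordering the collected values by "timestamp" (rather than by the values themselves), here is a hedged PySpark sketch: collect structs, sort by the leading timestamp field, then keep only the value. Column names come from the question; the data is a placeholder.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder data with the column names from the question.
df = spark.createDataFrame(
    [("g1", 15100, 1.0), ("g1", 16000, 2.0), ("g1", 15500, 3.0), ("g2", 15800, 4.0)],
    ["groupingKey", "timestamp", "aDoubleValue"],
)

# sort_array on an array of structs sorts by the first struct field (timestamp),
# so the extracted values end up in timestamp order.
result = (
    df.filter(F.col("timestamp") > 15000)
      .groupBy("groupingKey")
      .agg(F.sort_array(F.collect_list(F.struct("timestamp", "aDoubleValue"))).alias("sorted"))
      .withColumn("aDoubleValues", F.col("sorted.aDoubleValue"))
      .drop("sorted")
)
result.show(truncate=False)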

RohiniMathur
by New Contributor II
  • 19137 Views
  • 1 replies
  • 0 kudos

Resolved! Length Value of a column in pyspark

Hello, I am using PySpark 2.12. After creating a DataFrame, can we measure the length value for each row? For example, I am measuring the length of the value in column 2. Input file: |TYCO|1303| |EMC |120989| |VOLVO|102329| |BMW|130157| |FORD|004| Output ...

Latest Reply
lee
Contributor
  • 0 kudos

You can use the length function for this: from pyspark.sql.functions import length mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'), ('BMW', '130157'), ('FORD', '004')] df = spark.createDataFrame(mock_data, ['col1', 'col2']) df2 = d...
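The preview above is cut off; a hedged sketch of how the rest likely continues (the output column name col2_length is an assumption, and spark is the notebook-provided session):

from pyspark.sql.functions import length

mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'),
             ('BMW', '130157'), ('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])

# Add a column holding the character length of col2 for every row.
df2 = df.withColumn('col2_length', length('col2'))
df2.show()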

RohiniMathur
by New Contributor II
  • 24404 Views
  • 4 replies
  • 0 kudos

Removing non-ascii and special character in pyspark

I am running Spark 2.4.4 with Python 2.7, and my IDE is PyCharm. The input file (.csv) contains encoded values in some columns, as shown below. The file data looks like: COL1,COL2,COL3,COL4 CM, 503004, (d$όνυ$F|'.h*Λ!ψμ=(.ξ; ,.ʽ|!3-2-704 The output I am trying ...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Rohini Mathur, use the code below on the column containing non-ASCII and special characters: df['column_name'].str.encode('ascii', 'ignore').str.decode('ascii')
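The reply shows the pandas string API; for a Spark DataFrame, a hedged PySpark sketch that strips everything outside the printable ASCII range (the column name COL3 comes from the post, the path and regex are assumptions):

from pyspark.sql import functions as F

# Read the CSV from the question (path is a placeholder), then keep only
# printable ASCII characters in COL3; everything else is dropped.
df = spark.read.csv("/path/to/input.csv", header=True)
cleaned = df.withColumn("COL3", F.regexp_replace(F.col("COL3"), r"[^\x20-\x7E]", ""))
cleaned.show(truncate=False)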

3 More Replies
akj2784
by New Contributor II
  • 9523 Views
  • 5 replies
  • 0 kudos

How to create a dataframe with the files from S3 bucket

I have connected my S3 bucket from Databricks using the following command: import urllib import urllib.parse ACCESS_KEY = "Test" SECRET_KEY = "Test" ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "") AWS_BUCKET_NAME = "Test" MOUNT_NAME = "...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @akj2784, please go through the Databricks documentation on working with files in S3: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-s3-buckets-with-dbfs
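A hedged sketch of the mount-and-read pattern from that page, continuing the snippet in the question (the mount name and CSV options are assumptions; dbutils and spark are provided by the Databricks notebook):

import urllib.parse

ACCESS_KEY = "Test"                                   # placeholder credentials from the post
SECRET_KEY = "Test"
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "Test"
MOUNT_NAME = "my-s3-mount"                            # hypothetical mount name

# Mount the bucket under /mnt/<MOUNT_NAME>.
dbutils.fs.mount(
    "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME),
    "/mnt/%s" % MOUNT_NAME,
)

# Build a DataFrame from the mounted files (assumes CSVs with a header row).
df = spark.read.csv("/mnt/%s/" % MOUNT_NAME, header=True, inferSchema=True)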

4 More Replies
vaio
by New Contributor II
  • 8391 Views
  • 6 replies
  • 0 kudos

Convert String to Timestamp

I have a dataset with one column of string type ('2014/12/31 18:00:36'). How can I convert it to timestamp type with PySpark?

Latest Reply
gideon
New Contributor II
  • 0 kudos

Hope you don't mind if I ask you to elaborate further for a sharper understanding? See my basketball court layout at https://www.recreationtipsy.com/basketball-court/
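For reference on the original question, a hedged PySpark sketch using to_timestamp (the format string is inferred from the sample value in the post):

from pyspark.sql import functions as F

df = spark.createDataFrame([("2014/12/31 18:00:36",)], ["ts_string"])

# Parse the string into a proper timestamp column.
df = df.withColumn("ts", F.to_timestamp("ts_string", "yyyy/MM/dd HH:mm:ss"))
df.printSchema()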

5 More Replies
Raymond_Hu
by New Contributor
  • 16150 Views
  • 1 replies
  • 0 kudos

ConnectException error

I'm using PySpark on Databricks and trying to pivot a 27,753,444 x 3 matrix. If I do it with a Spark DataFrame: df = df.groupBy("A").pivot("B").avg("C") it takes forever (I canceled it after 2 hours). If I convert it to a pandas dataframe and then pivo...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Raymond_Hu, this means that the driver crashed because of an OOM (out of memory) exception and, after that, it's not able to establish a new connection with the driver. Please try the options below: Try increasing driver-side memory and then retry. You ca...
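Not part of the reply, but a commonly suggested mitigation for slow or memory-heavy pivots is to pass the distinct pivot values explicitly so Spark skips the extra job that collects them; a hedged sketch with placeholder data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the 27,753,444 x 3 matrix in the question.
df = spark.createDataFrame(
    [("a1", "b1", 1.0), ("a1", "b2", 2.0), ("a2", "b1", 3.0)],
    ["A", "B", "C"],
)

# Supplying the pivot values up front avoids the extra pass that discovers them.
known_b_values = ["b1", "b2"]   # placeholder values, not from the post
pivoted = df.groupBy("A").pivot("B", known_b_values).avg("C")
pivoted.show()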

msj50
by New Contributor III
  • 14126 Views
  • 10 replies
  • 1 kudos

Spark Running Really slow - help required

My company urgently needs help; we are having severe performance problems with Spark and will have to switch to a different solution if we don't get to the bottom of it. We are on 1.3.1, using Spark SQL, ORC files with partitions, and caching in me...

Latest Reply
Marco
New Contributor II
  • 1 kudos

In my project, the following solutions were rolled out one by one to improve performance: To store intermediate results, use an in-memory cache instead of HDFS (e.g. Ignite Cache). Only use Spark for complicated data aggregation; for simple results, just do it on d...

9 More Replies
Yogi
by New Contributor III
  • 15642 Views
  • 15 replies
  • 0 kudos

Resolved! Can we pass Databricks output to Azure function body?

Hi, can anyone help me with Databricks and Azure Functions? I'm trying to pass Databricks JSON output to an Azure Function body in an ADF job. Is it possible? If yes, how? If no, what is the alternative to do the same?

Latest Reply
AbhishekNarain_
New Contributor III
  • 0 kudos

@Yogi You can now pass values back to ADF from a notebook. There is a size limit, though, so if you are passing a dataset larger than 2MB, rather write it to storage and consume it directly with Azure Functions. You can pass the file path/refe...
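A minimal sketch of returning a small value from a Databricks notebook so the downstream ADF activity can read it (the payload and path are hypothetical; dbutils is provided by the notebook):

import json

# The exit value becomes the notebook activity's run output in ADF; keep it small,
# and write anything larger than the limit to storage, returning only its path.
result = {"status": "ok", "output_path": "dbfs:/mnt/output/result.json"}   # hypothetical values
dbutils.notebook.exit(json.dumps(result))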

14 More Replies
sobhan
by New Contributor II
  • 10019 Views
  • 3 replies
  • 0 kudos

How can I write Pandas dataframe into avro

I am trying to write a pandas DataFrame into Avro format as below, but I get the following error: AttributeError: 'DataFrame' object has no attribute 'write'. I have tried several options such as: df_2018_pd.write.format("com.databricks.spark.avr...

Latest Reply
Brayden_Cook
New Contributor II
  • 0 kudos

Very complicated question. I think you can get your answer on online sites. There are many online providers like managements writing solutions whose experts provide online help for every type of research paper. I got a lot of assistance from them. No...
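For reference on the actual error: a pandas DataFrame has no .write attribute. A hedged sketch of one common fix, converting to a Spark DataFrame before writing Avro (placeholder data and path; the Avro data source must be available on the cluster):

import pandas as pd

df_2018_pd = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})   # placeholder data

# Convert to a Spark DataFrame, which does have a .write API, then save as Avro.
df_2018_spark = spark.createDataFrame(df_2018_pd)
df_2018_spark.write.format("avro").mode("overwrite").save("/mnt/output/df_2018_avro")   # hypothetical path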

2 More Replies
AdityaDeshpande
by New Contributor II
  • 6189 Views
  • 2 replies
  • 0 kudos

How to maintain Primary Key Column in Databricks Delta Multi Cluster environment

I am trying to replicate the SQL DB-like feature of maintaining primary keys in the Databricks Delta approach, where the data is being written to blob storage such as ADLS2 or AWS S3. I want an auto-incremented primary key feature using Databricks Del...

Latest Reply
girivaratharaja
New Contributor III
  • 0 kudos

Hi @Aditya Deshpande​, there is no PK locking mechanism in Delta. You can use the row_number() function on the df, do a distinct() before the write, and save using Delta.
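A hedged sketch of that suggestion (placeholder data; the ordering column and output path are assumptions):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2)], ["name", "value"])   # placeholder data

# Deduplicate first, then assign a surrogate key with row_number() before writing to Delta.
w = Window.orderBy("name")                       # ordering column is an assumption
keyed = df.distinct().withColumn("pk", F.row_number().over(w))
keyed.write.format("delta").mode("overwrite").save("/mnt/delta/my_table")      # hypothetical path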

1 More Replies
xxMathieuxxZara
by New Contributor
  • 8060 Views
  • 6 replies
  • 0 kudos

Parquet file merging or other optimisation tips

Hi, I need some guidelines for a performance issue with Parquet files. I am loading a set of parquet files using: df = sqlContext.parquetFile(folder_path) My parquet folder has 6 subdivision keys. It was initially OK with a first sample of data...

Latest Reply
User16301467532
New Contributor II
  • 0 kudos

Having a large number of small files or folders can significantly deteriorate the performance of loading the data. The best way is to keep the folders/files merged so that each file is around 64 MB in size. There are different ways to achieve this: your writ...
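One common way to compact existing small parquet files, as a hedged sketch (paths and the target partition count are assumptions):

# Read the small files, repartition into fewer, larger files, and write them back out.
df = spark.read.parquet("/mnt/data/small_files/")                            # hypothetical input path
df.repartition(32).write.mode("overwrite").parquet("/mnt/data/compacted/")   # target count is a guess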

5 More Replies
Maser_AZ
by New Contributor II
  • 18666 Views
  • 1 replies
  • 0 kudos

NameError: name 'col' is not defined

I'm executing the code below using Python in a notebook, and it appears that the col() function is not getting recognized. I want to know if the col() function belongs to any specific DataFrame library or Python library. I don't want to use pyspark...

Latest Reply
MOHAN_KUMARL_N
New Contributor II
  • 0 kudos

@mudassar45@gmail.com, as the documentation describes, col() returns a generic Column not yet associated with a DataFrame. Please refer to the code below: display(peopleDF.select("firstName").filter("firstName = 'An'"))
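The NameError itself usually just means col() has not been imported; a minimal sketch (placeholder data; display() assumes a Databricks notebook):

from pyspark.sql.functions import col   # col() lives in pyspark.sql.functions

peopleDF = spark.createDataFrame([("An", 30), ("Bo", 25)], ["firstName", "age"])   # placeholder data
display(peopleDF.select("firstName").filter(col("firstName") == "An"))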

AnilKumar
by New Contributor II
  • 12378 Views
  • 4 replies
  • 0 kudos

How to solve column header issues in Spark SQL data frame

My code : val name = sc.textFile("/FileStore/tables/employeenames.csv") case class x(ID:String,Employee_name:String) val namePairRDD = name.map(_.split(",")).map(x => (x(0), x(1).trim.toString)).toDF("ID", "Employee_name") namePairRDD.createOrRe...

Latest Reply
evan_matthews1
New Contributor II
  • 0 kudos

Hi, I have the opposite issue. When I run an SQL query through the bulk download as per the standard prc fobasx notebook, the first row of data somehow gets attached to the column headers. When I import the csv file into R using read_csv, R thinks ...
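On the original question, whether the first line of the CSV is treated as data or as column names is controlled by the header option when reading; a hedged PySpark sketch using the path from the post (the options are assumptions):

# header=False keeps the first line as data; the columns are then named explicitly.
df = (spark.read
      .csv("/FileStore/tables/employeenames.csv", header=False)
      .toDF("ID", "Employee_name"))
df.createOrReplaceTempView("employees")
df.show(5)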

3 More Replies
mashaye
by New Contributor
  • 26919 Views
  • 6 replies
  • 2 kudos

How can I call a stored procedure in Spark Sql?

I have seen the following code: val url = "jdbc:mysql://yourIP:yourPort/test? user=yourUsername; password=yourPassword" val df = sqlContext .read .format("jdbc") .option("url", url) .option("dbtable", "people") .load() But I ...

Latest Reply
j500sut
New Contributor III
  • 2 kudos

This doesn't seem to be supported. There is an alternative, but it requires using pyodbc and adding it to your init script. Details can be found here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark I hav...
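A hedged sketch of the pyodbc route (the connection string, driver version, and procedure name are all placeholders; the ODBC driver itself has to be installed via an init script):

import pyodbc

# Placeholder connection details for a SQL Server instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;DATABASE=test;"
    "UID=yourUsername;PWD=yourPassword",
    autocommit=True,
)
cursor = conn.cursor()
cursor.execute("EXEC dbo.my_stored_procedure ?", "some_param")   # hypothetical procedure
conn.close()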

5 More Replies
tripplehay777
by New Contributor
  • 19008 Views
  • 1 replies
  • 0 kudos

How can I create a Table from a CSV file with first column with data in dictionary format (JSON like)?

I have a CSV file with the first column containing data in dictionary form (key: value). [see below] I tried to create a table by uploading the CSV file directly to Databricks, but the file can't be read. Is there a way for me to flatten or conver...

Latest Reply
MaxStruever
New Contributor II
  • 0 kudos

This is apparently a known issue; Databricks has its own CSV format handler which can handle this: https://github.com/databricks/spark-csv. SQL API: the CSV data source for Spark can infer data types: CREATE TABLE cars USING com.databricks.spark.csv OP...
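If the dictionary-like column contains valid JSON, another hedged option is to read the CSV and parse that column with from_json (the path, column name, and key are placeholders):

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

df = spark.read.csv("/FileStore/tables/my_file.csv", header=True)   # hypothetical path

# Parse the dictionary-like column into a map, then pull keys out as regular columns.
parsed = df.withColumn("parsed", F.from_json(F.col("dict_col"), MapType(StringType(), StringType())))
flat = parsed.withColumn("someKey", F.col("parsed").getItem("someKey")).drop("parsed")
flat.show(truncate=False)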

