Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

AnandJ_Kadhi
by New Contributor II
  • 7150 Views
  • 2 replies
  • 1 kudos

Handle comma inside cell of CSV

We are using spark-csv_2.10 (version 1.5.0) and reading a CSV file in which a column contains a comma (" , ") as one of its characters. The problem we are facing is that everything after the comma is treated as a new column, and the data is not interpre...

Latest Reply
User16857282152
Contributor
  • 1 kudos

Take a look here for options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv If a CSV file has commas, the convention is to quote the string that contains the comma. In ...
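For reference, quoted fields are handled by the reader's quote/escape options; a minimal PySpark sketch (the file path and header setting are assumptions):

# Values containing commas must be quoted in the file, e.g. "New York, NY"
df = (spark.read
      .option("header", "true")
      .option("quote", '"')       # character used to quote fields (default ")
      .option("escape", '"')      # escape for quotes embedded inside quoted fields
      .csv("/path/to/file.csv"))  # hypothetical path
df.show()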

1 More Replies
SwapanSwapandee
by New Contributor II
  • 9171 Views
  • 2 replies
  • 0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using a script for CDC Merge in Spark Streaming. I wish to pass column values in selectExpr through a parameter, as the column names for each table would change. When I pass the columns and struct field through a string variable, I am getting an error as...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below? keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")
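The reply above is Scala; a rough PySpark analogue of the same idea, passing each expression as its own argument rather than one concatenated string (the dataframe name is hypothetical):

key_cols = ["col1", "col2"]
struct_cols = ["struct(offset, KAFKA_TS) as otherCols"]
# selectExpr takes one SQL expression per argument, so unpack the list
df2 = df.selectExpr(*(key_cols + struct_cols))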

1 More Replies
CaioIshizaka_Co
by New Contributor
  • 6028 Views
  • 1 reply
  • 0 kudos

Making HTTP post requests on Spark using foreachPartition

Need some help understanding the behaviour of the code below in Spark (using Scala and Databricks). I have a dataframe (reading from S3, if that matters) and would send that data by making HTTP POST requests in batches of 1000 (at most). So I reparti...

Latest Reply
melo08
New Contributor II
  • 0 kudos

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks) I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I repa...
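Since the reply preview only re-quotes the question, here is a minimal sketch of the pattern being asked about, batching rows inside foreachPartition (the endpoint URL and payload shape are hypothetical, and requests is assumed to be installed on the cluster):

import requests

def post_partition(rows):
    batch = []
    for row in rows:
        batch.append(row.asDict())
        if len(batch) == 1000:    # send at most 1000 records per request
            requests.post("https://example.com/ingest", json=batch)  # hypothetical endpoint
            batch = []
    if batch:                     # flush the remainder
        requests.post("https://example.com/ingest", json=batch)

df.repartition(8).foreachPartition(post_partition)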

rba76
by New Contributor
  • 21223 Views
  • 2 replies
  • 0 kudos

Python spark.read.text Path does not exist

Dear all, I want to read files with Python from a storage account. I followed these instructions: https://docs.microsoft.com/en-us/azure/azure-databricks/store-secrets-azure-key-vault. This is my Python code: dbutils.fs.mount(source = "wasbs://contain...

Latest Reply
PRADEEPCHEEKATL
New Contributor II
  • 0 kudos

@rba76 Make sure the helloworld.txt file exists in the container1 folder. I'm able to view the text file using the same commands, as follows: Mount Blob Storage: dbutils.fs.mount( source = "wasbs://sampledata@azure.blob.core.windows.net/Azure", mount_po...
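A minimal sketch of the mount-then-read flow the reply describes (the account, container, secret scope, and key names are all hypothetical):

dbutils.fs.mount(
    source="wasbs://container1@myaccount.blob.core.windows.net",  # hypothetical account/container
    mount_point="/mnt/container1",
    extra_configs={"fs.azure.account.key.myaccount.blob.core.windows.net":
                   dbutils.secrets.get(scope="my-scope", key="storage-key")})  # hypothetical scope/key

df = spark.read.text("/mnt/container1/helloworld.txt")  # the file must exist in the container
df.show()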

1 More Replies
desai_n_3
by New Contributor II
  • 17134 Views
  • 6 replies
  • 0 kudos

Cannot Convert Column to Bool error when converting a dataframe column from string to date type in Python

Hi All, I am trying to convert a dataframe column from string to date type in the format yyyy-MM-DD. I have written a SQL query and stored the result in a dataframe: df3 = sqlContext.sql(sqlString2) df3.withColumn(df3['CalDay'], pd.to_datetime(df...

Latest Reply
JoshuaJames
New Contributor II
  • 0 kudos

Registered to post this, so forgive the formatting nightmare. This is a Python Databricks script function that allows you to convert from string to datetime or date, utilising coalesce: from pyspark.sql.functions import coalesce, to_date def to_dat...
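A self-contained sketch of the coalesce/to_date approach the reply describes, trying several formats and keeping the first that parses (the format list and column name are assumptions):

from pyspark.sql.functions import coalesce, to_date, col

def to_date_multi(column, formats=("yyyy-MM-dd", "dd/MM/yyyy", "MM-dd-yyyy")):
    # to_date returns null when the format does not match; coalesce keeps the first non-null
    return coalesce(*[to_date(column, f) for f in formats])

df3 = df3.withColumn("CalDay", to_date_multi(col("CalDay")))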

5 More Replies
dbansal
by New Contributor
  • 15878 Views
  • 1 reply
  • 0 kudos

How can I add jars ("spark.jars") to pyspark notebook?

I want to add a few custom jars to the Spark conf. Typically they would be submitted along with the spark-submit command, but in a Databricks notebook the Spark session is already initialized. So I want to set the jars in the "spark.jars" property in the...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @dbansal, install the libraries/jars while initialising the cluster. Please go through the documentation on the same below: https://docs.databricks.com/libraries.html#upload-a-jar-python-egg-or-python-wheel
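Besides the UI flow in the linked docs, a cluster-scoped JAR can also be attached through the Libraries REST API; a hedged sketch (the workspace URL, token, cluster ID, and JAR path are all placeholders):

import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder token
requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "<cluster-id>",
          "libraries": [{"jar": "dbfs:/FileStore/jars/custom.jar"}]})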

asher
by New Contributor II
  • 9830 Views
  • 1 reply
  • 0 kudos

List all files in a Blob Container

I am trying to find a way to list all files, and related file sizes, in all folders and all subfolders. I guess these are called blobs in the Databricks world. Anyway, I can easily list all files, and related file sizes, in one single folder, but ...

Latest Reply
asher
New Contributor II
  • 0 kudos

from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='your_acct_name', account_key='your_acct_key')
mylist = []
generator = block_blob_service.list_blobs('rawdata')
for blob in generator:
    mylist.append(...
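A completed sketch along the same lines, recording each blob's name and size (this assumes the legacy azure-storage SDK's BlockBlobService, matching the snippet above; newer SDK versions use BlobServiceClient instead):

from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='your_acct_name',  # placeholder credentials
                                      account_key='your_acct_key')
sizes = []
# list_blobs is recursive: it returns blobs from all "subfolders" in the container
for blob in block_blob_service.list_blobs('rawdata'):
    sizes.append((blob.name, blob.properties.content_length))  # size in bytes
for name, size in sizes:
    print(name, size)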

ammobear
by New Contributor III
  • 82575 Views
  • 11 replies
  • 6 kudos

Resolved! How do I get the current cluster id?

I am adding Application Insights telemetry to my Databricks jobs and would like to include the cluster ID of the job run. How can I access the cluster id at run time? The requirement is that my job can programmatically retrieve the cluster id to in...

Latest Reply
EricBellet
New Contributor III
  • 6 kudos

I fixed it; it should be "'$DB_CLUSTER_ID'"
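For context, inside a notebook the cluster ID is also exposed as a Spark conf tag; a minimal sketch (this config key is widely used but not formally documented, so treat it as an assumption):

# Read the cluster ID from the cluster usage tags set by Databricks
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
print(cluster_id)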

10 More Replies
LaurentThiebaud
by New Contributor
  • 6897 Views
  • 1 reply
  • 0 kudos

Sort within a groupBy with dataframe

Using a Spark DataFrame, e.g. myDf.filter(col("timestamp").gt(15000)).groupBy("groupingKey").agg(collect_list("aDoubleValue")), I want the collect_list to return the result, but ordered according to "timestamp", i.e. I want the GroupBy results...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Laurent Thiebaud, please use the below format to sort within a groupBy: import org.apache.spark.sql.functions._ df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
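A PySpark analogue of the same pattern; note sort_array orders by the collected values themselves, so sorting by a separate timestamp column needs a struct with the timestamp first (column names follow the question):

from pyspark.sql.functions import col, collect_list, sort_array, struct

# Collect (timestamp, value) pairs, sort by timestamp, then keep only the values
result = (myDf
          .filter(col("timestamp") > 15000)
          .groupBy("groupingKey")
          .agg(sort_array(collect_list(struct("timestamp", "aDoubleValue"))).alias("pairs"))
          .withColumn("values", col("pairs.aDoubleValue"))  # drop timestamps, keep order
          .drop("pairs"))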

RohiniMathur
by New Contributor II
  • 19151 Views
  • 1 reply
  • 0 kudos

Resolved! Length Value of a column in pyspark

Hello, I am using PySpark 2.12. After creating a dataframe, can we measure the length of the value in each row? For example, I am measuring the length of the value in column 2. Input file: |TYCO|1303| |EMC |120989| |VOLVO|102329| |BMW|130157| |FORD|004| Output ...

Latest Reply
lee
Contributor
  • 0 kudos

You can use the length function for this:
from pyspark.sql.functions import length
mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'), ('BMW', '130157'), ('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])
df2 = d...
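A completed version of the same approach for reference (the truncated last line is finished here as an assumption; the output column name is hypothetical):

from pyspark.sql.functions import length

mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'),
             ('BMW', '130157'), ('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])
df2 = df.withColumn('col2_length', length('col2'))  # character count per row
df2.show()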

RohiniMathur
by New Contributor II
  • 24415 Views
  • 4 replies
  • 0 kudos

Removing non-ascii and special character in pyspark

I am running Spark 2.4.4 with Python 2.7, and the IDE is PyCharm. The input file (.csv) contains encoded values in some columns, as given below. The file data looks like: COL1,COL2,COL3,COL4 CM, 503004, (d$όνυ$F|'.h*Λ!ψμ=(.ξ; ,.ʽ|!3-2-704 The output I am trying ...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Rohini Mathur, use the code below on the column containing non-ASCII and special characters: df['column_name'].str.encode('ascii', 'ignore').str.decode('ascii')
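Note the reply's .str accessor is pandas syntax; a PySpark equivalent sketch that strips non-ASCII characters from a column with regexp_replace (the column name is hypothetical):

from pyspark.sql.functions import regexp_replace

# Replace every character outside the printable ASCII range with nothing
df = df.withColumn("COL3", regexp_replace("COL3", "[^\\x20-\\x7E]", ""))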

3 More Replies
akj2784
by New Contributor II
  • 9531 Views
  • 5 replies
  • 0 kudos

How to create a dataframe with the files from S3 bucket

I have connected my S3 bucket from Databricks, using the following commands:
import urllib
import urllib.parse
ACCESS_KEY = "Test"
SECRET_KEY = "Test"
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "Test"
MOUNT_NAME = "...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @akj2784, please go through the Databricks documentation on working with files in S3: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-s3-buckets-with-dbfs
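The mount pattern the link describes, continuing the variables from the question (keys and bucket are placeholders; once mounted, read files through the mount point rather than an s3a URL):

import urllib.parse

ACCESS_KEY = "<access-key>"            # placeholder
SECRET_KEY = "<secret-key>"            # placeholder
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "<bucket>"
MOUNT_NAME = "mybucket"

dbutils.fs.mount(
    source=f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@{AWS_BUCKET_NAME}",
    mount_point=f"/mnt/{MOUNT_NAME}")

# Once mounted, files load into a dataframe like regular paths
df = spark.read.csv(f"/mnt/{MOUNT_NAME}/data.csv", header=True)  # hypothetical file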

4 More Replies
vaio
by New Contributor II
  • 8407 Views
  • 6 replies
  • 0 kudos

Convert String to Timestamp

I have a dataset with one column of string type ('2014/12/31 18:00:36'). How can I convert it to timestamp type with PySpark?

Latest Reply
gideon
New Contributor II
  • 0 kudos

Hope you don't mind if I ask you to elaborate further for a sharper understanding? See my basketball court layout at https://www.recreationtipsy.com/basketball-court/
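Since the visible reply doesn't answer the question, here is a minimal PySpark sketch for the conversion, with the pattern matching the sample string's layout (the column name is hypothetical):

from pyspark.sql.functions import to_timestamp

# '2014/12/31 18:00:36' -> timestamp; the pattern must match the string layout
df = df.withColumn("ts", to_timestamp("date_str", "yyyy/MM/dd HH:mm:ss"))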

5 More Replies
Raymond_Hu
by New Contributor
  • 16157 Views
  • 1 reply
  • 0 kudos

ConnectException error

I'm using PySpark on Databricks and trying to pivot a 27753444 x 3 matrix. If I do it with a Spark DataFrame: df = df.groupBy("A").pivot("B").avg("C") it takes forever (I canceled it after 2 hours). If I convert it to a pandas dataframe and then pivo...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Raymond_Hu, this means that the driver crashed because of an OOM (out of memory) exception and, after that, it's not able to establish a new connection with the driver. Please try the options below: Try increasing driver-side memory and then retry. You ca...
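Alongside bigger driver memory, one common mitigation is passing the pivot values explicitly so Spark can skip the expensive distinct-values pass; a hedged sketch (collecting the values first is itself a scan, so a known hard-coded list is cheaper when available):

# Collect (or hard-code) the distinct pivot values once, then pass them in
values = [r["B"] for r in df.select("B").distinct().collect()]
pivoted = df.groupBy("A").pivot("B", values).avg("C")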

msj50
by New Contributor III
  • 14139 Views
  • 10 replies
  • 1 kudos

Spark Running Really slow - help required

My company urgently needs help; we are having severe performance problems with Spark and will have to switch to a different solution if we don't get to the bottom of it. We are on 1.3.1, using Spark SQL, ORC files with partitions, and caching in me...

Latest Reply
Marco
New Contributor II
  • 1 kudos

In my project, the following solutions were applied one by one to improve performance: to store intermediate results, use a memory cache instead of HDFS (e.g. Ignite Cache); only use Spark for complicated data aggregation; for a simple result, just do it on d...

9 More Replies
