Data Engineering

Forum Posts

by senthilkumar • New Contributor

01-16-2017 6:42:09 AM

24358 Views
1 reply
0 kudos

How does the filter condition work in a Spark DataFrame?

I have a table in HBase with 1 billion records. I want to filter the records based on a certain condition (by date). For example: Dataframe.filter(col(date) === todayDate) Filter will be applied after all records from the table will be loaded into me...

Latest Reply

muk1
New Contributor II

12-19-2018 2:11:07 AM

0 kudos

Hello @senthil kumar, to pass external values to the filter (or where) transformations you can use the "lit" function in the following way: Dataframe.filter(col(date) == lit(todayDate)). I don't know if that helps. Be careful with the schema inferred by th...

by DominicRobinson • New Contributor II

12-11-2018 12:13:13 PM

18205 Views
4 replies
0 kudos

Issues with UTF-16 files and unicode characters

Can someone please offer some insight? I've spent days trying to solve this issue. We have the task of loading in hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is an international one and...

Latest Reply

User16817872376
New Contributor III

12-12-2018 2:05:09 PM

0 kudos

You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.
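The decoding step suggested in this reply could look roughly like the sketch below. It is plain Python rather than a Spark UDF, and the function name parse_utf16le_tsv and the sample data are made up for illustration. It assumes the whole file is available as bytes; splitting raw UTF-16 bytes on newline bytes before decoding can cut a two-byte code unit in half, so the sketch decodes first and splits afterwards.

```python
import codecs
from typing import List


def parse_utf16le_tsv(raw: bytes) -> List[List[str]]:
    """Decode UTF-16 little-endian bytes and split them into tab-separated rows."""
    # Drop the byte-order mark if the file starts with one.
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]
    text = raw.decode("utf-16-le")
    rows = []
    for line in text.split("\n"):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if line:  # skip empty lines, e.g. after the final newline
            rows.append(line.split("\t"))
    return rows


# Hypothetical sample: two rows with non-ASCII characters.
sample_bytes = codecs.BOM_UTF16_LE + "1\tZürich\r\n2\t東京\r\n".encode("utf-16-le")
rows = parse_utf16le_tsv(sample_bytes)
print(rows)
```

The same function body could be applied per file inside a UDF or a map over whole-file binary content, if that fits the job.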

3 More Replies

by Tamara • New Contributor III

11-03-2015 4:01:50 AM

15815 Views
8 replies
2 kudos

Resolved! Can I connect to a MS SQL server table in Databricks account?

I'd like to access a table on a MS SQL Server (Microsoft). Is it possible from Databricks? To my understanding, the syntax is something like this (in a SQL Notebook): CREATE TEMPORARY TABLE jdbcTable USING org.apache.spark.sql.jdbc OPTIONS ( url...

Latest Reply

JohnSmith091
New Contributor II

11-27-2018 1:19:31 AM

2 kudos

Thanks for the trick that you have shared with us. I am really amazed to use this informational post. If you are facing MacBook error like MacBook Pro won't turn on black screen then click the link.

7 More Replies

by juan_perez • New Contributor

08-03-2018 7:00:20 AM

15824 Views
2 replies
0 kudos

Write DataFrame into Azure Data Lake Storage

It happens that I am manipulating some data using Azure Databricks. Such data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data I would like to write it back into my data lake. To mount the dat...

Latest Reply

PawanShukla
New Contributor III

09-29-2018 3:36:27 AM

0 kudos

I am new to Azure Databricks and I am trying to write the DataFrame to a mounted ADLS file. But in the command below: dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header","true").csv("/mnt/<mount-name>")

1 More Replies

by SatheesshChinnu • New Contributor III

02-11-2017 5:34:17 PM

13329 Views
4 replies
0 kudos

Resolved! Error: TransportResponseHandler: Still have 1 requests outstanding when connection, occurring only on large dataset.

I am getting the error below only with a large dataset (i.e. 15 TB compressed). If my dataset is small (1 TB) I do not get this error. It looks like it fails at the shuffle stage. The approximate number of mappers is 150,000. Spark config: spark.sql.warehouse.dir hdfs:...

Latest Reply

parikshitbhoyar
New Contributor II

09-03-2018 2:20:26 AM

0 kudos

@Satheessh Chinnusamy how did you solve the above issue?

3 More Replies

by WajdiFATHALLAH • New Contributor

05-18-2017 2:18:23 AM

20576 Views
4 replies
0 kudos

Writing large parquet file (500 million rows / 1000 columns) to S3 takes too much time

Hello community,

First let me introduce my use case. Every day I receive 500 million rows like so:

ID | Categories
1 | cat1, cat2, cat3, ..., catn
2 | cat1, catx, caty, ..., anothercategory

Input data: 50 compressed CSV files, each file is 250 MB ...

Latest Reply

EliasHaydar
New Contributor II

08-13-2018 5:16:32 AM

0 kudos

So you are basically creating an inverted index?

3 More Replies

by z160896 • New Contributor II

08-06-2018 8:37:52 AM

9937 Views
2 replies
0 kudos

Why is Spark very slow with a large number of DataFrame columns?

Scala Spark app: I have a dataset of 130x14000. I read it from a Parquet file with SparkSession, then use it for a Spark ML Random Forest model (using a pipeline). It takes 7 hours to complete! Reading the Parquet file takes about 1 minute. If I implemen...

Latest Reply

EliasHaydar
New Contributor II

08-13-2018 5:11:26 AM

0 kudos

I've already answered a similar question on StackOverflow so I'll repeat what I said there. The following may not solve your problem completely but it should give you some pointers to start. The first problem that you are facing is the disproportio...

1 More Replies

by vin007 • New Contributor

08-02-2018 12:09:24 AM

8463 Views
1 reply
0 kudos

How to store a pyspark dataframe in S3 bucket.

I have a PySpark dataframe df containing 4 columns. How can I write this dataframe to an S3 bucket? I'm using PyCharm to execute the code. Also, what packages need to be installed?

Latest Reply

AndrewSears
New Contributor III

08-04-2018 4:16:04 AM

0 kudos

You shouldn't need any packages. You can mount an S3 bucket to a Databricks cluster. https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3 or this http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark...

by SiddarthaPaturu • New Contributor II

03-31-2016 1:53:51 PM

33521 Views
8 replies
1 kudos

Resolved! Comparing two dataframes

How can we compare two dataframes using PySpark? I need to validate my output against another dataset.

Latest Reply

sbharti
New Contributor II

06-28-2018 6:53:44 AM

1 kudos

I think the best bet in such a case is to take an inner join (equivalent to intersection) by putting a condition on those columns which necessarily need to have the same value in both dataframes. For example, let df1 and df2 be two dataframes. df1 has co...

7 More Replies

by mlm • New Contributor

11-02-2015 10:43:25 AM

15897 Views
5 replies
0 kudos

How to prevent spark-csv from adding quotes to JSON string in dataframe

I have a SQL dataframe with a column that has a JSON string in it (e.g. {"key":"value"}). When I use spark-csv to save the dataframe it changes the field values to be "{""key"":""value""}". Is there a way to turn that off?
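For context, doubling the inner quotes is the standard CSV way of escaping a quote character inside a quoted field, so a CSV-aware reader recovers the original JSON string. The sketch below shows this with Python's standard csv module rather than spark-csv; the helper name write_csv_row is made up for illustration, and it says nothing about which spark-csv options change the behaviour.

```python
import csv
import io


def write_csv_row(fields):
    """Serialize one row with csv.writer's default quoting rules."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


json_value = '{"key":"value"}'

# The field contains double quotes, so the writer wraps it in quotes
# and doubles each inner quote.
csv_line = write_csv_row(["1", json_value])
print(csv_line)

# Reading the line back with a CSV parser restores the original string.
round_tripped = next(csv.reader(io.StringIO(csv_line)))
print(round_tripped[1])
```

If the consumer of the file parses it as CSV, the doubled quotes are therefore harmless; they only become a problem when the file is read as raw text.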

Latest Reply

AshleyPan
New Contributor II

06-14-2018 11:11:44 AM

0 kudos

Do quote or escape options only work with "write" instead of "read"? Our source files contain double quotes. We'd like to add a backslash (escape) in front of each double quote before converting the values from our dataframes to JSON outputs.

4 More Replies

by bkr • New Contributor

06-08-2018 6:29:45 AM

6853 Views
1 reply
0 kudos

How to move files with the same extension in the Databricks file system?

I am facing a file-not-found exception when I try to move a file with * in DBFS. Here both source and destination directories are in DBFS. I have the source file named "test_sample.csv" available in a DBFS directory and I am using the command li...

Latest Reply

ricardo_portill
Databricks Employee

06-08-2018 9:45:33 AM

0 kudos

@bkr, you can reference the file name using dbutils and then pass this to the move command. Here's an example for this in Scala: val fileNm = dbutils.fs.ls("/usr/krishna/sample").map(_.name).filter(r => r.startsWith("test"))(0) val fileLoc = "dbfs:/...

by rlgarris • Databricks Employee

02-10-2016 10:07:06 AM

9792 Views
5 replies
0 kudos

Resolved! How do I get a cartesian product of a huge dataset?

A cartesian product is a common operation to get the cross product of two tables. For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer - product combinations. Cartesian pr...

Latest Reply

Forum_Admin
Contributor

05-10-2018 2:12:21 AM

0 kudos

Hi buddies, it is great written piece entirely defined, continue the good work constantly.

4 More Replies

by Mahesha999 • New Contributor II

04-27-2018 5:52:00 AM

6161 Views
3 replies
0 kudos

Resolving NoClassDefFoundError: org/apache/spark/Logging exception

I was trying out hbase-spark connector. To start with, I am trying out this code. My pom dependencies are: <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId> <version...

Latest Reply

User16301467518
New Contributor II

04-27-2018 7:36:40 AM

0 kudos

The alpha of hbase-spark you're using depends on Spark 1.6 -- see hbase-spark/pom.xml:L33 -- so you'll probably have to stick with 1.6 if you want to use that published jar. For reasons I don't understand hbase-spark was removed in the last couple o...

2 More Replies

by semihcandoken • New Contributor

08-18-2016 9:29:07 PM

17702 Views
4 replies
0 kudos

How to convert column type from str to date in sparksql when the format is not yyyy-mm-dd?

I imported a large csv file into databricks as a table. I am able to run sql queries on it in a databricks notebook. In my table, I have a column that contains date information in the mm/dd/yyyy format : 12/29/2015 12/30/2015 etc... Databricks impo...

Latest Reply

ShubhamGupta187
New Contributor II

04-19-2018 9:37:52 PM

0 kudos

@josephpconley would it be safe to cast a column that contains null values?
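As a rough illustration of both points in this thread (parsing mm/dd/yyyy strings with an explicit pattern, and what happens to null values), here is a plain-Python sketch. It is not Spark SQL; the function name parse_us_date is made up, and it assumes nulls should simply pass through as nulls rather than raise an error.

```python
from datetime import date, datetime
from typing import Optional


def parse_us_date(value: Optional[str]) -> Optional[date]:
    """Parse an mm/dd/yyyy string into a date; None or blank input stays None."""
    if value is None or not value.strip():
        return None
    # %m/%d/%Y matches strings such as 12/29/2015.
    return datetime.strptime(value.strip(), "%m/%d/%Y").date()


# Sample values from the question, plus a null.
parsed = [parse_us_date(v) for v in ["12/29/2015", "12/30/2015", None]]
print(parsed)
```

A string that does not match the pattern raises ValueError here, which makes bad rows visible instead of silently turning them into nulls.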

3 More Replies

by Young_TackPark • New Contributor

01-06-2017 11:22:20 PM

19961 Views
2 replies
0 kudos

Upload local files into DBFS

I am using Databricks Notebook Community Edition (2.36) and want to upload a local file into DBFS. Are there any simple Hadoop commands like "hadoop fs -put ..."? Any help would be appreciated.

Latest Reply

sushrutt_12
New Contributor II

03-15-2018 9:07:08 AM

0 kudos

Python 2.7:
import urllib
urllib.urlretrieve("https://github.com/sushrutt12/DataSets/blob/master/final_chris.zip", "/tmp/chris_data.zip")
dbutils.fs.mv("file:/tmp/chris_data.zip", "dbfs:/data/chris_data.zip")
Python 3.x: import urllib.requesturl...

1 More Replies
