Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

bkr
by New Contributor
  • 6850 Views
  • 1 reply
  • 0 kudos

How to move files of the same extension in the Databricks file system?

I am facing a file not found exception when I am trying to move files with * in DBFS. Here both source and destination directories are in DBFS. I have the source file named "test_sample.csv" available in a dbfs directory and I am using the command li...

Latest Reply
ricardo_portill
New Contributor III
  • 0 kudos

@bkr, you can reference the file name using dbutils and then pass this to the move command. Here's an example for this in Scala:
val fileNm = dbutils.fs.ls("/usr/krishna/sample").map(_.name).filter(r => r.startsWith("test"))(0)
val fileLoc = "dbfs:/...

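A runnable sketch of the same approach in Python, for reference. The source directory comes from the thread; the destination directory and the "test" prefix filter are assumptions. dbutils is the built-in Databricks notebook utility, so no import is needed.

src_dir = "dbfs:/usr/krishna/sample"
dst_dir = "dbfs:/usr/krishna/archive"  # hypothetical destination

# dbutils.fs.mv does not expand wildcards, so list the directory and filter in Python.
for f in dbutils.fs.ls(src_dir):
    if f.name.startswith("test"):
        dbutils.fs.mv(f.path, dst_dir + "/" + f.name)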
rlgarris
by Databricks Employee
  • 9777 Views
  • 5 replies
  • 0 kudos

Resolved! How do I get a cartesian product of a huge dataset?

A cartesian product is a common operation to get the cross product of two tables. For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer - product combinations. Cartesian pr...

Latest Reply
Forum_Admin
Contributor
  • 0 kudos

Hi buddies, it is great written piece entirely defined, continue the good work constantly.

4 More Replies
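Since the accepted answer is not shown in the preview, here is a minimal sketch of a cross product in PySpark, with made-up customer and product DataFrames; crossJoin has been the explicit operator for this since Spark 2.1.

customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
products = spark.createDataFrame([(10, "Widget"), (20, "Gadget")], ["product_id", "product"])

# crossJoin makes the cartesian product explicit instead of relying on a join with no condition.
combos = customers.crossJoin(products)
combos.show()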
Mahesha999
by New Contributor II
  • 6153 Views
  • 3 replies
  • 0 kudos

Resolving NoClassDefFoundError: org/apache/spark/Logging exception

I was trying out the hbase-spark connector. To start with, I am trying out this code. My pom dependencies are:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version...

Latest Reply
User16301467518
New Contributor II
  • 0 kudos

The alpha of hbase-spark you're using depends on Spark 1.6 -- see hbase-spark/pom.xml:L33 -- so you'll probably have to stick with 1.6 if you want to use that published jar. For reasons I don't understand hbase-spark was removed in the last couple o...

2 More Replies
semihcandoken
by New Contributor
  • 17681 Views
  • 4 replies
  • 0 kudos

How to convert a column from string to date in Spark SQL when the format is not yyyy-mm-dd?

I imported a large csv file into Databricks as a table. I am able to run sql queries on it in a Databricks notebook. In my table, I have a column that contains date information in the mm/dd/yyyy format: 12/29/2015, 12/30/2015, etc... Databricks impo...

Latest Reply
ShubhamGupta187
New Contributor II
  • 0 kudos

@josephpconley would it be safe to cast a column that contains null values?

3 More Replies
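A sketch of one way to do the conversion in PySpark, assuming a hypothetical date_str column. to_date with an explicit pattern returns NULL for NULL (or unparseable) input, which also speaks to the follow-up question about null values.

from pyspark.sql import functions as F

df = spark.createDataFrame([("12/29/2015",), ("12/30/2015",), (None,)], ["date_str"])

# Parse the mm/dd/yyyy strings into a proper DateType column.
df = df.withColumn("date", F.to_date("date_str", "MM/dd/yyyy"))
df.show()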
Young_TackPark
by New Contributor
  • 19959 Views
  • 2 replies
  • 0 kudos

upload local files into DBFS

I am using Databricks Notebook Community Edition (2.36) and want to upload a local file into DBFS. Is there any simple Hadoop-like command such as "hadoop fs -put ..."? Any help would be appreciated.

Latest Reply
sushrutt_12
New Contributor II
  • 0 kudos

Python 2.7:
import urllib
urllib.urlretrieve("https://github.com/sushrutt12/DataSets/blob/master/final_chris.zip", "/tmp/chris_data.zip")
dbutils.fs.mv("file:/tmp/chris_data.zip", "dbfs:/data/chris_data.zip")
Python 3.x:
import urllib.request
url...

1 More Replies
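A Python 3 sketch of the same pattern, with a placeholder URL and paths since the snippet above is truncated: download to the driver's local disk, then copy the file into DBFS.

import urllib.request

# Download to the driver's local filesystem, then copy into DBFS.
urllib.request.urlretrieve("https://example.com/data.zip", "/tmp/data.zip")  # placeholder URL
dbutils.fs.cp("file:/tmp/data.zip", "dbfs:/data/data.zip")  # placeholder target path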
ArvindShyamsund
by New Contributor II
  • 12456 Views
  • 12 replies
  • 0 kudos

Resolved! Custom line separator

I see that https://github.com/apache/spark/pull/18581 will enable defining custom Line Separators for many sources, including CSV. Apart from waiting on this PR to make it into the main Databricks runtime, is there any other alternative to support d...

Latest Reply
DanielTomes
New Contributor II
  • 0 kudos

You can use newAPIHadoopFile. SCALA:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.s...

11 More Replies
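The same idea in PySpark, with a made-up input path and "|" as the custom record delimiter: the delimiter is passed to Hadoop's TextInputFormat through the job configuration.

conf = {"textinputformat.record.delimiter": "|"}  # custom line separator
rdd = sc.newAPIHadoopFile(
    "/path/to/input.txt",  # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
)
records = rdd.map(lambda kv: kv[1])  # keep only the text value of each (offset, text) pair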
max522over
by New Contributor II
  • 18393 Views
  • 3 replies
  • 0 kudos

Resolved! I've set the partition mode to nonstrict in hive but spark is not seeing it

I've got a table I want to add some data to and it's partitioned. I want to use dynamic partitioning, but I get this error: org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off ...

Latest Reply
max522over
New Contributor II
  • 0 kudos

I got it working. This was exactly what I needed. Thank you @Peyman Mohajerian

2 More Replies
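For reference, a sketch of applying the nonstrict setting from within the Spark session itself, which is typically what is needed when the Hive-side setting is not picked up; the table and column names below are placeholders.

spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Fully dynamic partition insert (placeholder table/column names).
spark.sql("""
  INSERT INTO TABLE my_partitioned_table PARTITION (dt)
  SELECT col1, col2, dt FROM staging_table
""")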
PrasadGaikwad
by New Contributor
  • 11284 Views
  • 0 replies
  • 0 kudos

TypeError: Column is not iterable when using more than one column in withColumn()

I am trying to find the quarter start date from a date column. I get the expected result when I write it using selectExpr(), but when I add the same logic in .withColumn() I get TypeError: Column is not iterable. I am using a workaround as follows: workarou...

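Since the thread has no replies, here is a sketch of one way to avoid the error, using a hypothetical order_date column. date_trunc takes its format as a plain Python string, so it composes cleanly inside withColumn; passing a Column where a plain value is expected is the usual cause of "Column is not iterable".

from pyspark.sql import functions as F

df = spark.createDataFrame([("2017-05-20",)], ["order_date"])  # hypothetical data

# Truncate to the first day of the quarter; cast back to date for a clean result.
df = df.withColumn(
    "quarter_start",
    F.date_trunc("quarter", F.col("order_date").cast("timestamp")).cast("date"),
)
df.show()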
kkarthik
by New Contributor
  • 6438 Views
  • 1 reply
  • 0 kudos

I want to split a dataframe into one-week date ranges, with each week's data in a different column.

DF:
Q   Date(yyyy-mm-dd)
q1  2017-10-01
q2  2017-10-03
q1  2017-10-09
q3  2017-10-06
q2  2017-10-01
q1  2017-10-13
Q1  2017-10-02
Q3  2017-10-21
Q4  2017-10-17
Q5  2017-10-20
Q4  2017-10-31
Q2  2017-10-27
Q5  2017-10-01
Dataframe: ...

Latest Reply
User16857281974
Contributor
  • 0 kudos

It should just be a matter of applying the correct set of transformations: you can start by adding the week-of-year to each record with the command pyspark.sql.functions.weekofyear(..) and name it something like weekOfYear. See https://spark.apache.or...

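A sketch of the rest of that recipe, recreating a few rows of the thread's data: tag each row with its week of year, then pivot so each week becomes its own column.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("q1", "2017-10-01"), ("q2", "2017-10-03"), ("q1", "2017-10-09"), ("q3", "2017-10-06")],
    ["Q", "date"],
).withColumn("date", F.to_date("date"))

# One column per week of year; each cell collects that week's dates for the Q value.
weekly = (
    df.withColumn("weekOfYear", F.weekofyear("date"))
      .groupBy("Q")
      .pivot("weekOfYear")
      .agg(F.collect_list("date"))
)
weekly.show(truncate=False)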
SethuSrinivasan
by New Contributor II
  • 36864 Views
  • 0 replies
  • 2 kudos

Requesting support for "SELECT TOP n from Table"

In a notebook, it looks like if I need to select the top N rows, I can rely on the "LIMIT" keyword. It would be nice if you could support "TOP" as well. The current approach to select 10 rows: select * from table1 LIMIT 10. Requesting TOP support: SELECT TOP 10 *...

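In the meantime, both the SQL LIMIT form from the post and its DataFrame equivalent cover the same need (table1 is the post's example table name):

top10_sql = spark.sql("SELECT * FROM table1 LIMIT 10")  # SQL form
top10_df = spark.table("table1").limit(10)              # DataFrame API form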
XinZodl
by New Contributor III
  • 19125 Views
  • 3 replies
  • 1 kudos

Resolved! How to parse a file with newline character, escaped with \ and not quoted

Hi! I am facing an issue when reading and parsing a CSV file. Some records have a newline symbol, "escaped" by a \, and that record is not quoted. The file might look like this: Line1field1;Line1field2.1 \ Line1field2.2;Line1field3; Line2FIeld1;...

Latest Reply
XinZodl
New Contributor III
  • 1 kudos

The solution is "sparkContext.wholeTextFiles".

2 More Replies
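A sketch of how that solution can be applied in PySpark, assuming a made-up input path and the ;-separated fields from the example: read each file whole, undo the backslash-escaped newlines, then split into records and fields.

raw = sc.wholeTextFiles("/path/to/input")  # hypothetical path; yields (path, full file content)

records = (
    raw.map(lambda kv: kv[1])                                          # keep the file content
       .flatMap(lambda text: text.replace("\\\n", " ").splitlines())   # undo escaped newlines
       .map(lambda line: line.split(";"))                              # split into fields
)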
kelleyrw
by New Contributor II
  • 13920 Views
  • 7 replies
  • 0 kudos

Resolved! How do I register a UDF that returns an array of tuples in scala/spark?

I'm relatively new to Scala. In the past, I was able to do the following in Python:
def foo(p1, p2):
    import datetime as dt
    dt.datetime(2014, 4, 17, 12, 34)
    result = [
        (1, "1", 1.1, dt.datetime(2014, 4, 17, 1, 0)),
        (2, "2", 2...

Latest Reply
__max
New Contributor III
  • 0 kudos

Hello, just in case, here is an example for the proposed solution above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
val data = Seq(("A", Seq((3,4),(5,6),(7,10))), ("B", Seq((-1,...

6 More Replies
samalexg
by New Contributor III
  • 21953 Views
  • 13 replies
  • 1 kudos

How to add environment variable

Instead of setting the AWS accessKey and secretKey in hadoopConfiguration, I would like to add those as the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. How can I do that in Databricks?

Latest Reply
jric
New Contributor II
  • 1 kudos

It is possible! I was able to confirm that the following post's "Best" answer works: https://forums.databricks.com/questions/11116/how-to-set-an-environment-variable.html FYI for @Miklos Christine and @Mike Trewartha

12 More Replies
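If the goal is only to make the variables visible to driver-side Python code, a minimal notebook-level sketch is below; the key values are placeholders, and executors or the JVM will not see variables set this way (those need cluster-level environment variable configuration).

import os

# Affects only the driver's Python process (and its child processes).
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"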
KiranRastogi
by New Contributor
  • 43839 Views
  • 2 replies
  • 2 kudos

Pandas dataframe to a table

I want to write a pandas dataframe to a table; how can I do this? The write command is not working, please help.

Latest Reply
amy_wang
New Contributor II
  • 2 kudos

Hey Kiran, just taking a stab in the dark, but do you want to convert the Pandas DataFrame to a Spark DataFrame and then write out the Spark DataFrame as a non-temporary SQL table?
import pandas as pd
## Create Pandas Frame
pd_df = pd.DataFrame({u'20...

1 More Replies
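A complete version of that idea, with a made-up pandas frame and table name since the reply above is truncated:

import pandas as pd

pd_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})  # placeholder data

# Convert to a Spark DataFrame, then save it as a (non-temporary) managed table.
spark_df = spark.createDataFrame(pd_df)
spark_df.write.mode("overwrite").saveAsTable("my_table")  # placeholder table name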
