Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

shampa
by New Contributor
  • 4406 Views
  • 1 reply
  • 0 kudos

How can we compare two dataframes in Spark Scala to find the differences between these two files, i.e. which columns and values do not match?

I have two files and I created two dataframes, prod1 and prod2, out of them. I need to find the records with column names and values that are not matching in both the dfs. id_sk is the primary key and all the cols are string datatype. dataframe 1 (prod1) id_...

Latest Reply
manojlukhi
New Contributor II
  • 0 kudos

Use a full outer join in Spark SQL.
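For what it's worth, a minimal PySpark sketch of that full-outer-join approach (the question is in Scala, but the idea translates directly); only id_sk comes from the question, the compared column names are placeholders:

```python
from pyspark.sql import functions as F

# prod1 and prod2 are the dataframes from the question; "name" and "price" are
# placeholder column names used only for illustration. id_sk is the primary key.
compare_cols = ["name", "price"]

joined = prod1.alias("a").join(prod2.alias("b"), on="id_sk", how="full_outer")

# Keep rows where at least one compared column differs (<=> is null-safe equality).
mismatches = joined.where(
    ~F.expr(" AND ".join(f"a.{c} <=> b.{c}" for c in compare_cols))
)

# Show the key plus both versions of each compared column side by side.
mismatches.select(
    "id_sk",
    *[F.col(f"a.{c}").alias(f"{c}_prod1") for c in compare_cols],
    *[F.col(f"b.{c}").alias(f"{c}_prod2") for c in compare_cols],
).show()
```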

ArielHerrera
by New Contributor II
  • 14318 Views
  • 2 replies
  • 0 kudos

Resolved! How to create blank-target links in Markdown so URL links open in new tabs?

I am using Markdown to include URL links, with the syntax [link text](http://example.com). The issue is that each time I click the linked text it opens the URL in the same tab as the notebook. I want the URL to open in a new ta...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Ariel Herrera, you can just put an HTML anchor tag in a Databricks notebook cell; it will open a new tab when you click it. Please try the example below, which works for me in a Databricks notebook: %md <a href="https://google.com" target="_blank">google ...

1 More Replies
cfregly
by Contributor
  • 6535 Views
  • 5 replies
  • 0 kudos
Latest Reply
MatthewValenti
New Contributor II
  • 0 kudos

This is an old post; however, is this still accurate for the latest version of Databricks in 2019? If so, how should I approach the following? 1. Connect to many MongoDBs. 2. Connect to MongoDB when the connection string information is dynamic (i.e. stored in s...
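The thread doesn't show an accepted answer, but one way to handle dynamic connection strings on Databricks is to keep them in a secret scope and pass them as read options; a sketch, assuming the MongoDB Spark connector is attached to the cluster (the scope, key, database, and collection names are made up):

```python
# Assumes a Databricks notebook with the MongoDB Spark connector installed;
# the secret scope/key, database, and collection names are hypothetical.
mongo_uri = dbutils.secrets.get(scope="mongo", key="orders-uri")

orders = (
    spark.read.format("mongodb")   # older connector versions use "mongo" or
                                   # "com.mongodb.spark.sql.DefaultSource"
    .option("connection.uri", mongo_uri)
    .option("database", "shop")
    .option("collection", "orders")
    .load()
)
```

Repeating the same read with a different secret key is how you would fan out over many MongoDB instances.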

4 More Replies
senthilkumar
by New Contributor
  • 15668 Views
  • 1 reply
  • 0 kudos

How does a filter condition work on a Spark dataframe?

I have a table in HBase with 1 billion records. I want to filter the records based on a certain condition (by date), for example: Dataframe.filter(col(date) === todayDate). Will the filter be applied only after all records from the table have been loaded into me...

Latest Reply
muk1
New Contributor II
  • 0 kudos

Hello @senthil kumar, to pass external values to the filter (or where) transformations you can use the "lit" function in the following way: Dataframe.filter(col(date) == lit(todayDate)). Don't know if that helps. Be careful with the schema inferred by th...
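A PySpark sketch of the lit() pattern from that reply; df stands in for the questioner's dataframe and the date value is an assumed external input:

```python
from datetime import date
from pyspark.sql import functions as F

# today_date is the external Python value being passed into the query; lit()
# wraps it as a Spark literal column so it can be compared with the "date" column.
today_date = date.today()

filtered = df.filter(F.col("date") == F.lit(today_date))
```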

DominicRobinson
by New Contributor II
  • 10095 Views
  • 4 replies
  • 0 kudos

Issues with UTF-16 files and Unicode characters

Can someone please offer some insight? I've spent days trying to solve this issue. We have the task of loading in hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is an international one and...

Latest Reply
User16817872376
New Contributor III
  • 0 kudos

You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.
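As an alternative to the textFile-plus-UDF route, recent Spark versions can decode the files directly in the CSV reader; a minimal sketch, with the input path and header setting as assumptions:

```python
# Sketch only: the path is a placeholder. The CSV reader's encoding/charset
# option decodes UTF-16 little endian without a separate decode pass.
df = (
    spark.read
    .option("sep", "\t")              # tab delimiter
    .option("encoding", "UTF-16LE")   # decode UTF-16 LE directly
    .option("header", "true")
    .csv("/mnt/raw/utf16_files/*.txt")
)
```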

3 More Replies
Tamara
by New Contributor III
  • 10208 Views
  • 8 replies
  • 2 kudos

Resolved! Can I connect to an MS SQL Server table from a Databricks account?

I'd like to access a table on an MS SQL Server (Microsoft). Is it possible from Databricks? To my understanding, the syntax is something like this (in a SQL notebook): CREATE TEMPORARY TABLE jdbcTable USING org.apache.spark.sql.jdbc OPTIONS ( url...
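For reference, the same thing through the DataFrame API rather than a SQL notebook; a minimal PySpark sketch in which the server, database, table, and secret names are placeholders (the CREATE TABLE ... USING org.apache.spark.sql.jdbc syntax from the question is the SQL-notebook equivalent):

```python
# Placeholders throughout: server, database, table, and the secret scope/keys.
jdbc_url = "jdbc:sqlserver://<server-name>:1433;database=<database-name>"

sqlserver_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("user", dbutils.secrets.get(scope="jdbc", key="user"))
    .option("password", dbutils.secrets.get(scope="jdbc", key="password"))
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
```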

7 More Replies
juan_perez
by New Contributor
  • 11776 Views
  • 2 replies
  • 0 kudos

Write a DataFrame into Azure Data Lake Storage

I am manipulating some data using Azure Databricks. The data is in Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data, I would like to write it back into my data lake. To mount the dat...

Latest Reply
PawanShukla
New Contributor III
  • 0 kudos

I am new to Azure Databricks and I am trying to write the DataFrame to a mounted ADLS location. I am using the command below: dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header","true").csv("/mnt/<mount-name>")
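A small note on that command: on Spark 2.x+ the com.databricks.spark.csv format string comes from the old external package and isn't needed, since csv() is built in; a sketch with a hypothetical mount path:

```python
# Sketch only: the mount point and output folder are placeholders.
(
    dfGPS.write
    .mode("overwrite")
    .option("header", "true")
    .csv("/mnt/adls-mount/gps_output")
)
```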

1 More Replies
SatheesshChinnu
by New Contributor III
  • 9182 Views
  • 4 replies
  • 0 kudos

Resolved! Error: "TransportResponseHandler: Still have 1 requests outstanding when connection" occurring only on large datasets.

I am getting the error below only with a large dataset (i.e. 15 TB compressed). If my dataset is small (1 TB) I am not getting this error. It looks like it fails at the shuffle stage. The approximate number of mappers is 150,000. Spark config: spark.sql.warehouse.dir hdfs:...

Latest Reply
parikshitbhoyar
New Contributor II
  • 0 kudos

@Satheessh Chinnusamy, how did you solve the above issue?
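The resolution isn't shown in this excerpt; for this class of failure (shuffle fetches timing out on very large stages), these are the Spark settings most commonly tuned, shown here as a sketch with purely illustrative values rather than the poster's actual fix:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the right numbers depend on cluster size and workload.
spark = (
    SparkSession.builder
    .appName("large-shuffle-job")
    .config("spark.network.timeout", "800s")          # give slow shuffle fetches more time
    .config("spark.shuffle.io.maxRetries", "10")      # retry failed block fetches harder
    .config("spark.shuffle.io.retryWait", "30s")
    .config("spark.reducer.maxSizeInFlight", "24m")   # fetch smaller chunks per reducer
    .getOrCreate()
)
```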

3 More Replies
WajdiFATHALLAH
by New Contributor
  • 11469 Views
  • 4 replies
  • 0 kudos

Writing a large parquet file (500 million rows / 1000 columns) to S3 takes too much time

Hello community, first let me introduce my use case. I receive 500 million rows daily, like so: ID | Categories 1 | cat1, cat2, cat3, ..., catn 2 | cat1, catx, caty, ..., anothercategory Input data: 50 compressed CSV files, each file is 250 MB ...

Latest Reply
EliasHaydar
New Contributor II
  • 0 kudos

So you are basically creating an inverted index?
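If an inverted index (category -> list of IDs) is indeed the goal, a rough PySpark sketch of that shape; the column names come from the sample in the question, everything else (the source dataframe and output path) is assumed:

```python
from pyspark.sql import functions as F

# raw stands in for the loaded input with the ID and Categories columns from the
# sample; the output bucket/path is a placeholder.
exploded = raw.select(
    "ID",
    F.explode(F.split(F.col("Categories"), ",\\s*")).alias("category"),
)

inverted = exploded.groupBy("category").agg(F.collect_list("ID").alias("ids"))

inverted.write.mode("overwrite").parquet("s3a://my-bucket/inverted_index/")
```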

3 More Replies
z160896
by New Contributor II
  • 7029 Views
  • 2 replies
  • 0 kudos

Why is Spark very slow with a large number of dataframe columns?

Scala Spark app: I have a dataset of 130 x 14000. I read it from a parquet file with SparkSession, then use it for a Spark ML Random Forest model (using a pipeline). It takes 7 hours to complete! Reading the parquet file takes about 1 minute. If I implemen...

Latest Reply
EliasHaydar
New Contributor II
  • 0 kudos

I've already answered a similar question on StackOverflow so I'll repeat what I said there. The following may not solve your problem completely but it should give you some pointers to start. The first problem that you are facing is the disproportio...

1 More Replies
vin007
by New Contributor
  • 5614 Views
  • 1 reply
  • 0 kudos

How to store a PySpark dataframe in an S3 bucket?

I have a PySpark dataframe df containing 4 columns. How can I write this dataframe to an S3 bucket? I'm using PyCharm to execute the code. Also, what packages are required to be installed?

Latest Reply
AndrewSears
New Contributor III
  • 0 kudos

You shouldn't need any packages. You can mount the S3 bucket to the Databricks cluster: https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3 or this: http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark...
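Once the bucket is reachable (mounted, or via cluster credentials), writing is a plain DataFrame write; a sketch with placeholder paths:

```python
# 1) Through a DBFS mount created beforehand with dbutils.fs.mount(...):
df.write.mode("overwrite").parquet("/mnt/my-s3-mount/output/")

# 2) Or directly against the bucket, if the cluster has credentials for it:
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```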

SiddarthaPaturu
by New Contributor II
  • 25250 Views
  • 8 replies
  • 1 kudos

Resolved! Comparing two dataframes

How can we compare two data frames using PySpark? I need to validate my output against another dataset.

Latest Reply
sbharti
New Contributor II
  • 1 kudos

I think the best bet in such a case is to take an inner join (equivalent to an intersection) by putting a condition on those columns which necessarily need to have the same value in both dataframes. For example, let df1 and df2 be two dataframes. df1 has co...
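Alongside the join approach above, exceptAll() gives the rows that differ in either direction; a minimal PySpark sketch, assuming the two dataframes share the same schema:

```python
# Rows that appear in one dataframe but not the other (Spark 2.4+ for exceptAll;
# use subtract() for set semantics). df1/df2 are the output and reference datasets.
only_in_df1 = df1.exceptAll(df2)
only_in_df2 = df2.exceptAll(df1)

if only_in_df1.count() == 0 and only_in_df2.count() == 0:
    print("dataframes match")
else:
    only_in_df1.show()
    only_in_df2.show()
```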

7 More Replies
mlm
by New Contributor
  • 9326 Views
  • 5 replies
  • 0 kudos

How to prevent spark-csv from adding quotes to a JSON string in a dataframe

I have a SQL dataframe with a column that has a JSON string in it (e.g. {"key":"value"}). When I use spark-csv to save the dataframe, it changes the field values to "{""key"":""value""}". Is there a way to turn that off?

Latest Reply
AshleyPan
New Contributor II
  • 0 kudos

Do the quote or escape options only work with "write" instead of "read"? Our source files contain double quotes. We'd like to add a backslash (escape) in front of each double quote before converting the values from our dataframes to JSON outputs.
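For what it's worth, the quote and escape options are accepted on both read and write by the built-in CSV source; a sketch with placeholder paths, hedged because the exact escaping behaviour differs between the old spark-csv package and newer Spark versions:

```python
# Both reader and writer accept quote/escape options; defaults differ between the
# old spark-csv package and the built-in csv source, so results vary by version.
df.write.mode("overwrite").option("quote", '"').option("escape", "\\").csv("/tmp/json_out")

df_in = (
    spark.read
    .option("quote", '"')
    .option("escape", "\\")
    .csv("/tmp/json_in")
)
```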

4 More Replies
bkr
by New Contributor
  • 5883 Views
  • 1 reply
  • 0 kudos

How to move files with the same extension in the Databricks file system?

I am facing a file-not-found exception when I try to move a file with * in DBFS. Both the source and destination directories are in DBFS. I have the source file named "test_sample.csv" available in a DBFS directory and I am using the command li...

Latest Reply
ricardo_portill
New Contributor III
  • 0 kudos

@bkr, you can reference the file name using dbutils and then pass this to the move command. Here's an example for this in Scala: val fileNm = dbutils.fs.ls("/usr/krishna/sample").map(_.name).filter(r => r.startsWith("test"))(0) val fileLoc = "dbfs:/...
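For reference, the same pattern in a Python notebook cell; the source directory comes from the Scala snippet above, while the destination directory is assumed:

```python
# List the directory, keep files whose names start with "test", and move each one.
files = [f for f in dbutils.fs.ls("/usr/krishna/sample") if f.name.startswith("test")]

for f in files:
    dbutils.fs.mv(f.path, "dbfs:/usr/krishna/archive/" + f.name)  # hypothetical destination
```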

rlgarris
by New Contributor III
  • 6205 Views
  • 5 replies
  • 0 kudos

Resolved! How do I get a cartesian product of a huge dataset?

A cartesian product is a common operation to get the cross product of two tables. For example, say you have a list of customers and a list of your product catalog and want to get the cross product of all customer - product combinations. Cartesian pr...
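For reference, a minimal PySpark sketch of an explicit cross join that broadcasts the smaller side so the large table is not shuffled (the dataframe names are placeholders):

```python
from pyspark.sql.functions import broadcast

# customers is the large table and products the small catalog (both placeholders).
# crossJoin makes the cartesian product explicit, and broadcasting the small side
# avoids shuffling the large one.
combos = customers.crossJoin(broadcast(products))
```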

4 More Replies