Data Engineering

Forum Posts

by Databricks_POC • New Contributor II

12-20-2021 1:14:14 AM

20307 Views
5 replies
6 kudos

Resolved! I want to compare two data frames. In the output I wish to see the unmatched rows and the columns that lead to the differences.

Latest Reply

bhargavi1
New Contributor II

04-28-2022 1:53:19 AM

6 kudos

@vinita shinde did you crack this code?

4 More Replies
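
A minimal sketch of the comparison logic asked about in this thread, written in plain Python rather than Spark so that it runs anywhere. It assumes both data frames share a key column (a hypothetical `id` here) and the same set of columns. The function name `diff_rows` and the sample rows are illustrative and do not come from the thread. In Spark the same idea is usually expressed with a full outer join on the key.

```python
def diff_rows(left, right, key):
    """Return unmatched rows and, for each, the columns whose values differ.

    `left` and `right` are lists of dicts standing in for two data frames.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    report = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l_row, r_row = left_by_key.get(k), right_by_key.get(k)
        if l_row is None or r_row is None:
            # The key exists on one side only.
            side = "right" if l_row is None else "left"
            report.append({key: k, "status": f"only_in_{side}", "columns": []})
            continue
        # Same key on both sides: list the columns that disagree.
        changed = [c for c in l_row if c != key and l_row[c] != r_row.get(c)]
        if changed:
            report.append({key: k, "status": "mismatch", "columns": changed})
    return report


# Hypothetical sample data.
df_a = [
    {"id": 1, "name": "Jon", "city": "Oslo"},
    {"id": 2, "name": "Ann", "city": "Rome"},
    {"id": 3, "name": "Bo", "city": "Lima"},
]
df_b = [
    {"id": 1, "name": "Jon", "city": "Oslo"},
    {"id": 2, "name": "Ann", "city": "Paris"},
    {"id": 4, "name": "Cy", "city": "Kyiv"},
]

for entry in diff_rows(df_a, df_b, "id"):
    print(entry)
```

Rows that match on every column are left out of the report, so an empty list means the two inputs agree.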

by schnee1 • New Contributor III

10-23-2015 6:07:48 AM

9196 Views
8 replies
0 kudos

Access struct elements inside dataframe?

I have a JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and have managed to split the price string into an array of strings. The below creates a data...

Latest Reply

goldentriangle
New Contributor II

08-10-2023 8:26:34 PM

0 kudos

Thanks, Golden Triangle Tour

7 More Replies
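
The conversion described in the question can be sketched in plain Python as follows. This is only an illustration of the split-then-cast step, assuming the price always has the form "<currency> <amount>"; the helper name `parse_price` is hypothetical. In Spark the equivalent would typically be a split on the column followed by a cast to double.

```python
def parse_price(price):
    """Split a string like "USD 5.00" into (currency, amount)."""
    currency, amount = price.split(" ", 1)
    return currency, float(amount)


print(parse_price("USD 5.00"))
```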

by kishorekumar • New Contributor

06-20-2023 5:46:52 AM

1696 Views
1 reply
0 kudos

Silent failure in DataFrameWriter when loading data to Redshift

Context: I'm using DataFrameWriter to load the dataset into Redshift. DataFrameWriter writes the dataset to S3 and then loads the data from S3 into Redshift by issuing the Redshift COPY command. Issue: Infrequently, we observe that the data is present in t...

Latest Reply

Anonymous
Not applicable

06-20-2023 8:23:29 PM

0 kudos

Hi @Kishorekumar Somasundaram, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question. Thanks.

by Erik_L • Contributor II

04-20-2023 4:22:59 PM

2558 Views
3 replies
1 kudos

Resolved! How to keep data in time-based localized clusters after joining?

I have a bunch of data frames from different data sources. They are all time series data ordered by a timestamp column, which is an int32 Unix timestamp. I can join them together by this column and another column, join_idx, which is basically an integer inde...

Latest Reply

Anonymous
Not applicable

04-20-2023 7:16:25 PM

1 kudos

@Erik Louie: If the data frames have different time zones, you can use Databricks' timezone conversion functions to convert them to a common time zone. You can use the from_utc_timestamp or to_utc_timestamp function to convert the timestamp column to ...

2 More Replies

by Vindhya • New Contributor II

04-18-2023 3:41:51 PM

1975 Views
1 reply
0 kudos

Dataframes to Pandas conversion step is failing with exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))"

The dataframe-to-pandas conversion step is failing with the exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))". Please find the screenshot below for more details.

Latest Reply

Anonymous
Not applicable

04-23-2023 9:14:00 PM

0 kudos

Hi @Vindhya D, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

by lambarc • New Contributor II

01-18-2017 1:14:50 PM

13543 Views
7 replies
13 kudos

How to read file in pyspark with “]|[” delimiter

Latest Reply

rohit199912
New Contributor II

01-31-2023 10:59:58 PM

13 kudos

You might also try the options below.
1) Use a different file format: You can try using a different file format that supports multi-character delimiters, such as text or JSON.
2) Use a custom Row class: You can write a custom Row class to parse the multi-...

6 More Replies
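
A small plain-Python sketch of splitting on the "]|[" delimiter from the thread title. It is not the thread's accepted answer; the sample line and the names `DELIMITER`, `PATTERN`, and `split_line` are assumptions made for illustration. The point it shows is that every character of this delimiter is special in a regular expression, so it must be escaped wherever a regex pattern is expected (Spark's split function, for example, takes a pattern).

```python
import re

DELIMITER = "]|["


def split_line(line):
    """Split one line on the literal multi-character delimiter."""
    return line.split(DELIMITER)


# When a regular expression is required instead of a literal string,
# escape the delimiter so that "]", "|" and "[" are matched literally.
PATTERN = re.compile(re.escape(DELIMITER))

sample = "1]|[Alice]|[42"  # hypothetical input line
print(split_line(sample))
print(PATTERN.split(sample))
```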

by Gaurav_784295 • New Contributor III

01-20-2023 1:57:02 AM

2526 Views
2 replies
0 kudos

pyspark.sql.utils.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets

pyspark.sql.utils.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets. I am getting this error while writing. Can anyone please tell me how to resolve it?

Latest Reply

Gaurav_784295
New Contributor III

01-21-2023 8:57:37 AM

0 kudos

I'm trying to run a query on a table and then store the result in another table: query = stream.writeStream.format("delta").foreachBatch(batch_function).option('checkpointLocation', self.checkpoint_loc).trigger(processingTime...

1 More Replies

by lmcglone • New Contributor II

01-11-2023 8:08:37 AM

6351 Views
2 replies
3 kudos

Comparing 2 dataframes and create columns from values within a dataframe

Hi, I have a dataframe that has name and company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["company","name"]
data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-11-2023 8:59:13 AM

3 kudos

You need to join and pivot: df.join(df2, on=[df.company == df2.job_company]).groupBy("company", "name").pivot("job_company").count()

1 More Replies

by SIRIGIRI • Contributor

12-31-2022 5:38:45 AM

1925 Views
3 replies
2 kudos

sharikrishna26.medium.com

Spark Dataframes Schema
Schema inference is not reliable. We have the following problems in schema inference:
- Automatic inferring of schema is often incorrect
- Inferring schema is additional work for Spark, and it takes some extra time
- Schema inference is ...

Latest Reply

Varshith
New Contributor III

01-01-2023 7:05:25 PM

2 kudos

One other difference between those two approaches is that in the schema DDL string approach we use STRING, INT, etc., but in the StructType object approach we can only use Spark data types such as StringType(), IntegerType(), etc.

2 More Replies

by Jack • New Contributor II

05-02-2022 6:43:59 AM

7289 Views
1 reply
1 kudos

Python: Generate new dfs from a list of dataframes using for loop

I have a list of dataframes (for this example 2) and want to apply a for-loop to the list of frames to generate 2 new dataframes. To start, here is my starting dataframe called df_final: First, I create 2 dataframes: df2_b2c_fast, df2_b2b_fast: for x i...

Latest Reply

Aviral-Bhardwaj
Esteemed Contributor III

12-02-2022 1:44:45 AM

1 kudos

thanks

by THIAM_HUATTAN • Valued Contributor

06-29-2022 5:42:51 AM

1828 Views
2 replies
0 kudos

Resolved! Save data from Spark DataFrames to TFRecords

https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/deep-learning/tfrecords-save-load.html
I could not run Cell #2:
java.lang.ClassNotFoundException: --------------------------------------------------------------------------- Py4JJ...

Latest Reply

jose_gonzalez
Databricks Employee

07-05-2022 10:47:39 AM

0 kudos

Hi @THIAM HUAT TAN, which DBR version are you using? Are you using the ML runtime?

1 More Replies

by bhargavi1 • New Contributor II

04-26-2022 3:45:30 AM

1551 Views
1 reply
1 kudos

I want to compare two data frames. In the output I wish to see the unmatched rows and the columns that lead to the differences, and also how to generate a matching percentage for the data frames. If anyone knows how, please help me crack this code. Thanks

Latest Reply

jose_gonzalez
Databricks Employee

06-14-2022 5:16:43 PM

1 kudos

Hi, could you share more details on what you have tried?
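
The post does not define "matching percentage", so the following plain-Python sketch picks one possible definition as an assumption: the share of rows in the first data frame that also occur, value for value, in the second. The function name `matching_percentage` and the sample rows are hypothetical.

```python
def matching_percentage(left, right):
    """Percentage of rows in `left` that also occur, value for value, in `right`."""
    if not left:
        return 0.0
    # Represent each row as a sorted tuple of (column, value) pairs so rows can be hashed.
    right_rows = {tuple(sorted(row.items())) for row in right}
    matched = sum(1 for row in left if tuple(sorted(row.items())) in right_rows)
    return 100.0 * matched / len(left)


# Hypothetical sample data: lists of dicts stand in for data frames.
df_a = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Rome"},
    {"id": 3, "city": "Lima"},
    {"id": 4, "city": "Kyiv"},
]
df_b = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Paris"},
    {"id": 3, "city": "Lima"},
]

print(matching_percentage(df_a, df_b))
```

Other definitions (for example, dividing by the number of distinct keys across both frames) are equally defensible, so the denominator should be chosen to fit the use case.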

by steelman • New Contributor III

04-07-2022 1:11:36 AM

12913 Views
6 replies
8 kudos

Resolved! How to flatten non-standard JSON files in a dataframe

Hello, I have a non-standard JSON file with a nested structure that I have issues with. Here is an example of the JSON file: jsonfile= """[ { "success":true, "numRows":2, "data":{ "58251":{ "invoiceno":"58...

desired format in the dataframe after processing the json file

Latest Reply

Deepak_Bhutada
Contributor III

05-13-2022 9:37:50 AM

8 kudos

@stale stokkereit You can use the function below to flatten the struct field:
import pyspark.sql.functions as F

def flatten_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nest...

5 More Replies
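
Independently of the truncated Spark function in the reply, the shape of the problem can be sketched in plain Python: the records sit under "data" and are keyed by an id, so each id has to become a row. The sample below only mimics the structure visible in the post. The field values, the `amount` field, and the helper name `flatten` are hypothetical.

```python
import json

# Hypothetical sample shaped like the truncated example in the post.
jsonfile = """[
  {"success": true, "numRows": 2,
   "data": {"58251": {"invoiceno": "A-1", "amount": 10.5},
            "58252": {"invoiceno": "A-2", "amount": 7.0}}}
]"""


def flatten(payload):
    """Turn the id-keyed objects under "data" into one flat row per id."""
    rows = []
    for document in payload:
        for record_id, fields in document["data"].items():
            # Keep the dictionary key as an explicit "id" column.
            rows.append({"id": record_id, **fields})
    return rows


rows = flatten(json.loads(jsonfile))
for row in rows:
    print(row)
```

A list of flat dicts like this can then be handed to whichever dataframe library is in use.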

by RiyazAli • Valued Contributor II

06-06-2022 7:21:48 PM

7000 Views
3 replies
3 kudos

Is there a way to CONCAT two dataframes on either axis (row/column) and transpose the dataframe in PySpark?

I'm reshaping my dataframe as per requirement and I came across this situation where I'm concatenating 2 dataframes and then transposing them. I've done this previously using pandas and the syntax for pandas goes as below:
import pandas as pd
df1 = ...

Latest Reply

RiyazAli
Valued Contributor II

06-06-2022 11:45:41 PM

3 kudos

Hi @Kaniz Fatma, I no longer see the answer you posted, but I see you were suggesting to use `union`. As per my understanding, union is used to stack dfs with a similar schema / column names on top of one another. In my situation, I have 2 different...

2 More Replies

by Jack • New Contributor II

06-02-2022 7:44:33 AM

4731 Views
1 reply
1 kudos

Append an empty dataframe to a list of dataframes using for loop in python

I have the following 3 dataframes: I want to append df_forecast to each of df2_CA and df2_USA using a for-loop. However when I run my code, df_forecast is not appending: df2_CA and df2_USA appear exactly as shown above. Here’s the code: df_list=[df2_CA,...

Latest Reply

User16764241763
Honored Contributor

06-05-2022 9:36:22 PM

1 kudos

@Jack Homareau Can you try the union functionality with dataframes? https://sparkbyexamples.com/pyspark/pyspark-union-and-unionall/ And then try to fill NaNs with the desired values?
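
The code in the question is truncated, so its actual cause cannot be confirmed. One common reason for this symptom, offered here only as an assumption, is that the loop rebinds its loop variable to a new object and never stores the result. The plain-Python sketch below uses lists of dicts as stand-ins for dataframes, with hypothetical values.

```python
# Lists of dicts stand in for dataframes; all values are hypothetical.
df2_CA = [{"country": "CA", "sales": 10}]
df2_USA = [{"country": "USA", "sales": 20}]
df_forecast = [{"country": None, "sales": None}]

# Rebinding the loop variable builds a new object each time and
# leaves the objects held in the list unchanged.
for df in [df2_CA, df2_USA]:
    df = df + df_forecast  # the result is lost on the next iteration

unchanged_lengths = (len(df2_CA), len(df2_USA))
print(unchanged_lengths)

# Collecting the results under explicit names keeps them.
appended = {
    name: df + df_forecast
    for name, df in {"CA": df2_CA, "USA": df2_USA}.items()
}
print({name: len(rows) for name, rows in appended.items()})
```

The same pattern applies to any operation that returns a new dataframe instead of modifying one in place, which includes union in Spark.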
