Forum Posts
- 9447 Views
- 8 replies
- 0 kudos
Access struct elements inside dataframe?
I have a JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and have managed to split the price string into an array of strings. The below creates a data...
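One hedged way to finish the conversion, as a minimal sketch rather than the poster's actual code (the SparkSession named spark and the price column name are assumptions):

```python
from pyspark.sql import functions as F

# Extract the numeric portion of a "USD 5.00"-style string and cast it to
# Double so it can feed an MLlib LabeledPoint. "price" is an assumed column.
df = spark.createDataFrame([("USD 5.00",), ("USD 12.50",)], ["price"])
df = df.withColumn(
    "price_value",
    F.regexp_extract("price", r"(\d+\.?\d*)", 1).cast("double"))
df.show()
```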
- 1768 Views
- 1 reply
- 0 kudos
Silent failure in DataFrameWriter when loading data to Redshift
Context: I'm using DataFrameWriter to load the dataset into Redshift. DataFrameWriter writes the dataset to S3, then loads the data from S3 into Redshift by issuing the Redshift COPY command. Issue: Intermittently, we are observing that the data is present in t...

- 0 kudos
Hi @Kishorekumar Somasundaram, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer. Thanks.
- 2663 Views
- 3 replies
- 1 kudos
Resolved! How to keep data in time-based localized clusters after joining?
I have a bunch of data frames from different data sources. They are all time series data, ordered by a column timestamp, which is an int32 Unix timestamp. I can join them together on this and another column, join_idx, which is basically an integer inde...

- 1 kudos
@Erik Louie: If the data frames have different time zones, you can use Databricks' timezone conversion functions to convert them to a common time zone. You can use the from_utc_timestamp or to_utc_timestamp function to convert the timestamp column to ...
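To make the reply concrete, a minimal sketch, assuming the int32 Unix-seconds column is named timestamp (as in the post) and picking an arbitrary target zone:

```python
from pyspark.sql import functions as F

# Casting a numeric column to timestamp interprets it as seconds since the
# epoch (UTC); from_utc_timestamp then shifts it into a chosen zone.
df = df.withColumn("ts", F.col("timestamp").cast("timestamp"))
df = df.withColumn("ts_local", F.from_utc_timestamp("ts", "America/Los_Angeles"))
```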
- 2037 Views
- 1 reply
- 0 kudos
Dataframes to Pandas conversion step is failing with exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))"
The Dataframes-to-Pandas conversion step is failing with the exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))". Please find the screenshot below for more details.

- 0 kudos
Hi @Vindhya D, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...
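A commonly suggested workaround, offered as an assumption rather than a confirmed fix for this exact trace: the exception surfaces during Arrow-based conversion, so shrinking the Arrow batch size (or disabling Arrow) often sidesteps it.

```python
# Try a smaller Arrow batch first.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")
pdf = df.toPandas()

# Blunter fallback: plain (non-Arrow) conversion, slower but it avoids the
# code path that raises the IndexOutOfBoundsException.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = df.toPandas()
```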
- 13962 Views
- 7 replies
- 13 kudos
How to read file in pyspark with “]|[” delimiter
The data looks like this: pageId]|[page]|[Position]|[sysId]|[carId 0005]|[bmw]|[south]|[AD6]|[OP4 There are at least 50 columns and millions of rows. I did try to use the below code to read it: dff = sqlContext.read.format("com.databricks.spark.csv").option...
- 13 kudos
You might also try the below options. 1) Use a different file format: you can try using a file format that supports multi-character delimiters, such as text or JSON. 2) Use a custom Row class: you can write a custom Row class to parse the multi-...
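A sketch of the read-as-text-then-split variant of option 2, assuming the first row carries the column names and /path/to/file is a placeholder:

```python
from pyspark.sql import functions as F

raw = spark.read.text("/path/to/file")
cols = ["pageId", "page", "Position", "sysId", "carId"]   # from the sample row
parts = F.split(raw["value"], r"\]\|\[")                  # split on literal ]|[
df = raw.select([parts.getItem(i).alias(c) for i, c in enumerate(cols)])
df = df.filter(df.pageId != "pageId")                     # drop the header row
```

On Spark 3.0 and later, the CSV reader also accepts a multi-character separator, so .option("sep", "]|[") may work directly.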
- 2639 Views
- 2 replies
- 0 kudos
pyspark.sql.utils.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets
pyspark.sql.utils.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets. I am getting this error while writing; can anyone please tell me how to resolve it?
- 0 kudos
I'm trying to run a query on some table and then store the result in another table.
query = (stream.writeStream
    .format("delta")
    .foreachBatch(batch_function)
    .option('checkpointLocation', self.checkpoint_loc)
    .trigger(processingTime...
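For context, a minimal sketch of the pattern streaming does support: only time-based windows over an event-time column work, so a row-based window spec has to be replaced with F.window plus a watermark (event_time and key are assumed names):

```python
from pyspark.sql import functions as F

agg = (stream
       .withWatermark("event_time", "10 minutes")             # bound the state
       .groupBy(F.window("event_time", "5 minutes"), "key")   # time-based window
       .count())
```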
- 6622 Views
- 2 replies
- 3 kudos
Comparing 2 dataframes and creating columns from values within a dataframe
Hi, I have a dataframe that has name and company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["company","name"]
data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...
- 3 kudos
You need to join and pivot:
df.join(df2, on=[df.company == df2.job_company]) \
    .groupBy("company", "name") \
    .pivot("job_company") \
    .count()
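A runnable sketch of the answer above; df2 and its job_company column are reconstructions from the thread, not the poster's exact data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("company1", "Jon"), ("company2", "Steve"), ("company1", "Mike")],
    ["company", "name"])
df2 = spark.createDataFrame([("company1",), ("company2",)], ["job_company"])

# Join on the company key, then pivot so each job_company becomes a column.
result = (df.join(df2, df.company == df2.job_company)
            .groupBy("company", "name")
            .pivot("job_company")
            .count())
result.show()
```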
- 1995 Views
- 3 replies
- 2 kudos
sharikrishna26.medium.com
Spark Dataframes Schema. Schema inference is not reliable. We have the following problems with schema inference: automatically inferred schemas are often incorrect; inferring a schema is additional work for Spark, and it takes some extra time; schema inference is ...
- 2 kudos
One other difference between those two approaches is that in the schema DDL string approach we use STRING, INT, etc., but in the StructType object approach we can only use Spark data types such as StringType(), IntegerType(), etc.
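A side-by-side sketch of the two approaches the reply contrasts (the column names and file path are hypothetical):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# DDL-string style: plain SQL type names.
ddl_schema = "name STRING, age INT"

# StructType style: Spark data type objects.
struct_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

df1 = spark.read.schema(ddl_schema).json("/path/data.json")
df2 = spark.read.schema(struct_schema).json("/path/data.json")
```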
- 7498 Views
- 1 reply
- 1 kudos
Python: Generate new dfs from a list of dataframes using for loop
I have a list of dataframes (for this example, 2) and want to apply a for-loop to the list of frames to generate 2 new dataframes. To start, here is my starting dataframe, called df_final. First, I create 2 dataframes, df2_b2c_fast and df2_b2b_fast: for x i...
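Since the preview is cut off, here is only a hedged sketch of the usual fix for this pattern: collect the generated frames in a dict keyed by name rather than trying to create new variables inside the loop (all names and the filter condition below are hypothetical):

```python
# Hypothetical frames and filter; the point is the dict-of-dataframes pattern.
frames = {"df2_b2c": df2_b2c, "df2_b2b": df2_b2b}
new_frames = {name + "_fast": d.filter(d.speed == "fast")
              for name, d in frames.items()}
df2_b2c_fast = new_frames["df2_b2c_fast"]
df2_b2b_fast = new_frames["df2_b2b_fast"]
```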
- 1959 Views
- 2 replies
- 0 kudos
Resolved! Save data from Spark DataFrames to TFRecords
https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/deep-learning/tfrecords-save-load.html
I could not run Cell #2: java.lang.ClassNotFoundException ... Py4JJ...
- 0 kudos
Hi @THIAM HUAT TAN, which DBR version are you using? Are you using the ML runtime?
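A sketch under one assumption: a ClassNotFoundException in that cell usually means the spark-tensorflow-connector library is not attached to the cluster. With the connector installed (for example via the ML runtime or as a Maven library), the save looks like this:

```python
# "/path/out" is a placeholder; "tfrecords" is the connector's data source name.
df.write.format("tfrecords").option("recordType", "Example").save("/path/out")
```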
- 1626 Views
- 1 reply
- 1 kudos
- 1 kudos
Hi, could you share more details on what you have tried?
- 13798 Views
- 6 replies
- 8 kudos
Resolved! How to flatten non-standard JSON files in a dataframe
Hello, I have a non-standard JSON file with a nested structure that I have issues with. Here is an example of the JSON file:
jsonfile = """[ { "success":true, "numRows":2, "data":{ "58251":{ "invoiceno":"58...
- 8 kudos
@stale stokkereit You can use the below function to flatten the struct fields:
import pyspark.sql.functions as F

def flatten_df(nested_df):
    # keep non-struct columns as-is; expand each struct field to a top-level column
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    return nested_df.select(
        flat_cols +
        [F.col(nc + '.' + c).alias(nc + '_' + c)
         for nc in nested_cols
         for c in nested_df.select(nc + '.*').columns])
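A short usage sketch (raw_df is an assumed name for the parsed JSON): since structs can nest several levels deep, apply the function until no struct columns remain.

```python
flat = flatten_df(raw_df)
while any(t.startswith("struct") for _, t in flat.dtypes):
    flat = flatten_df(flat)
```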
- 7177 Views
- 3 replies
- 3 kudos
Is there a way to CONCAT two dataframes on either of the axis (row/column) and transpose the dataframe in PySpark?
I'm reshaping my dataframe as per requirements, and I came across this situation where I'm concatenating 2 dataframes and then transposing them. I've done this previously using pandas, and the syntax goes as below:
import pandas as pd
df1 = ...
- 3 kudos
Hi @Kaniz Fatma, I no longer see the answer you've posted, but I see you were suggesting to use `union`. As per my understanding, union is used to stack dfs one upon another with similar schemas / column names. In my situation, I have 2 different...
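A hedged sketch covering both axes, since Spark has no direct pandas-style concat: for axis=1, a common workaround is to give each frame a row index and join on it (this assumes the existing row order is acceptable); axis=0 is just a union.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# axis=1 analogue: index both frames, then join on the index.
w = Window.orderBy(F.monotonically_increasing_id())
a = df1.withColumn("_idx", F.row_number().over(w))
b = df2.withColumn("_idx", F.row_number().over(w))
side_by_side = a.join(b, "_idx").drop("_idx")

# axis=0 analogue: stack rows, tolerating differing columns (Spark 3.1+).
stacked = df1.unionByName(df2, allowMissingColumns=True)
```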
- 4907 Views
- 1 reply
- 1 kudos
Append an empty dataframe to a list of dataframes using a for-loop in Python
I have the following 3 dataframes. I want to append df_forecast to each of df2_CA and df2_USA using a for-loop. However, when I run my code, df_forecast is not appended: df2_CA and df2_USA appear exactly as shown above. Here's the code:
df_list=[df2_CA,...
- 1 kudos
@Jack Homareau Can you try the union functionality with dataframes? https://sparkbyexamples.com/pyspark/pyspark-union-and-unionall/ Then try to fill the NaNs with the desired values.
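A sketch of the reply's suggestion, assuming 0 as the fill value (the post does not say which values are desired). Note that appending inside a for-loop never mutates df2_CA / df2_USA in place, so the results must be reassigned:

```python
df_list = [df2_CA, df2_USA]
df_list = [d.unionByName(df_forecast, allowMissingColumns=True).fillna(0)
           for d in df_list]
df2_CA, df2_USA = df_list
```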