Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

cmotla
by New Contributor III
  • 1630 Views
  • 3 replies
  • 8 kudos

Issue with complex json based data frame select

We are getting the below error when trying to select the nested columns (string type in a struct), even though we don't have more than 1,000 records in the data frame. The schema is very complex and has a few columns of struct type and a few of array typ...

Latest Reply
Kaniz_Fatma
Community Manager
  • 8 kudos

Hi @Chaitanya Motla​ , Just a friendly follow-up. Do you still need help, or did you find the solution? Please let us know.

  • 8 kudos
2 More Replies
LanceYoung
by New Contributor III
  • 7401 Views
  • 7 replies
  • 6 kudos

Resolved! Unable to make Databricks API calls from an HTML iframe rendered by a notebook's `displayHTML()` call, due to the browser enforcing CORS policy.

My Goal: I want to make my Databricks Notebooks more interactive and have custom HTML/JS UI widgets that guide non-technical people through a business/data process. I want the HTML/JS widget to be able to execute a DB job, or execute some python code t...

Latest Reply
Kaniz_Fatma
Community Manager
  • 6 kudos

Hi @Lance Young​ , Just a friendly follow-up. Do you still need help, or have you resolved your problem using the above solutions? Please let us know.

  • 6 kudos
6 More Replies
MartinB
by Contributor III
  • 15191 Views
  • 26 replies
  • 6 kudos

Resolved! Does partition pruning / partition elimination not work for folder partitioned JSON files? (Spark 3.1.2)

Imagine the following setup: I have log files stored as JSON files partitioned by year, month, day and hour in physical folders: """ /logs |-- year=2020 |-- year=2021 `-- year=2022 |-- month=01 `-- month=02 |-- day=01 |-- day=.....

Latest Reply
MartinB
Contributor III
  • 6 kudos

@Kaniz Fatma​  could you maybe involve a Databricks expert?

  • 6 kudos
25 More Replies
Jana
by New Contributor III
  • 5614 Views
  • 8 replies
  • 2 kudos

Resolved! Parsing 5 GB json file is running long on cluster

I was creating a Delta table from a JSON input file in ADLS, but the job ran for a long time while creating the Delta table from the JSON. Below is my cluster configuration. Is the issue related to the cluster config? Do I need to upgrade the cluster config? The cluster ...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

With multiline = true, the JSON is read as a whole and processed as such. I'd try a beefier cluster.

  • 2 kudos
7 More Replies
SailajaB
by Valued Contributor III
  • 5013 Views
  • 12 replies
  • 4 kudos

Resolved! JSON validation is getting failed after writing Pyspark dataframe to json format

Hi, we have to convert a transformed dataframe to JSON format, so we used write with the json format on the final dataframe. But when we validate the output JSON, it is not in proper JSON format. Could you please provide your suggestio...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Sailaja B​ - Does @Aman Sehgal​'s most recent answer help solve the problem? If it does, would you be happy to mark their answer as best?

  • 4 kudos
11 More Replies
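
One frequent cause of this symptom, offered here as an assumption since the accepted answer is not visible in the excerpt, is that Spark's JSON writer emits one JSON object per line (JSON Lines) rather than a single JSON array, so a validator expecting one document rejects the file. A minimal standard-library sketch with made-up sample data:

```python
import json

# Hypothetical sample of a Spark JSON part file:
# one complete JSON object per line (JSON Lines), no enclosing array.
part_file_text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'


def parse_json_lines(text):
    """Parse each non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_json_array(text):
    """Re-serialize JSON Lines text as a single JSON array document."""
    return json.dumps(parse_json_lines(text))


# Parsing the whole file as one JSON document fails...
try:
    json.loads(part_file_text)
    whole_file_is_valid = True
except json.JSONDecodeError:
    whole_file_is_valid = False

# ...while parsing line by line succeeds.
records = parse_json_lines(part_file_text)
array_document = to_json_array(part_file_text)
```

If the downstream consumer needs a single array document, `to_json_array` shows one way to produce it for small files; for large outputs a streaming approach would be more appropriate.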
SailajaB
by Valued Contributor III
  • 2736 Views
  • 4 replies
  • 6 kudos

Resolved! how to create a nested(unflatten) json from flatten json

Hi, is there any function in PySpark that can convert flattened JSON to nested JSON? For example, if the flattened attribute is a_b_c : 23, then the unflattened form should be {"a":{"b":{"c":23}}}. Thank you.

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 6 kudos

As @Chuck Connell​ said, can you share more of your source JSON, as that example is not JSON? Additionally, flattening usually means changing something like {"status": {"A": 1,"B": 2}} to {"status.A": 1, "status.B": 2}, which can be done easily with spark da...

  • 6 kudos
3 More Replies
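
The transformation asked about in this thread (a_b_c : 23 becoming {"a":{"b":{"c":23}}}) can be sketched in plain Python. This is not a built-in PySpark function; `unflatten` and its `sep` parameter are hypothetical names, and the sketch assumes the separator never appears inside a real field name.

```python
def unflatten(flat, sep="_"):
    """Turn {"a_b_c": 23} into {"a": {"b": {"c": 23}}}.

    Assumes `sep` only ever separates nesting levels, and that no key is
    both a leaf and a prefix of another key (e.g. "a" and "a_b").
    """
    nested = {}
    for key, value in flat.items():
        parts = key.split(sep)
        node = nested
        # Walk (or create) the intermediate dictionaries.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # The last part holds the actual value.
        node[parts[-1]] = value
    return nested


example = unflatten({"a_b_c": 23, "a_b_d": 7, "x": 1})
```

On a dataframe, the same idea would be applied per row or expressed with struct columns; which fits better depends on the actual source data, which the thread excerpt does not show.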
cconnell
by Contributor II
  • 613 Views
  • 2 replies
  • 1 kudos

www.linkedin.com

Importing JSON to Databricks (PySpark) is simple in the simple case. But of course there are wrinkles for real-world data. Here are some tips/tricks to help... https://www.linkedin.com/pulse/json-databricks-pyspark-chuck-connell/

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @Chuck Connell​ , Thank you for sharing such an amazing article!

  • 1 kudos
1 More Replies
SailajaB
by Valued Contributor III
  • 1330 Views
  • 4 replies
  • 4 kudos

facing format issue while converting one type nested json to other brand new json schema

Hi, we are writing our flattened JSON dataframe to a user-defined nested-schema JSON using PySpark in Databricks, but we are not getting the expected format. Expecting: {"ID":"aaa","c_id":[{"con":null,"createdate":"2015-10-09T00:00:00Z","data":null,"id":"1"},...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

As @werners said, you need to share the code. If it is dataframe to JSON, you probably need to use StructType - Array to get that list, but without the code it is hard to help.

  • 4 kudos
3 More Replies
Braxx
by Contributor II
  • 8205 Views
  • 12 replies
  • 2 kudos

Resolved! Validate a schema of json in column

I have a dataframe like the one below, with col2 as key-value pairs. I would like to filter col2 to only the rows with a valid schema. There could be many pairs, sometimes fewer, sometimes more, and this is fine as long as the structure is valid. Nulls in col...

df
Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Bartosz Wachocki​ - Thank you for sharing your solution and marking it as best.

  • 2 kudos
11 More Replies
Orianh
by Valued Contributor II
  • 5708 Views
  • 7 replies
  • 3 kudos

Resolved! Read JSON with backslash.

Hello guys. I'm trying to read a JSON file that contains backslashes, and PySpark fails to read it. I tried a lot of options but haven't solved this yet. I thought of reading the whole JSON as text and replacing all "\" with "/", but PySpark fails to read it as te...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@orian hindi​ - Would you be happy to post the solution you came up with and then mark it as best? That will help other members.

  • 3 kudos
6 More Replies
D3nnisd
by New Contributor III
  • 10454 Views
  • 15 replies
  • 6 kudos

Resolved! BufferHolder Exceeded on Json flattening

On Databricks, we use the following code to flatten JSON in Python. The data is from a REST API:
```
df = spark.read.format("json").option("header", "true").option("multiline", "true").load(SourceFileFolder + sourcetable + "*.json")
df2 = df.select(psf....
```

Latest Reply
Dan_Z
Honored Contributor
  • 6 kudos

@Dennis D​ , what's happening here is that more than 2 GB (2147483648 bytes) is being loaded into a single column value. This is a hard limit for serialization. This KB article addresses it. The solution would be to find some way to have this loaded ...

  • 6 kudos
14 More Replies
Kaniz_Fatma
by Community Manager
  • 12110 Views
  • 2 replies
  • 1 kudos
Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 1 kudos

Assuming that the S3 bucket is mounted in the workspace, you can provide a file path. If you want to write a PySpark DF, then you can do something like the following: df.write.format('json').save('/path/to/file_name.json') You could also use the json py...

  • 1 kudos
1 More Replies
User16856693631
by New Contributor II
  • 1147 Views
  • 2 replies
  • 0 kudos

Can you create Clusters via a REST API?

Yes, you can. See here: https://docs.databricks.com/dev-tools/api/latest/clusters.html The JSON payload would look as follows: { "cluster_name": "my-cluster", "spark_version": "7.3.x-scala2.12", "node_type_id": "i3.xlarge", "spark_conf": { ...

Latest Reply
ManishPatil
New Contributor II
  • 0 kudos

One can create cluster(s) using the Clusters API at https://docs.databricks.com/dev-tools/api/latest/clusters.html#create However, REST API 2.0 doesn't provide certain features like "Enable Table Access Control", which has been introduced after REST API ...

  • 0 kudos
1 More Replies
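
As a rough illustration of the request described in this thread, the sketch below builds (but does not send) a POST to the `/api/2.0/clusters/create` endpoint using only the standard library. The workspace URL, the token, and the `num_workers` value are assumptions for the example; the `spark_conf` section is left out because it is truncated in the excerpt.

```python
import json
import urllib.request


def build_create_cluster_request(host, token, payload):
    """Build (but do not send) a POST request for the clusters/create endpoint."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "cluster_name": "my-cluster",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,  # assumption: not visible in the truncated payload above
}

request = build_create_cluster_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "dapi-EXAMPLE-TOKEN",  # hypothetical personal access token
    payload,
)
# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is created in the workspace.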
MithuWagh
by New Contributor
  • 5676 Views
  • 1 replies
  • 0 kudos

How to deal with column name with .(dot) in pyspark dataframe??

We are streaming data from a Kafka source with JSON, but in some columns we are getting a . (dot) in the column names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")

  • 0 kudos
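
The backtick suggestion above can be wrapped in a small helper when column names arrive dynamically. `quote_column` is a hypothetical name, and the sketch assumes Spark's convention of doubling a backtick that appears inside a quoted identifier.

```python
def quote_column(name):
    """Wrap a column name in backticks so a dot is not read as struct access.

    Any backtick already inside the name is doubled, which is how a literal
    backtick is written inside a quoted identifier.
    """
    return "`" + name.replace("`", "``") + "`"


quoted = quote_column("col0.1")

# In a notebook this would be used as, for example:
#   df.select(quote_column("col0.1"))
```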
Yogi
by New Contributor III
  • 8113 Views
  • 15 replies
  • 0 kudos

Resolved! Can we pass Databricks output to Azure function body?

Hi, can anyone help me with Databricks and Azure Functions? I'm trying to pass Databricks JSON output to an Azure Function body in an ADF job. Is it possible? If yes, how? If not, what alternative would achieve the same?

Latest Reply
AbhishekNarain_
New Contributor III
  • 0 kudos

You can now pass values back to ADF from a notebook. @Yogi​ There is a size limit, though, so if you are passing a dataset larger than 2MB, write it to storage instead and consume it directly with Azure Functions. You can pass the file path/ refe...

  • 0 kudos
14 More Replies
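
The approach in the latest reply (return small results inline, otherwise return a storage path) might look like the sketch below. `build_exit_value`, the `rows`/`path` keys, and the exact byte threshold are assumptions; the reply only says "2MB". Writing the data to storage is left to the caller.

```python
import json

# Assumption: the "2MB" limit from the reply above, read as 2 * 1024 * 1024 bytes.
MAX_INLINE_BYTES = 2 * 1024 * 1024


def build_exit_value(rows, storage_path):
    """Return a JSON string holding either the data itself or a pointer to it.

    `storage_path` is where the caller has written (or will write) the full
    dataset when it is too large to pass back inline.
    """
    inline = json.dumps({"rows": rows})
    if len(inline.encode("utf-8")) <= MAX_INLINE_BYTES:
        return inline
    return json.dumps({"path": storage_path})


small = build_exit_value([{"id": 1}], "/mnt/out/result.json")  # hypothetical path
large = build_exit_value(["x" * (3 * 1024 * 1024)], "/mnt/out/result.json")

# In a Databricks notebook the string would then be handed back with:
#   dbutils.notebook.exit(small)
```

The Azure Function can then branch on which key is present in the value it receives.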