Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

rammy
by Contributor III
  • 3435 Views
  • 3 replies
  • 11 kudos

How would I retrieve JSON data with namespaces using Spark SQL?

File.json in the code below contains huge JSON data in which each key carries a namespace prefix (this JSON file was converted from an XML file). I am able to retrieve the data when the JSON does not contain namespaces, but what would be the approach to retrieve record...

Latest Reply
SS2
Valued Contributor
  • 11 kudos

In case of a struct you can use dot notation (.) for extracting the value.
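A minimal sketch of that pattern (the file path and the namespaced key names are hypothetical; backticks let Spark SQL address keys that contain a ":" namespace separator):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the converted JSON file (hypothetical path)
df = spark.read.json("/mnt/raw/File.json")

# Backticks quote each namespaced key; the dots between the quoted
# segments walk down into the nested struct fields.
df.selectExpr("`ns:root`.`ns:invoice`.`ns:amount` AS amount").show()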

2 More Replies
hare
by New Contributor III
  • 10213 Views
  • 1 reply
  • 1 kudos

Failed to merge incompatible data types

We process JSON files from the storage location every day, and each file gets archived once its records are appended into the respective tables. source_location_path: "..../mon=05/day=01/fld1" , "..../mon=05/day=01/fld2" ..... "..../mon=05/d...

Latest Reply
Shalabh007
Honored Contributor
  • 1 kudos

@Hare Krishnan​ the issues highlighted can easily be handled by using .option("mergeSchema", "true") at the time of reading all the files. Sample code: spark.read.option("mergeSchema", "true").json(<file paths>, multiLine=True). The only scenario this w...
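A minimal sketch of that read (the paths are hypothetical stand-ins for the partition folders above):

# Read several daily folders at once, merging their schemas
paths = [
    "/mnt/source/mon=05/day=01/fld1",
    "/mnt/source/mon=05/day=01/fld2",
]
df = (spark.read
      .option("mergeSchema", "true")
      .json(paths, multiLine=True))
df.printSchema()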

andreiten
by New Contributor II
  • 5284 Views
  • 1 reply
  • 3 kudos

Is there any example or guideline on how to pass JSON parameters to a pipeline in a Databricks workflow?

I used this source: https://docs.databricks.com/workflows/jobs/jobs.html#:~:text=You%20can%20use%20Run%20Now,different%20values%20for%20existing%20parameters.&text=next%20to%20Run%20Now%20and,on%20the%20type%20of%20task. But there is no example of how...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 3 kudos

Hi @Andre Ten​, that's exactly how you specify the JSON parameters in a Databricks workflow. I have been doing it in the same format and it works for me. I removed the parameters as they are a bit sensitive, but I hope you get the point. Cheers.
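For reference, a minimal sketch of passing JSON parameters when triggering a run through the Jobs API (the job_id, parameter names, and values are hypothetical):

import requests

host = "https://<workspace-url>"
token = "<personal-access-token>"

payload = {
    "job_id": 12345,
    # Keys must match the widgets/parameters the notebook task defines
    "notebook_params": {"env": "dev", "run_date": "2022-12-01"},
}

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(resp.json())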

Kavin
by New Contributor II
  • 1942 Views
  • 1 reply
  • 2 kudos

Issue converting the datasets into JSON

I'm a newbie to Databricks, and I need to convert the datasets into JSON. I tried both FOR JSON AUTO and FOR JSON PATH, however I'm getting an issue - [PARSE_SYNTAX_ERROR] Syntax error at or near 'json'. My query works fine without FOR JSON AUTO and FOR...

Latest Reply
Debayan
Databricks Employee
  • 2 kudos

Hi @Kavin Natarajan​, could you please go through https://www.tutorialkart.com/apache-spark/spark-write-dataset-to-json-file-example/ ? The steps there look okay.
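Worth noting that FOR JSON AUTO / FOR JSON PATH are T-SQL features that Spark SQL does not support, hence the parse error. A minimal sketch of the usual Spark alternatives (the dataframe is hypothetical):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Option 1: one JSON string per row
json_rows = df.toJSON().collect()

# Option 2: collapse all columns into a single JSON column
df_json = df.select(F.to_json(F.struct(*df.columns)).alias("json"))

# Option 3: write the whole dataset out as JSON files
df.write.mode("overwrite").json("/mnt/out/my_dataset")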

hare
by New Contributor III
  • 4407 Views
  • 1 reply
  • 5 kudos

"Databricks" - "PySpark" - Read "JSON" file - Azure Blob container - "APPEND BLOB"

Hi All, we are getting JSON files in an Azure blob container whose "Blob Type" is "Append Blob". We get the error "AnalysisException: Unable to infer schema for JSON. It must be specified manually." when we try to read them using the below-mentioned scr...

Latest Reply
User16856839485
Databricks Employee
  • 5 kudos

There currently does not appear to be direct support for append blob reads; however, converting the append blob to a block blob [and then to parquet or delta, etc.] is a viable option: https://kb.databricks.com/en_US/data-sources/wasb-check-blob-types?_ga...
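A minimal sketch of one way to do that conversion with the azure-storage-blob Python SDK (connection string, container, and blob names are hypothetical; uploads default to block blobs):

from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string("<connection-string>")
src = svc.get_blob_client("landing", "events.json")        # append blob
dst = svc.get_blob_client("landing-block", "events.json")  # block blob

# Download the append blob's bytes and re-upload them; upload_blob
# creates a block blob by default, which Spark can then read.
data = src.download_blob().readall()
dst.upload_blob(data, overwrite=True)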

Data_Engineer3
by Contributor III
  • 6302 Views
  • 4 replies
  • 1 kudos

Unable to read data from Elasticsearch with Spark in Databricks.

When I try to read data from Elasticsearch with Spark SQL, it throws an error like RuntimeException: Error while encoding: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of string...

Latest Reply
Vidula
Honored Contributor
  • 1 kudos

Hi there @KARTHICK N​, hope all is well! Just wanted to check in to see whether you were able to resolve your issue, and whether you would be happy to share the solution or mark an answer as best. Otherwise, please let us know if you need more help. We'd love to hear from you. T...
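For anyone hitting the same JListWrapper error, a sketch of a read using the elasticsearch-hadoop connector (host, index, and field names are hypothetical); the error typically appears when an index field actually holds arrays, which the connector needs to be told about:

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "my-es-host")
      .option("es.port", "9200")
      # Declare fields the index stores as arrays so they are not
      # decoded as plain strings:
      .option("es.read.field.as.array.include", "tags")
      .load("my-index"))
df.show()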

3 More Replies
rdobbss
by New Contributor II
  • 1669 Views
  • 2 replies
  • 0 kudos

RPC disassociate error due to the container threshold being exceeded, and garbage collector errors, when reading a 23 GB multiline JSON file.

I am reading a 23 GB multiline JSON file, flattening it using a UDF, and writing the dataframe as Parquet using PySpark. The cluster I am using has 3 nodes (8 cores) with 64 GB memory, with a limit to scale up to 8 nodes. I am able to process a 7 GB file with no issue, and it takes ar...

Latest Reply
Vidula
Honored Contributor
  • 0 kudos

Hi @Ravi Dobariya​, hope all is well! Just wanted to check in to see whether you were able to resolve your issue, and whether you would be happy to share the solution or mark an answer as best. Otherwise, please let us know if you need more help. We'd love to hear from you. Than...
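One detail worth adding as a sketch (path and schema are hypothetical): a multiline JSON file is not splittable, so the whole 23 GB file is parsed by a single task, which is a common cause of executor loss and GC pressure at this size. Supplying an explicit schema avoids a second full pass for inference, and repartitioning right after the read spreads the UDF work out:

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical top-level schema for the file
schema = StructType([StructField("payload", StringType())])

df = (spark.read
      .schema(schema)               # skip schema inference
      .option("multiLine", True)
      .json("/mnt/raw/big_file.json")
      .repartition(64))             # fan out before the flattening UDF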

1 More Replies
KarimSegura
by New Contributor III
  • 3309 Views
  • 2 replies
  • 4 kudos

databricks-connect throws an exception when showing a dataframe with json content

I'm facing an issue when I want to show a dataframe with JSON content. All this happens when the script runs in databricks-connect from VS Code. Basically, I would like any help or guidance to get this to run as it should. Thanks in advance. This is how...

Latest Reply
KarimSegura
New Contributor III
  • 4 kudos

The code works fine on a Databricks cluster, but this code is part of a unit test in a local environment, then submitted to a branch -> PR -> merged into the master branch. Thanks for the advice on using DBX. I will give DBX a try again even though I've already tried. I'l...
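For local unit tests like this, a common pattern is to run against a plain local SparkSession instead of databricks-connect; a minimal sketch assuming pytest (fixture and test names are hypothetical):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Plain local session; no Databricks cluster involved
    return (SparkSession.builder
            .master("local[2]")
            .appName("unit-tests")
            .getOrCreate())

def test_show_json_dataframe(spark):
    df = spark.read.json(spark.sparkContext.parallelize(['{"a": 1}']))
    assert df.collect()[0]["a"] == 1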

1 More Replies
laus
by New Contributor III
  • 9153 Views
  • 6 replies
  • 3 kudos

Resolved! How to load a json file in pyspark with colon character in file name

Hi, I'm trying to load this JSON file, which contains the colon character in its name: file_name.2022-03-05_11:30:00.json, but I get the error in the screenshot below saying that there is a relative path in an absolute URI. Any idea how to read this file...

Latest Reply
Noopur_Nigam
Databricks Employee
  • 3 kudos

Hi @Laura Blancarte​, I hope that @Pearl Ubaru​'s answer helped you in resolving your issue. Please let us know if you need more help on this.
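For readers with the same error, a sketch of two commonly suggested workarounds (paths are hypothetical): the Hadoop path parser treats ":" as a URI scheme separator, so either match the file with a glob that avoids spelling the colon, or rename it through the local /dbfs mount, which bypasses that parser:

import shutil

# Workaround 1: glob over the offending characters
df = spark.read.json("dbfs:/mnt/data/file_name.2022-03-05_11*30*00.json")

# Workaround 2: rename via the /dbfs FUSE mount, then read normally
shutil.move(
    "/dbfs/mnt/data/file_name.2022-03-05_11:30:00.json",
    "/dbfs/mnt/data/file_name.2022-03-05_11-30-00.json",
)
df = spark.read.json("dbfs:/mnt/data/file_name.2022-03-05_11-30-00.json")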

5 More Replies
Kash
by Contributor III
  • 17883 Views
  • 18 replies
  • 13 kudos

Resolved! HELP! Converting GZ JSON to Delta causes massive CPU spikes and ETLs take days!

Hi there, I was wondering if I could get your advice. We would like to create a bronze delta table using GZ JSON data stored in S3, but each time we attempt to read and write it our cluster's CPU spikes to 100%. We are not doing any transformations but s...

Latest Reply
Kash
Contributor III
  • 13 kudos

Hi Kaniz, thanks for the note, and thank you everyone for the suggestions and help. @Joseph Kambourakis​ I added your suggestion to our load, but I did not see any change in how our data loads or the time it takes to load data. I've done some additional ...
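One structural point that often explains this workload, with a sketch (bucket paths and schema are hypothetical): gzip is not a splittable codec, so each .json.gz file is decompressed and parsed by a single core no matter how large the cluster is, and schema inference adds another full pass over the compressed data. Supplying the schema and keeping input files small helps:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id", LongType()),
    StructField("body", StringType()),
])

(spark.read
 .schema(schema)                      # no inference pass
 .json("s3://bucket/raw/*.json.gz")   # one core per .gz file
 .write.format("delta")
 .mode("append")
 .save("s3://bucket/bronze/events"))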

17 More Replies
MattM
by New Contributor III
  • 2488 Views
  • 0 replies
  • 0 kudos

Unstructured data (PDF) and semi-structured data

I have a scenario where one source is unstructured PDF files and another source is semi-structured JSON files. I get files from these two sources on a daily basis into ADLS storage. What is the best way to load this into a medallion structure by s...
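A minimal sketch of one common bronze-layer approach (paths are hypothetical): land the PDFs as raw bytes with the binaryFile reader and the JSON as parsed rows, each into its own bronze Delta table:

# PDFs: ingest as raw bytes plus file metadata
pdf_bronze = (spark.read.format("binaryFile")
              .load("abfss://landing@acct.dfs.core.windows.net/pdf/"))
pdf_bronze.write.format("delta").mode("append").save("/mnt/bronze/pdf_raw")

# JSON: ingest as parsed records
json_bronze = spark.read.json("abfss://landing@acct.dfs.core.windows.net/json/")
json_bronze.write.format("delta").mode("append").save("/mnt/bronze/json_raw")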

steelman
by New Contributor III
  • 13875 Views
  • 6 replies
  • 8 kudos

Resolved! How to flatten non-standard JSON files in a dataframe

Hello, I have a non-standard JSON file with a nested structure that I have issues with. Here is an example of the JSON file: jsonfile= """[ { "success":true, "numRows":2, "data":{ "58251":{ "invoiceno":"58...

desired format in the dataframe after processing the json file
Latest Reply
Deepak_Bhutada
Contributor III
  • 8 kudos

@stale stokkereit​ You can use the below function to flatten the struct fields:

import pyspark.sql.functions as F

def flatten_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nest...
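A complete version of this commonly shared flatten helper, sketched here since the preview above is truncated:

import pyspark.sql.functions as F

def flatten_df(nested_df):
    # Split columns into already-flat ones and struct columns
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    # Pull every field of every struct up to the top level,
    # prefixing it with the struct's name
    return nested_df.select(
        flat_cols
        + [F.col(nc + '.' + c).alias(nc + '_' + c)
           for nc in nested_cols
           for c in nested_df.select(nc + '.*').columns])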

5 More Replies
Devarsh
by Contributor
  • 9469 Views
  • 3 replies
  • 7 kudos

Resolved! Getting the error 'No such file or directory' when trying to access the JSON file

I am trying to write to my Google Sheet through Databricks, but when it comes to reading the JSON file containing the credentials, I am getting the error that no such file or directory exists. import gspread     gc = gspread.service_account(filename='...

Latest Reply
Noopur_Nigam
Databricks Employee
  • 7 kudos

Hi @Devarsh Shah​, the issue is not with the JSON file but with the location you are specifying while reading. As suggested by @Werner Stinckens​, please start using the Spark API to read the JSON file as below: spark.read.format("json").load("testjson"). Please check ...
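If the file must be read by a plain Python library like gspread rather than Spark, a sketch of the usual fix (the path is hypothetical): files stored in DBFS are visible to local file APIs under the /dbfs mount, so pass that form of the path:

import gspread

# Local-file path to a credentials file uploaded to DBFS
gc = gspread.service_account(filename="/dbfs/FileStore/credentials.json")
sheet = gc.open("my-sheet").sheet1
sheet.update("A1", "hello")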

2 More Replies
repcak
by New Contributor III
  • 5643 Views
  • 4 replies
  • 3 kudos

Resolved! Delta Live Tables with EventHub

Hello, I would like to integrate Databricks Delta Live Tables with Event Hubs, but I cannot install com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17 on a Delta Live Tables cluster. I tried installing it using an init script (by adding it in the JSON cluster settings...

Latest Reply
Atanu
Databricks Employee
  • 3 kudos

I think this has some details: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-kafka-spark-tutorial @Kacper Mucha​, is the issue resolved?
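Following that tutorial's approach, a sketch of reading Event Hubs through its Kafka-compatible endpoint, which avoids installing the azure-eventhubs-spark library entirely (namespace, event hub name, and connection string are hypothetical):

connection = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."

# Databricks ships a shaded Kafka client, hence the class prefix
jaas = ("kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="$ConnectionString" password="{connection}";')

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers",
              "<namespace>.servicebus.windows.net:9093")
      .option("subscribe", "<eventhub-name>")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", jaas)
      .load())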

3 More Replies
AmanSehgal
by Honored Contributor III
  • 7457 Views
  • 1 reply
  • 10 kudos

Resolved! How to merge all the columns into one column as JSON?

I have a task to transform a dataframe: collect all the columns in a row and embed them into a JSON string as a column. Source DF: Target DF:

Latest Reply
AmanSehgal
Honored Contributor III
  • 10 kudos

I was able to do this by converting the df to an RDD and then applying a map function to it: rdd_1 = df.rdd.map(lambda row: (row['ID'], row.asDict())) ...
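The same result can be had without dropping to RDDs, as a sketch (assuming the 'ID' column from the reply above):

import pyspark.sql.functions as F

# Pack every column of the row into one JSON string column
df_json = df.select(
    "ID",
    F.to_json(F.struct(*df.columns)).alias("json"))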
