Data Engineering

Forum Posts

by Jana • New Contributor III

02-15-2022 9:26:54 AM

8703 Views
9 replies
4 kudos

Resolved! Parsing 5 GB json file is running long on cluster

I was creating a Delta table from a JSON input file in ADLS, but the job ran for a long time while creating the table. Below is my cluster configuration. Is the issue related to the cluster config? Do I need to upgrade the cluster config? The cluster ...

Latest Reply

-werners-
Esteemed Contributor III

03-01-2022 12:48:29 AM

4 kudos

With multiline = true, the JSON is read as a whole and processed as such. I'd try with a beefier cluster.
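
The point about multiline can be illustrated with a small sketch in plain Python (not Spark itself). It assumes two tiny sample strings, both invented for illustration: one in JSON Lines layout and one pretty-printed document. Only the first can be parsed one line at a time, which is why a reader has to take a multiline document as a whole.

```python
import json

# JSON Lines: one complete record per line, so any line can be parsed alone.
json_lines = '{"id": 1}\n{"id": 2}\n{"id": 3}'

# A single pretty-printed document: no individual line is valid JSON.
multiline_doc = '[\n  {"id": 1},\n  {"id": 2},\n  {"id": 3}\n]'


def parse_line_by_line(text):
    """Parse each non-empty line independently, as a splittable reader could."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def can_parse_line_by_line(text):
    """Return True if every line is a complete JSON value on its own."""
    try:
        parse_line_by_line(text)
        return True
    except json.JSONDecodeError:
        return False


records = parse_line_by_line(json_lines)
whole_document = json.loads(multiline_doc)  # must be read in one piece
```

Both layouts hold the same records here; the difference is only whether the work can be divided by line.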

8 More Replies

by SRK • Contributor III

10-01-2022 3:15:10 AM

3764 Views
5 replies
7 kudos

How to handle schema validation for Json file. Using Databricks Autoloader?

Following are the details of the requirement: 1. I am using a Databricks notebook to read data from a Kafka topic and write it into an ADLS Gen2 container, i.e., my landing layer. 2. I am using Spark code to read data from Kafka and write into landing...

Latest Reply

maddy08
New Contributor II

10-24-2024 10:01:27 PM

7 kudos

Just to clarify, are you reading from Kafka and writing JSON files into ADLS? That is, does each Kafka message become one JSON file in ADLS?

4 More Replies

by AmineHY • Contributor

11-16-2022 5:24:01 AM

11836 Views
5 replies
6 kudos

Resolved! How to read JSON files embedded in a list of lists?

Hello, I am trying to read this JSON file but haven't succeeded. You can see the head of the file: JSON inside a list of lists. Any idea how to read this file?

Latest Reply

adriennn
Valued Contributor

09-12-2024 10:32:36 PM

6 kudos

The correct way to do this without using open, which works only with local/mounted files, is to read the files as binaryFile; you then get the entire JSON string on each row. From there you can use from_json() and explode() to extract the ...
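
A minimal sketch of the approach described in this reply, under stated assumptions: the file holds a list of lists of flat objects, Spark 3.x is in use, and the names `read_list_of_lists`, `flatten_list_of_lists`, and the sample data are hypothetical. The rest of the original reply is truncated, so this is not necessarily the author's exact code. A plain-Python helper shows the intended result on a string.

```python
import json


def flatten_list_of_lists(text):
    """Plain-Python reference: parse '[[{...}, ...], [...]]' into a flat list."""
    return [record for inner in json.loads(text) for record in inner]


def read_list_of_lists(spark, path):
    """PySpark version; assumes an active SparkSession and Spark 3.x."""
    # Imported here so the plain-Python helper above works without PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, MapType, StringType

    # Assumed schema: a list of lists of objects, all values read as strings.
    schema = ArrayType(ArrayType(MapType(StringType(), StringType())))

    raw = spark.read.format("binaryFile").load(path)  # one row per file
    parsed = raw.select(
        F.from_json(F.col("content").cast("string"), schema).alias("doc")
    )
    # flatten() merges the inner arrays; explode() emits one row per record.
    return parsed.select(F.explode(F.flatten("doc")).alias("record"))


sample = '[[{"a": "1"}, {"a": "2"}], [{"a": "3"}]]'
flat = flatten_list_of_lists(sample)
```

Because binaryFile loads each file into a single row, this only suits files that fit comfortably in one task's memory.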

4 More Replies

by PK225 • New Contributor III

06-07-2023 10:34:46 AM

1575 Views
2 replies
1 kudos

Resolved! When reading a JSON file into a DataFrame, I want to see the data row-wise. What would be the solution?

Latest Reply

Vartika
Databricks Employee

06-09-2023 4:28:34 AM

1 kudos

Hi @Pavan Kumar, Hope you are well. Just wanted to see if you were able to find an answer to your question, and whether you would like to mark an answer as best. It would be really helpful for the other members too. Cheers!

1 More Replies

by BamBam • New Contributor III

05-18-2023 5:18:25 AM

2157 Views
0 replies
0 kudos

Trying to convert STRING column into Array of Structs in SQL statement

I have a STRING column in a DLT table that was loaded using SQL Auto Loader from a JSON file. When I use the "schema_of_json" function in a SQL statement, passing in the literal string from the STRING column, I get this output: ARRAY<STRUCT<firstFetchD...

by rusty9876543 • New Contributor II

04-12-2023 11:30:30 AM

7556 Views
5 replies
2 kudos

Split dataFrame into 1MB chunks and create a single json array with each row in chunk being an array element

Hi, I have a DataFrame that I've been able to convert into a struct, with each row being a JSON object. I want to be able to split the DataFrame into 1MB chunks. Once I have the chunks, I would like to add all rows in each respective chunk into a sin...

Latest Reply

Anonymous
Not applicable

04-16-2023 12:23:34 AM

2 kudos

@Tamoor Mirza: You can use the to_json method of a DataFrame to convert each chunk to a JSON string, and then append those JSON strings to a list. Here is an example code snippet that splits a DataFrame into 1MB chunks and creates a list of JSON arr...
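
The snippet in this reply is cut off, so the following is an independent sketch of the chunking idea in plain Python, not the reply's code. It assumes rows are available as JSON-serializable dicts and treats "1MB" as 1,000,000 bytes; the function name `chunk_rows_as_json_arrays` and the sample rows are hypothetical.

```python
import json


def chunk_rows_as_json_arrays(rows, max_bytes=1_000_000):
    """Group rows into JSON array strings of at most max_bytes UTF-8 bytes.

    A single row larger than max_bytes still gets its own (oversized) chunk.
    """
    chunks, current, size = [], [], 2  # 2 bytes for the enclosing "[" and "]"
    for row in rows:
        encoded = json.dumps(row, separators=(",", ":"))
        row_bytes = len(encoded.encode("utf-8"))
        extra = row_bytes + (1 if current else 0)  # comma between elements
        if current and size + extra > max_bytes:
            # Adding this row would overflow: close the chunk and start anew.
            chunks.append("[" + ",".join(current) + "]")
            current, size, extra = [], 2, row_bytes
        current.append(encoded)
        size += extra
    if current:
        chunks.append("[" + ",".join(current) + "]")
    return chunks


# Small demonstration with a 500-byte limit instead of 1 MB.
rows = [{"id": i, "name": "x" * 10} for i in range(100)]
chunks = chunk_rows_as_json_arrays(rows, max_bytes=500)
```

With a Spark DataFrame, the rows could be fed in with something like `(row.asDict() for row in df.toLocalIterator())`, assuming every column value is JSON-serializable; that streams rows through the driver, so it is only a starting point for modest data sizes.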

4 More Replies

by vicusbass • New Contributor II

04-13-2023 11:54:50 PM

18236 Views
3 replies
1 kudos

How to extract values from JSON array field?

Hi, While creating a SQL notebook, I am struggling to extract some values from a JSON array field. I need to create a view where a field would be an array with values extracted from a field like the one below; specifically, I need the `value` fi...

Latest Reply

vicusbass
New Contributor II

04-14-2023 9:26:46 AM

1 kudos

Maybe I didn't explain it correctly. The JSON snippet from the description is a cell from a row from a table.

2 More Replies

by kk007 • New Contributor III

04-07-2023 10:19:36 AM

3396 Views
4 replies
4 kudos

Photon engine throws error "JSON document exceeded maximum allowed size 400.0 MiB"

I am reading an 83 MB JSON file using "spark.read.json(storage_path)". When I display the data it seems to display fine, but when I try a line count, it complains about the file size being more than 400 MB, which is not true. Photon JSON reader erro...

Latest Reply

Anonymous
Not applicable

04-09-2023 8:47:33 AM

4 kudos

@Kamal Kumar: The error message suggests that the JSON document size is exceeding the maximum allowed size of 400MB. This could be caused by one or more documents in your JSON file being larger than this limit. It is not a bug, but a limitation set ...

3 More Replies

by Galdino • New Contributor II

05-15-2022 2:29:06 PM

5173 Views
3 replies
1 kudos

How to read a json from BytesIO with PySpark?

I want to read JSON from an IO variable using PySpark. My code using pandas:

io = BytesIO()
ftp.retrbinary('RETR ' + file_name, io.write)
io.seek(0)
# With pandas
df = pd.read_json(io)

What I tried using PySpark, which doesn't work: io = BytesIO() ftp.retrbinary('...

Latest Reply

Erik_L
Contributor II

03-16-2023 11:57:30 AM

1 kudos

Just use pandas and follow with spark.createDataFrame(df)
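
A short sketch of this suggestion, assuming the buffer holds a JSON array of objects and that pandas and PySpark are both installed. The function names and the sample bytes are hypothetical stand-ins; the in-memory buffer here replaces the one filled by ftp.retrbinary in the question.

```python
import io
import json


def records_from_buffer(buffer):
    """Parse a JSON array of objects held in an in-memory bytes buffer."""
    buffer.seek(0)
    return json.load(buffer)


def spark_df_from_buffer(spark, buffer):
    """Route suggested in the reply: pandas first, then spark.createDataFrame."""
    import pandas as pd  # imported lazily; needs pandas and PySpark installed

    buffer.seek(0)
    return spark.createDataFrame(pd.read_json(buffer))


# Stand-in for the buffer filled by ftp.retrbinary(...) in the question.
buffer = io.BytesIO(b'[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')
records = records_from_buffer(buffer)
```

The whole payload passes through driver memory either way, so this fits small downloads better than large ones.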

2 More Replies

by Sameer_876675 • New Contributor III

12-07-2022 4:22:17 AM

5286 Views
3 replies
2 kudos

How to efficiently process a 100GiB JSON nested file and store it in Delta?

Hi, I'm a fairly new user and I am using Azure Databricks to process a ~1000GiB nested JSON file containing insurance policy data. I uploaded the JSON file to Azure Data Lake Gen2 storage and read the JSON file into a DataFrame. df=spark.read.option("...

Latest Reply

Annapurna_Hiriy
Databricks Employee

01-31-2023 8:20:49 AM

2 kudos

Hi Sameer, please refer to the following documents on how to work with nested JSON:
https://docs.databricks.com/optimizations/semi-structured.html
https://learn.microsoft.com/en-us/azure/databricks/kb/_static/notebooks/scala/nested-json-to-dataframe.html

2 More Replies

by AndriusVitkausk • New Contributor III

12-07-2022 5:31:55 AM

1641 Views
1 reply
0 kudos

Reading multi-dimensional json files

So I've been having some issues reading a JSON file that's been provided to the business with another nesting layer, so instead of the JSON being an 'array of objects' -> [ {} ,{} ,{} ], it's an 'array of arrays of objects' -> [ [ {}, {} ,{} ], [ {} ,{}...

Latest Reply

ashish1
New Contributor III

01-30-2023 1:20:09 PM

0 kudos

You can use the explode function to flatten the array to rows. Can you post a simple example of your data?

by dulu • New Contributor III

12-09-2022 7:38:27 PM

3800 Views
2 replies
6 kudos

Is there a function similar to split_part, json_extract_scalar?

I am using spark_sql version 3.2.1. Is there a function that can replace split_part and json_extract_scalar, or not?

Latest Reply

Ankush
New Contributor II

12-10-2022 3:55:00 AM

6 kudos

pyspark.sql.functions.get_json_object(col, path): Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. It returns null if the input JSON string is invalid.
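
As a rough sketch of how this could be applied, the block below pairs get_json_object with split() and element_at() as a possible stand-in for split_part. The column names `path` and `payload`, the "/" delimiter, the `$.user.name` path, and all function names are hypothetical. The two plain-Python helpers only describe the behaviour being aimed for and are not Spark functions.

```python
import json


def split_part(text, delimiter, index):
    """Plain-Python reference: 1-based part lookup, None when out of range."""
    parts = text.split(delimiter)
    return parts[index - 1] if 1 <= index <= len(parts) else None


def json_extract_scalar(text, *keys):
    """Plain-Python reference: walk nested keys, None on bad JSON or missing key."""
    try:
        value = json.loads(text)
        for key in keys:
            value = value[key]
    except (ValueError, KeyError, IndexError, TypeError):
        return None
    return value


def with_extracted_columns(df):
    """PySpark equivalents; 'path' and 'payload' are hypothetical column names."""
    from pyspark.sql import functions as F  # lazy import: needs PySpark

    return df.select(
        # split() takes a regex pattern; element_at() is 1-based like split_part.
        F.element_at(F.split(F.col("path"), "/"), 2).alias("second_part"),
        # JSONPath-style lookup; yields null when the JSON string is invalid.
        F.get_json_object(F.col("payload"), "$.user.name").alias("user_name"),
    )
```

Since split() interprets its delimiter as a regular expression, delimiters such as "." or "|" would need escaping.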

1 More Replies

by Gilg • Contributor II

12-13-2022 9:07:10 PM

5680 Views
4 replies
5 kudos

Avro Deserialization from Event Hub capture and Autoloader

Hi All, I am getting data from Event Hub capture in Avro format and using Auto Loader to process it. I got to the point where I can read the Avro by casting the Body to a string. Now I want to deserialize the Body column so it will be in table forma...

Latest Reply

UmaMahesh1
Honored Contributor III

12-13-2022 9:43:46 PM

5 kudos

If you still want to go with the above approach and don't want to provide schema manually, then you can fetch a tiny batch with 1 record and build the schema into a variable using a .schema option. Once done, you can add a new Body column by providin...

3 More Replies

by antonyj453 • New Contributor II

12-12-2022 7:01:36 AM

2639 Views
1 reply
3 kudos

How to extract JSON object from a pyspark data frame. I was able to extract data from another column which in array format using "Explode" function, but Explode is not working for Object type. Its returning with type mismatch error.

I have tried the code below to extract data that is in an array: df2 = df_deidentifieddocuments_tst.select(F.explode('annotationId').alias('annotationId')).select('annotationId.$oid') It was working fine, but it's not working for the JSON object type. Below is colu...

Latest Reply

UmaMahesh1
Honored Contributor III

12-12-2022 11:28:56 PM

3 kudos

Did you try extracting that column data using the from_json function?

by Sujitha • Databricks Employee

12-09-2022 12:20:05 AM

2248 Views
6 replies
5 kudos

KB Feedback Discussion In addition to the Databricks Community, we have a Support team that maintains a Knowledge Base (KB). The KB contains answers ...

KB Feedback Discussion In addition to the Databricks Community, we have a Support team that maintains a Knowledge Base (KB). The KB contains answers to common questions about Databricks, as well as information on optimisation and troubleshooting. Thes...

Latest Reply

Ajay-Pandey
Esteemed Contributor III

12-09-2022 1:03:07 AM

5 kudos

Thanks for sharing @Sujitha Ramamoorthy

5 More Replies