cancel
Showing results for 
Search instead for 
Did you mean: 
Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.
cancel
Showing results for 
Search instead for 
Did you mean: 

Created nested struct schema SPARK - Schema Jira

weldermartins
Honored Contributor

Hello guys,

I'm using Jira API to return "ISSUES". But to be able to use pyspark I need to create the Dataframe passing in the Schema. But I am not able to create the Schema based on the model below. Would you have any ideas?

root
 |-- expand: string (nullable = true)
 |-- fields: struct (nullable = true)
 |    |-- aggregateprogress: struct (nullable = true)
 |    |    |-- progress: long (nullable = true)
 |    |    |-- total: long (nullable = true)
 |    |-- aggregatetimeestimate: string (nullable = true)
 |    |-- aggregatetimeoriginalestimate: string (nullable = true)
 |    |-- aggregatetimespent: string (nullable = true)
 |    |-- assignee: string (nullable = true)
 |    |-- attachment: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- comments: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- author: struct (nullable = true)
 |    |    |    |    |    |-- accountId: string (nullable = true)
 |    |    |    |    |    |-- accountType: string (nullable = true)
 |    |    |    |    |    |-- active: boolean (nullable = true)
 |    |    |    |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |    |    |    |-- 48x48: string (nullable = true)
 |    |    |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |    |    |-- emailAddress: string (nullable = true)
 |    |    |    |    |    |-- self: string (nullable = true)
 |    |    |    |    |    |-- timeZone: string (nullable = true)
 |    |    |    |    |-- body: struct (nullable = true)
 |    |    |    |    |    |-- content: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- content: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- version: long (nullable = true)
 |    |    |    |    |-- created: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- jsdPublic: boolean (nullable = true)
 |    |    |    |    |-- self: string (nullable = true)
 |    |    |    |    |-- updateAuthor: struct (nullable = true)
 |    |    |    |    |    |-- accountId: string (nullable = true)
 |    |    |    |    |    |-- accountType: string (nullable = true)
 |    |    |    |    |    |-- active: boolean (nullable = true)
 |    |    |    |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |    |    |    |-- 48x48: string (nullable = true)
 |    |    |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |    |    |-- emailAddress: string (nullable = true)
 |    |    |    |    |    |-- self: string (nullable = true)
 |    |    |    |    |    |-- timeZone: string (nullable = true)
 |    |    |    |    |-- updated: string (nullable = true)
 |    |    |-- maxResults: long (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- startAt: long (nullable = true)
 |    |    |-- total: long (nullable = true)
 |    |-- components: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- created: string (nullable = true)
 |    |-- creator: struct (nullable = true)
 |    |    |-- accountId: string (nullable = true)
 |    |    |-- accountType: string (nullable = true)
 |    |    |-- active: boolean (nullable = true)
 |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |-- 48x48: string (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |    |    |-- emailAddress: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- timeZone: string (nullable = true)
 |    |-- customfield_10001: string (nullable = true)
 |    |-- customfield_10002: string (nullable = true)
 |    |-- customfield_10003: string (nullable = true)
 |    |-- customfield_10004: string (nullable = true)
 |    |-- customfield_10005: string (nullable = true)
 |    |-- customfield_10006: string (nullable = true)
 |    |-- customfield_10007: string (nullable = true)
 |    |-- customfield_10008: string (nullable = true)
 |    |-- customfield_10009: string (nullable = true)
 |    |-- customfield_10010: string (nullable = true)
 |    |-- customfield_10014: string (nullable = true)
 |    |-- customfield_10015: string (nullable = true)
 |    |-- customfield_10016: string (nullable = true)
 |    |-- customfield_10017: string (nullable = true)
 |    |-- customfield_10018: struct (nullable = true)
 |    |    |-- hasEpicLinkFieldDependency: boolean (nullable = true)
 |    |    |-- nonEditableReason: struct (nullable = true)
 |    |    |    |-- message: string (nullable = true)
 |    |    |    |-- reason: string (nullable = true)
 |    |    |-- showField: boolean (nullable = true)
 |    |-- customfield_10019: string (nullable = true)
 |    |-- customfield_10020: string (nullable = true)
 |    |-- customfield_10021: string (nullable = true)
 |    |-- customfield_10022: string (nullable = true)
 |    |-- customfield_10023: string (nullable = true)
 |    |-- customfield_10024: string (nullable = true)
 |    |-- customfield_10025: string (nullable = true)
 |    |-- customfield_10026: string (nullable = true)
 |    |-- customfield_10027: string (nullable = true)
 |    |-- customfield_10028: string (nullable = true)
 |    |-- customfield_10029: string (nullable = true)
 |    |-- customfield_10030: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- duedate: string (nullable = true)
 |    |-- environment: string (nullable = true)
 |    |-- fixVersions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- issuelinks: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- issuerestriction: struct (nullable = true)
 |    |    |-- shouldDisplay: boolean (nullable = true)
 |    |-- issuetype: struct (nullable = true)
 |    |    |-- avatarId: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- entityId: string (nullable = true)
 |    |    |-- hierarchyLevel: long (nullable = true)
 |    |    |-- iconUrl: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- subtask: boolean (nullable = true)
 |    |-- labels: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- lastViewed: string (nullable = true)
 |    |-- priority: struct (nullable = true)
 |    |    |-- iconUrl: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |-- progress: struct (nullable = true)
 |    |    |-- progress: long (nullable = true)
 |    |    |-- total: long (nullable = true)
 |    |-- project: struct (nullable = true)
 |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |-- 48x48: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- projectTypeKey: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- simplified: boolean (nullable = true)
 |    |-- reporter: struct (nullable = true)
 |    |    |-- accountId: string (nullable = true)
 |    |    |-- accountType: string (nullable = true)
 |    |    |-- active: boolean (nullable = true)
 |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |-- 48x48: string (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |    |    |-- emailAddress: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- timeZone: string (nullable = true)
 |    |-- resolution: string (nullable = true)
 |    |-- resolutiondate: string (nullable = true)
 |    |-- security: string (nullable = true)
 |    |-- status: struct (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- iconUrl: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 
 |-- id: string (nullable = true)
 |-- key: string (nullable = true)
 |-- self: string (nullable = true)

1 ACCEPTED SOLUTION

Accepted Solutions

Now it's working, when the message returned that it was not parallelized I searched and found the answer. When creating the Dataframe I changed it to:

@Werner Stinckens​  Thanks for the support.

df = spark.read.json(sc.parallelize([answer.text]))

View solution in original post

17 REPLIES 17

weldermartins
Honored Contributor

@Werner Stinckens​ or @Hubert Dudek​ Could you help me?

I don't want all the information, just some. However, I can only do it in a static file.

-werners-
Esteemed Contributor III

you want help on how to define the schema?

Yes, it is returning null values ​​as in the example I showed above.

weldermartins
Honored Contributor

@Werner Stinckens​  If you look at the Schema that was shown above, it has many levels and sub-levels, like: Struct, Array. In this Schema I created is returning only null values, I don't know where I'm going wrong.

schema = StructType([ 
    StructField('fields', StructType([ 
       StructField('comment', StructType([  
        StructField("comments",ArrayType( StructField('body', StringType())),True),
       ])),
    ])),          
    StructField('id', StringType()), 
    StructField('key', StringType()), 
    StructField('self', StringType())
])  
 
 
 
df = spark.createDataFrame([response],schema)
 
 
 
 
 
 
 
df = df.withColumn("fields", explode((("fields"))))\ 
.withColumn("comment", explode((("fields.comment"))))\ 
.withColumn("comments", explode((("comment.comments"))))
 

-werners-
Esteemed Contributor III

Well, your schema seems ok, but I can't tell without the data itself.

Can you read the JSON files from JIRA with schema inference and then compare?

I did the same, copied the json return and saved it to a file to check the schema. But it is not returning the comment data that is in the body.

jira 

-werners-
Esteemed Contributor III

I mean a printschema or something of the json when you read it in a df.

Do you see all the data when you read the json with schema inference?

The file is not static, the information is returned from the API.

jora2

-werners-
Esteemed Contributor III

I'd land the json first (coming from the REST call), and then process it.

Do you now call the API using 'request' or something similar?

includes a picture in your post above, see please.

-werners-
Esteemed Contributor III

save the response as a json on your datalake, read it with spark and you have your schema.

I had already done that, but I would like to consume this data in the Dataframe do the transformations and then save the data in a DB.

-werners-
Esteemed Contributor III

yes, but hence my question: what happens if you read the json with schema inference?

Does that work?

If the JSON files can have a different schema, it is a good idea to use schema inference.

Got it, I'll do it this way. And I get back to you, thank you very much.