cancel
Showing results for 
Search instead for 
Did you mean: 
Machine Learning
cancel
Showing results for 
Search instead for 
Did you mean: 

Created nested struct schema SPARK - Schema Jira

weldermartins
Honored Contributor

Hello guys,

I'm using Jira API to return "ISSUES". But to be able to use pyspark I need to create the Dataframe passing in the Schema. But I am not able to create the Schema based on the model below. Would you have any ideas?

root
 |-- expand: string (nullable = true)
 |-- fields: struct (nullable = true)
 |    |-- aggregateprogress: struct (nullable = true)
 |    |    |-- progress: long (nullable = true)
 |    |    |-- total: long (nullable = true)
 |    |-- aggregatetimeestimate: string (nullable = true)
 |    |-- aggregatetimeoriginalestimate: string (nullable = true)
 |    |-- aggregatetimespent: string (nullable = true)
 |    |-- assignee: string (nullable = true)
 |    |-- attachment: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- comments: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- author: struct (nullable = true)
 |    |    |    |    |    |-- accountId: string (nullable = true)
 |    |    |    |    |    |-- accountType: string (nullable = true)
 |    |    |    |    |    |-- active: boolean (nullable = true)
 |    |    |    |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |    |    |    |-- 48x48: string (nullable = true)
 |    |    |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |    |    |-- emailAddress: string (nullable = true)
 |    |    |    |    |    |-- self: string (nullable = true)
 |    |    |    |    |    |-- timeZone: string (nullable = true)
 |    |    |    |    |-- body: struct (nullable = true)
 |    |    |    |    |    |-- content: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- content: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- version: long (nullable = true)
 |    |    |    |    |-- created: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- jsdPublic: boolean (nullable = true)
 |    |    |    |    |-- self: string (nullable = true)
 |    |    |    |    |-- updateAuthor: struct (nullable = true)
 |    |    |    |    |    |-- accountId: string (nullable = true)
 |    |    |    |    |    |-- accountType: string (nullable = true)
 |    |    |    |    |    |-- active: boolean (nullable = true)
 |    |    |    |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |    |    |    |-- 48x48: string (nullable = true)
 |    |    |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |    |    |-- emailAddress: string (nullable = true)
 |    |    |    |    |    |-- self: string (nullable = true)
 |    |    |    |    |    |-- timeZone: string (nullable = true)
 |    |    |    |    |-- updated: string (nullable = true)
 |    |    |-- maxResults: long (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- startAt: long (nullable = true)
 |    |    |-- total: long (nullable = true)
 |    |-- components: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- created: string (nullable = true)
 |    |-- creator: struct (nullable = true)
 |    |    |-- accountId: string (nullable = true)
 |    |    |-- accountType: string (nullable = true)
 |    |    |-- active: boolean (nullable = true)
 |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |-- 48x48: string (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |    |    |-- emailAddress: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- timeZone: string (nullable = true)
 |    |-- customfield_10001: string (nullable = true)
 |    |-- customfield_10002: string (nullable = true)
 |    |-- customfield_10003: string (nullable = true)
 |    |-- customfield_10004: string (nullable = true)
 |    |-- customfield_10005: string (nullable = true)
 |    |-- customfield_10006: string (nullable = true)
 |    |-- customfield_10007: string (nullable = true)
 |    |-- customfield_10008: string (nullable = true)
 |    |-- customfield_10009: string (nullable = true)
 |    |-- customfield_10010: string (nullable = true)
 |    |-- customfield_10014: string (nullable = true)
 |    |-- customfield_10015: string (nullable = true)
 |    |-- customfield_10016: string (nullable = true)
 |    |-- customfield_10017: string (nullable = true)
 |    |-- customfield_10018: struct (nullable = true)
 |    |    |-- hasEpicLinkFieldDependency: boolean (nullable = true)
 |    |    |-- nonEditableReason: struct (nullable = true)
 |    |    |    |-- message: string (nullable = true)
 |    |    |    |-- reason: string (nullable = true)
 |    |    |-- showField: boolean (nullable = true)
 |    |-- customfield_10019: string (nullable = true)
 |    |-- customfield_10020: string (nullable = true)
 |    |-- customfield_10021: string (nullable = true)
 |    |-- customfield_10022: string (nullable = true)
 |    |-- customfield_10023: string (nullable = true)
 |    |-- customfield_10024: string (nullable = true)
 |    |-- customfield_10025: string (nullable = true)
 |    |-- customfield_10026: string (nullable = true)
 |    |-- customfield_10027: string (nullable = true)
 |    |-- customfield_10028: string (nullable = true)
 |    |-- customfield_10029: string (nullable = true)
 |    |-- customfield_10030: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- duedate: string (nullable = true)
 |    |-- environment: string (nullable = true)
 |    |-- fixVersions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- issuelinks: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- issuerestriction: struct (nullable = true)
 |    |    |-- shouldDisplay: boolean (nullable = true)
 |    |-- issuetype: struct (nullable = true)
 |    |    |-- avatarId: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- entityId: string (nullable = true)
 |    |    |-- hierarchyLevel: long (nullable = true)
 |    |    |-- iconUrl: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- subtask: boolean (nullable = true)
 |    |-- labels: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- lastViewed: string (nullable = true)
 |    |-- priority: struct (nullable = true)
 |    |    |-- iconUrl: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |-- progress: struct (nullable = true)
 |    |    |-- progress: long (nullable = true)
 |    |    |-- total: long (nullable = true)
 |    |-- project: struct (nullable = true)
 |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |-- 48x48: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- projectTypeKey: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- simplified: boolean (nullable = true)
 |    |-- reporter: struct (nullable = true)
 |    |    |-- accountId: string (nullable = true)
 |    |    |-- accountType: string (nullable = true)
 |    |    |-- active: boolean (nullable = true)
 |    |    |-- avatarUrls: struct (nullable = true)
 |    |    |    |-- 16x16: string (nullable = true)
 |    |    |    |-- 24x24: string (nullable = true)
 |    |    |    |-- 32x32: string (nullable = true)
 |    |    |    |-- 48x48: string (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |    |    |-- emailAddress: string (nullable = true)
 |    |    |-- self: string (nullable = true)
 |    |    |-- timeZone: string (nullable = true)
 |    |-- resolution: string (nullable = true)
 |    |-- resolutiondate: string (nullable = true)
 |    |-- security: string (nullable = true)
 |    |-- status: struct (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- iconUrl: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 
 |-- id: string (nullable = true)
 |-- key: string (nullable = true)
 |-- self: string (nullable = true)

1 ACCEPTED SOLUTION

Accepted Solutions

Now it's working, when the message returned that it was not parallelized I searched and found the answer. When creating the Dataframe I changed it to:

@Werner Stinckens​  Thanks for the support.

df = spark.read.json(sc.parallelize([answer.text]))

View solution in original post

17 REPLIES 17

weldermartins
Honored Contributor

@Werner Stinckens​ or @Hubert Dudek​ Could you help me?

I don't want all the information, just some. However, I can only do it in a static file.

-werners-
Esteemed Contributor III

you want help on how to define the schema?

Yes, it is returning null values ​​as in the example I showed above.

weldermartins
Honored Contributor

@Werner Stinckens​  If you look at the Schema that was shown above, it has many levels and sub-levels, like: Struct, Array. In this Schema I created is returning only null values, I don't know where I'm going wrong.

schema = StructType([ 
    StructField('fields', StructType([ 
       StructField('comment', StructType([  
        StructField("comments",ArrayType( StructField('body', StringType())),True),
       ])),
    ])),          
    StructField('id', StringType()), 
    StructField('key', StringType()), 
    StructField('self', StringType())
])  
 
 
 
df = spark.createDataFrame([response],schema)
 
 
 
 
 
 
 
df = df.withColumn("fields", explode((("fields"))))\ 
.withColumn("comment", explode((("fields.comment"))))\ 
.withColumn("comments", explode((("comment.comments"))))
 

-werners-
Esteemed Contributor III

Well, your schema seems ok, but I can't tell without the data itself.

Can you read the JSON files from JIRA with schema inference and then compare?

I did the same, copied the json return and saved it to a file to check the schema. But it is not returning the comment data that is in the body.

jira 

-werners-
Esteemed Contributor III

I mean a printschema or something of the json when you read it in a df.

Do you see all the data when you read the json with schema inference?

The file is not static, the information is returned from the API.

jora2

-werners-
Esteemed Contributor III

I'd land the json first (coming from the REST call), and then process it.

Do you now call the API using 'request' or something similar?

includes a picture in your post above, see please.

-werners-
Esteemed Contributor III

save the response as a json on your datalake, read it with spark and you have your schema.

I had already done that, but I would like to consume this data in the Dataframe do the transformations and then save the data in a DB.

-werners-
Esteemed Contributor III

yes, but hence my question: what happens if you read the json with schema inference?

Does that work?

If the JSON files can have a different schema, it is a good idea to use schema inference.

Got it, I'll do it this way. And I get back to you, thank you very much.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.