10-08-2022 06:20 AM
Hello guys,
I'm using the Jira API to return issues. To work with the result in PySpark I need to create the DataFrame by passing in a schema, but I am not able to build a schema that matches the model below. Would you have any ideas?
root
|-- expand: string (nullable = true)
|-- fields: struct (nullable = true)
| |-- aggregateprogress: struct (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- aggregatetimeestimate: string (nullable = true)
| |-- aggregatetimeoriginalestimate: string (nullable = true)
| |-- aggregatetimespent: string (nullable = true)
| |-- assignee: string (nullable = true)
| |-- attachment: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- comment: struct (nullable = true)
| | |-- comments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- author: struct (nullable = true)
| | | | | |-- accountId: string (nullable = true)
| | | | | |-- accountType: string (nullable = true)
| | | | | |-- active: boolean (nullable = true)
| | | | | |-- avatarUrls: struct (nullable = true)
| | | | | | |-- 16x16: string (nullable = true)
| | | | | | |-- 24x24: string (nullable = true)
| | | | | | |-- 32x32: string (nullable = true)
| | | | | | |-- 48x48: string (nullable = true)
| | | | | |-- displayName: string (nullable = true)
| | | | | |-- emailAddress: string (nullable = true)
| | | | | |-- self: string (nullable = true)
| | | | | |-- timeZone: string (nullable = true)
| | | | |-- body: struct (nullable = true)
| | | | | |-- content: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- content: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- text: string (nullable = true)
| | | | | | | | | |-- type: string (nullable = true)
| | | | | | | |-- type: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- version: long (nullable = true)
| | | | |-- created: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- jsdPublic: boolean (nullable = true)
| | | | |-- self: string (nullable = true)
| | | | |-- updateAuthor: struct (nullable = true)
| | | | | |-- accountId: string (nullable = true)
| | | | | |-- accountType: string (nullable = true)
| | | | | |-- active: boolean (nullable = true)
| | | | | |-- avatarUrls: struct (nullable = true)
| | | | | | |-- 16x16: string (nullable = true)
| | | | | | |-- 24x24: string (nullable = true)
| | | | | | |-- 32x32: string (nullable = true)
| | | | | | |-- 48x48: string (nullable = true)
| | | | | |-- displayName: string (nullable = true)
| | | | | |-- emailAddress: string (nullable = true)
| | | | | |-- self: string (nullable = true)
| | | | | |-- timeZone: string (nullable = true)
| | | | |-- updated: string (nullable = true)
| | |-- maxResults: long (nullable = true)
| | |-- self: string (nullable = true)
| | |-- startAt: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- components: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- created: string (nullable = true)
| |-- creator: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- customfield_10001: string (nullable = true)
| |-- customfield_10002: string (nullable = true)
| |-- customfield_10003: string (nullable = true)
| |-- customfield_10004: string (nullable = true)
| |-- customfield_10005: string (nullable = true)
| |-- customfield_10006: string (nullable = true)
| |-- customfield_10007: string (nullable = true)
| |-- customfield_10008: string (nullable = true)
| |-- customfield_10009: string (nullable = true)
| |-- customfield_10010: string (nullable = true)
| |-- customfield_10014: string (nullable = true)
| |-- customfield_10015: string (nullable = true)
| |-- customfield_10016: string (nullable = true)
| |-- customfield_10017: string (nullable = true)
| |-- customfield_10018: struct (nullable = true)
| | |-- hasEpicLinkFieldDependency: boolean (nullable = true)
| | |-- nonEditableReason: struct (nullable = true)
| | | |-- message: string (nullable = true)
| | | |-- reason: string (nullable = true)
| | |-- showField: boolean (nullable = true)
| |-- customfield_10019: string (nullable = true)
| |-- customfield_10020: string (nullable = true)
| |-- customfield_10021: string (nullable = true)
| |-- customfield_10022: string (nullable = true)
| |-- customfield_10023: string (nullable = true)
| |-- customfield_10024: string (nullable = true)
| |-- customfield_10025: string (nullable = true)
| |-- customfield_10026: string (nullable = true)
| |-- customfield_10027: string (nullable = true)
| |-- customfield_10028: string (nullable = true)
| |-- customfield_10029: string (nullable = true)
| |-- customfield_10030: string (nullable = true)
| |-- description: string (nullable = true)
| |-- duedate: string (nullable = true)
| |-- environment: string (nullable = true)
| |-- fixVersions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- issuelinks: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- issuerestriction: struct (nullable = true)
| | |-- shouldDisplay: boolean (nullable = true)
| |-- issuetype: struct (nullable = true)
| | |-- avatarId: long (nullable = true)
| | |-- description: string (nullable = true)
| | |-- entityId: string (nullable = true)
| | |-- hierarchyLevel: long (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- subtask: boolean (nullable = true)
| |-- labels: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- lastViewed: string (nullable = true)
| |-- priority: struct (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| |-- progress: struct (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- project: struct (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- key: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- projectTypeKey: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- simplified: boolean (nullable = true)
| |-- reporter: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- resolution: string (nullable = true)
| |-- resolutiondate: string (nullable = true)
| |-- security: string (nullable = true)
| |-- status: struct (nullable = true)
| | |-- description: string (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
|-- id: string (nullable = true)
|-- key: string (nullable = true)
|-- self: string (nullable = true)
10-11-2022 04:19 AM
It's working now. When the error message said the input was not parallelized, I searched and found the answer. When creating the DataFrame I changed the call to:
df = spark.read.json(sc.parallelize([answer.text]))
@Werner Stinckens Thanks for the support.
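For anyone following along, here is a minimal sketch of the approach above. `spark.read.json` accepts an RDD of JSON strings, so the response text is parsed once to confirm it is valid JSON and then distributed with `parallelize`. The helper names `as_json_records` and `read_json_text` are hypothetical, and `spark` is assumed to be the session that a Databricks notebook already provides.

```python
import json


def as_json_records(text):
    """Return one compact JSON string per top-level record in the response body."""
    parsed = json.loads(text)  # raises ValueError early if the body is not JSON
    records = parsed if isinstance(parsed, list) else [parsed]
    return [json.dumps(record) for record in records]


def read_json_text(spark, text):
    """Build a DataFrame from a JSON string, letting Spark infer the schema."""
    # spark.read.json takes a path or an RDD of JSON strings, not a plain Python str
    rdd = spark.sparkContext.parallelize(as_json_records(text))
    return spark.read.json(rdd)
```

In a notebook this would be called as `df = read_json_text(spark, answer.text)`, followed by `df.printSchema()` to inspect what was inferred.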
10-08-2022 06:39 AM
@Werner Stinckens or @Hubert Dudek Could you help me?
I don't want all of the information, only some of the fields. However, so far I can only do it with a static file.
10-10-2022 01:24 AM
Do you want help with defining the schema?
10-10-2022 05:44 AM
Yes, it is returning null values as in the example I showed above.
10-10-2022 04:53 AM
@Werner Stinckens If you look at the schema shown above, it has many levels and sub-levels of structs and arrays. The schema I created returns only null values, and I don't know where I'm going wrong.
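from pyspark.sql.functions import explode
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Comment body as in the inferred schema above: body.content[].content[].text
text_node = StructType([StructField('text', StringType()), StructField('type', StringType())])
paragraph = StructType([StructField('content', ArrayType(text_node)), StructField('type', StringType())])
body = StructType([StructField('content', ArrayType(paragraph)), StructField('type', StringType()),
                   StructField('version', LongType())])
schema = StructType([
    StructField('fields', StructType([
        StructField('comment', StructType([
            # ArrayType takes the element type (a StructType), not a bare StructField
            StructField('comments', ArrayType(StructType([StructField('body', body)])), True),
        ])),
    ])),
    StructField('id', StringType()),
    StructField('key', StringType()),
    StructField('self', StringType())
])
# response must be the parsed JSON (a dict), not the raw HTTP response object
df = spark.createDataFrame([response], schema)
# explode only applies to arrays; nested struct fields are reached with dot notation
df = df.withColumn("comments", explode("fields.comment.comments"))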
10-10-2022 05:01 AM
Well, your schema seems ok, but I can't tell without the data itself.
Can you read the JSON files from JIRA with schema inference and then compare?
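One way to do that comparison, as a rough sketch: walk both schemas in the dict form returned by `StructType.jsonValue()` and report every leaf of the hand-written schema that is missing from, or typed differently in, the inferred one. The function names `field_paths` and `schema_diff` are made up for this illustration.

```python
def field_paths(node, prefix=""):
    """Yield (path, type_name) for every leaf of a schema in jsonValue() dict form."""
    if isinstance(node, str):  # primitive type such as "string" or "long"
        yield prefix, node
    elif node["type"] == "struct":
        for field in node["fields"]:
            child = f"{prefix}.{field['name']}" if prefix else field["name"]
            yield from field_paths(field["type"], child)
    elif node["type"] == "array":
        # mark array levels with [] so they stand out in the path
        yield from field_paths(node["elementType"], prefix + "[]")
    else:  # other complex types (for example map) are reported as-is
        yield prefix, node["type"]


def schema_diff(inferred, handwritten):
    """List the leaves of `handwritten` that do not line up with `inferred`."""
    inferred_paths = dict(field_paths(inferred))
    problems = []
    for path, type_name in field_paths(handwritten):
        if path not in inferred_paths:
            problems.append(f"{path}: no such leaf in inferred schema")
        elif inferred_paths[path] != type_name:
            problems.append(f"{path}: declared {type_name}, inferred {inferred_paths[path]}")
    return problems
```

With an inferred DataFrame `inferred_df` (an assumed name) and the hand-written `schema`, the call would be `schema_diff(inferred_df.schema.jsonValue(), schema.jsonValue())`. A field declared as a string where the data holds a struct shows up as a missing leaf, because the inferred schema only has leaves below it.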
10-10-2022 05:08 AM
10-10-2022 05:15 AM
I mean a printSchema() or something similar of the JSON when you read it into a DataFrame.
Do you see all the data when you read the json with schema inference?
10-10-2022 05:18 AM
10-10-2022 05:21 AM
I'd land the JSON first (coming from the REST call), and then process it.
Are you currently calling the API using 'requests' or something similar?
10-10-2022 05:24 AM
I included a picture in my post above, please take a look.
10-10-2022 05:25 AM
Save the response as a JSON file on your data lake, read it with Spark, and you have your schema.
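A minimal sketch of that landing step, assuming the response body is available as a string and that the driver can write to the target path. The helper names `land_response` and `read_landed` are hypothetical.

```python
import os


def land_response(text, path):
    """Write the raw response body to `path`, creating parent folders if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


def read_landed(spark, path):
    """Read the landed file with schema inference."""
    # multiLine=True lets Spark parse a file that holds one JSON document
    # spread over several lines (for example a pretty-printed response)
    return spark.read.json(path, multiLine=True)
```

On Databricks, one option is to write through the `/dbfs/...` mount and read the same file back as `dbfs:/...`; both locations are illustrative assumptions, not paths from this thread. After `df = read_landed(spark, ...)`, `df.schema` holds the inferred schema.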
10-10-2022 05:31 AM
I had already done that, but I would like to load this data into a DataFrame, do the transformations, and then save the data in a DB.
10-10-2022 05:38 AM
yes, but hence my question: what happens if you read the json with schema inference?
Does that work?
If the JSON files can have a different schema, it is a good idea to use schema inference.
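When the schema is inferred and can vary between responses, selecting only the fields you need gets a little fragile, because a missing struct field makes the select fail. A rough sketch of a tolerant select follows: it checks each dotted path against the inferred schema (in `jsonValue()` dict form) and keeps only the ones that exist. `has_path` and `select_available` are hypothetical helpers, paths that pass through arrays are deliberately not handled, and field names that need backticks (such as `16x16`) are out of scope.

```python
def has_path(schema, dotted):
    """True if `dotted` (e.g. 'fields.status.id') resolves through nested structs."""
    node = schema
    for name in dotted.split("."):
        if not isinstance(node, dict) or node.get("type") != "struct":
            return False  # reached a primitive or an array before the path ended
        match = [field for field in node["fields"] if field["name"] == name]
        if not match:
            return False
        node = match[0]["type"]
    return True


def select_available(df, wanted):
    """Select `{alias: dotted_path}` pairs, skipping paths absent from df's schema."""
    schema = df.schema.jsonValue()
    exprs = [f"{path} AS {alias}" for alias, path in wanted.items() if has_path(schema, path)]
    return df.selectExpr(*exprs)
```

For example, `select_available(df, {"issue_key": "key", "status_id": "fields.status.id"})` would return just those two columns when both are present in the inferred schema.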
10-10-2022 05:47 AM
Got it, I'll do it this way and get back to you. Thank you very much.