2 weeks ago
We need to import a large amount of Jira data into Databricks and should import only the delta changes. What's the best approach: use the Fivetran Jira connector or develop our own Python scripts/pipeline code? Thanks.
2 weeks ago
Hi @greengil,
Have you considered Lakeflow Connect? Databricks now has a native Jira connector in Lakeflow Connect that can do what you are looking for. It's still in Beta, but it's worth evaluating.
It ingests Jira into Delta with incremental (delta) loads out of the box, supports SCD1/SCD2, handles deletes via audit logs, and runs fully managed on serverless with Unity Catalog governance. This is lower-effort and better integrated than both Fivetran and custom Python, and it directly targets your large-volume, changes-only requirement.
If you can't use the Databricks Jira connector, prefer Fivetran Jira --> Databricks over custom code for a managed, low-maintenance ELT path. Only build custom Python pipelines if you have very specific requirements that neither managed option can meet.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
a week ago
Hi @Ashwin_DSA - Thank you for the information. Appreciate it. Regarding the built-in Lakeflow Connect, I see that it will ingest all the Jira tables into Databricks. Is there a way to ingest only a subset of data? For example, instead of all issues, I want only a subset. Thanks.
a week ago
Hi @greengil,
Yes, you can restrict what Lakeflow Connect for Jira ingests, both at the table level and (partially) at the row level.
In the UI, on the Source step, you can select only the tables you care about (for example, just issues, or issues + projects) instead of all source tables. In DABs/API, only list the tables you want under objects.
The Jira connector supports filtering by Jira project/space via jira_options.include_jira_spaces (list of project keys). In the UI, this is exposed as an option to filter the data by Jira spaces or projects (you enter project keys, not names or IDs).
If you are looking for anything more granular than project/space (e.g. specific issue types, statuses, labels), that's not supported as of now. The connector ingests all matching issues for those projects/spaces, and you then filter downstream in silver/gold tables. More general row-level filtering for Jira is on the backlog but not yet available.
Refer to these pages: Jira pipeline and limitations.
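To make the table selection and project filter concrete, here is a rough sketch of what the pipeline spec could look like if you create it through the Pipelines REST API instead of the UI. The ingestion_definition / objects layout follows the general Lakeflow Connect managed-connector examples, and the jira_options block matches what the Jira docs describe; exact field names and nesting may differ slightly for the Jira connector, so treat this as a sketch and verify against the pages above.

# Sketch only: field names follow the generic managed-connector pipeline spec.
# Host, token, connection name, and catalog/schema names are placeholders.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<pat-or-oauth-token>"

pipeline_spec = {
    "name": "jira_ingestion",
    "serverless": True,
    "ingestion_definition": {
        "connection_name": "my_jira_connection",  # your Unity Catalog connection
        "objects": [
            # List only the tables you want instead of all source tables
            {"table": {"source_table": "issues",
                       "destination_catalog": "main",
                       "destination_schema": "jira_bronze"}},
            {"table": {"source_table": "projects",
                       "destination_catalog": "main",
                       "destination_schema": "jira_bronze"}},
        ],
        # Row-level restriction by Jira project/space: project KEYS, not names or IDs
        "connector_options": {
            "jira_options": {"include_jira_spaces": ["ENG", "PROD"]}
        },
    },
}

resp = requests.post(f"{HOST}/api/2.0/pipelines",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=pipeline_spec)
resp.raise_for_status()
print(resp.json())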
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Wednesday
Hi @Ashwin_DSA - I tried out Lakeflow Connect by following the instructions here: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/jira
When running the pipeline, I got the below error on each table's ingestion. I did make sure the OAuth and connector were set up correctly. Do you know why? Thanks in advance!
com.databricks.pipelines.execution.conduit.common.DataConnectorException: [SAAS_CONNECTOR_SOURCE_API_ERROR] An error occurred in the JIRA API call. Source API type: sourceApi.jira.listBoards. Error code: API_ERROR.
Try refreshing the destination table. If the issue persists, please file a ticket.
Thursday
Hi @greengil,
The errors you're seeing are two separate issues, though they probably show up at the same time.
The SAAS_CONNECTOR_SOURCE_API_ERROR on sourceApi.jira.listBoards usually means Jira itself is rejecting the listBoards call, most often because the OAuth app or connection user doesn't have enough board/project permissions. For the Lakeflow Jira connector, you need not just the base scopes like read:jira-work and read:jira-user, but also the board-related scopes such as read:board-scope.admin:jira-software and read:project:jira, and the connection user should have at least Administer Projects / ADMINISTER GLOBAL privileges in Jira for the tables you're ingesting.
If you're filtering by project, double-check that include_jira_spaces contains exact project keys (for example ENG, PROD), not project names or numeric IDs, and that the projects are active and accessible to that user.
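One quick way to rule Jira permissions in or out is to call the same APIs yourself with the OAuth token, outside Databricks. Here is a minimal sketch using the standard Atlassian OAuth 2.0 endpoints; the token is a placeholder, and the assumption is that the connector's listBoards call maps to the public Agile board API.

import requests

ACCESS_TOKEN = "<atlassian-oauth-access-token>"  # placeholder
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}", "Accept": "application/json"}

# 1. Which Jira sites (cloud IDs) can this token see at all?
sites = requests.get("https://api.atlassian.com/oauth/token/accessible-resources",
                     headers=headers)
sites.raise_for_status()
print(sites.json())  # should include your <tenant>.atlassian.net site
cloud_id = sites.json()[0]["id"]

# 2. Can it list boards? Roughly what the connector's listBoards call needs.
boards = requests.get(
    f"https://api.atlassian.com/ex/jira/{cloud_id}/rest/agile/1.0/board",
    headers=headers)
print(boards.status_code, boards.text[:500])
# A 401/403 here points at missing scopes or Jira permissions rather than Databricks.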
The other error, UnknownHostException: api.atlassian.com on issues_without_deletes, is a lower-level networking problem: the serverless cluster can't resolve or reach api.atlassian.com. That typically happens when serverless egress control or another outbound network policy is enabled but the Atlassian domains aren't allowlisted, similar to how other SaaS connectors fail with UnknownHostException when their API hosts are blocked.
If your workspace uses serverless egress/network policies, you'll need to allow outbound HTTPS to api.atlassian.com (and your <tenant>.atlassian.net Jira host), or relax the policy so those domains are reachable. Once DNS/network access to api.atlassian.com is fixed and the OAuth scopes and Jira permissions are aligned with the Jira connector reference, rerun the pipeline.
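Before touching any scopes, it's worth confirming the network path itself. A minimal check you could run in a notebook on the same serverless compute (the <tenant> host is a placeholder):

import socket
import requests

for host in ["api.atlassian.com", "<tenant>.atlassian.net"]:  # replace <tenant>
    try:
        addrs = socket.getaddrinfo(host, 443)
        print(host, "resolves to", sorted({a[4][0] for a in addrs}))
        resp = requests.get(f"https://{host}", timeout=10)
        print(host, "reachable, HTTP", resp.status_code)
    except Exception as exc:
        print(host, "FAILED:", exc)

If this fails with a resolution or connection error, the problem is the egress policy, not the connector configuration.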
If the same DataConnectorException persists, it's worth opening a Databricks support ticket with the pipeline ID and failing run ID so the team can look at backend logs, especially since the Jira connector is still in Beta.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Thursday
Hi @Ashwin_DSA -
I am trying this on the Databricks free edition. How do I set up the outbound allowlist? Once everything is working, I will do this setup in our paid version of Databricks. Or perhaps the free edition won't allow any public access?
Regarding the permissions, I have granted all the permissions documented in the articles. I also use OAuth to connect, and I logged in with my account, which has global admin privileges, so it should have all the needed permissions. Basically, the ingestion fails on every table. Maybe it's due to that api.atlassian.com issue discussed above?
By the way, where do I specify that only specific Jira projects should be ingested? During the configuration, I can select which tables to ingest, such as issues, issuetype, etc., but I don't see an option to ingest only certain projects.
Thanks!
Thursday
Hi @greengil,
The UnknownHostException for api.atlassian.com is a pure networking/DNS problem, not a permissions issue. The Jira connector talks to both your <tenant>.atlassian.net host and api.atlassian.com (for things like audit logs, deletes, etc.). If the serverless cluster can't resolve or reach api.atlassian.com, all the tables that rely on that path will fail, regardless of how perfect your OAuth scopes and Jira permissions are.
On Databricks, that kind of UnknownHostException usually means serverless egress control or some other network policy is in place and the SaaS hostname isn't on the allowlist. The outbound allowlist is configured via serverless network policies at the account level, and that's not something you can tweak from a free/Community-style workspace UI. In practice, that means in the free edition you're using now, you probably can't change the outbound allowlist yourself. In a proper paid workspace, an account admin can define a serverless network policy that allowlists api.atlassian.com and your <tenant>.atlassian.net host, which should clear the UnknownHostException for the Jira connector.
Given you've already granted the documented scopes and authenticated with a global-admin Jira account, your permissions setup is likely fine. The fact that ingestion fails for every table lines up with the api.atlassian.com DNS issue being the real blocker.
On your last question about ingesting only specific Jira projects: the current UI wizard only lets you pick which tables to ingest (issues, issue_types, etc.). Project-level filtering is exposed through the pipeline definition rather than the click-through UI. In the YAML / notebook examples, you can add:
connector_options:
  jira_options:
    include_jira_spaces:
      - KEY1
      - KEY2
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Friday
@Ashwin_DSA Thank you! Will try that on the paid version. Another question: I assume this connector will take care of deleted items, such as Jira issues, components, and custom field values? Same for custom field value or name changes? Thanks.
Saturday
Hi @greengil,
Mostly yes, but with some caveats.
For Jira issues, the connector does support deletes. It relies on the Jira audit logs API; when that's enabled and the connection user has global admin, deleted issues are tracked. With SCD2 enabled, they receive a delete timestamp. With SCD1, they're removed from the destination table on the next run.
For comments and worklogs, deletes are not handled incrementally. You only pick those up via a full refresh of the corresponding tables.
For dimension-style entities like components, projects, users, custom field definitions, etc., those tables are modelled as a full refresh on each run, so if a component or custom field is deleted or renamed, the next pipeline run will simply reflect the current state from Jira (old entries disappear, names change to the new ones). Check this page.
For custom field values per issue, the issue_field_values table is incremental (SCD1/SCD2), so changes to the value are picked up on update. With SCD2 you can also see the history of those value changes over time. Check this page.
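If you go with SCD2 and later want to inspect those delete timestamps and value histories, a rough notebook sketch looks like the following. All table, column, and history-column names here are assumptions based on the usual __START_AT/__END_AT SCD2 convention, not the connector's documented schema, so check the actual schema of your ingested tables first.

# Historical (closed-out) versions of each issue, newest first
history = spark.sql("""
    SELECT id, __START_AT, __END_AT
    FROM main.jira_bronze.issues      -- your destination catalog/schema
    WHERE __END_AT IS NOT NULL
    ORDER BY __END_AT DESC
""")
display(history)

# Issues with no current (open) row at all, i.e. deleted upstream rather than just updated
deleted = spark.sql("""
    SELECT id, MAX(__END_AT) AS last_seen_at
    FROM main.jira_bronze.issues
    GROUP BY id
    HAVING SUM(CASE WHEN __END_AT IS NULL THEN 1 ELSE 0 END) = 0
""")
display(deleted)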
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Wednesday
There are other errors, like this one:
a week ago
Hi @greengil, good question. I went through something similar recently, so sharing what I found.
My instinct was also to build it in Python, but once I dug in, the "just write a script" path hides a lot of pain:
Fivetran's Jira connector handles all of this natively: JQL-based incremental sync, webhook-based deletion capture, auto-populated ISSUE_FIELD_HISTORY tables, schema drift detection, MERGE into Delta, and it's available through Databricks Partner Connect for quick setup. There's also a free dbt package (fivetran/dbt_jira) with pre-built analytics models.
My take: I would suggest going with Fivetran unless you have a specific reason not to - high-volume cost concerns, a need for archived issues, or data residency restrictions. Custom Python makes sense for narrow use cases, but it's weeks of build plus ongoing maintenance.
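For a sense of what the custom path actually involves, here is a toy sketch of a JQL-based incremental pull using the public Jira Cloud search API. The tenant, credentials, and watermark are placeholders, and everything around it (state tracking, deletes, schema drift, the MERGE into Delta) is exactly the part you would have to build and maintain yourself.

import requests

BASE = "https://<tenant>.atlassian.net"      # placeholder
AUTH = ("you@example.com", "<api-token>")    # basic auth with a Jira API token
last_sync = "2024-01-01 00:00"               # would come from your own state/watermark table

start_at, page_size = 0, 100
while True:
    resp = requests.get(
        f"{BASE}/rest/api/3/search",
        params={
            "jql": f'updated >= "{last_sync}" ORDER BY updated ASC',
            "startAt": start_at,
            "maxResults": page_size,
            "fields": "summary,status,updated,project",
        },
        auth=AUTH,
    )
    resp.raise_for_status()
    page = resp.json()
    issues = page.get("issues", [])
    # ... land `issues` somewhere durable, then MERGE into your Delta table ...
    start_at += len(issues)
    if not issues or start_at >= page.get("total", 0):
        break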
References: I did some research and came up with this solution; please take a look, I think you will find it helpful:
Happy to dig in further if you're leaning one way.
Wednesday
I'll keep this in mind. Thanks!