2 weeks ago
We need to import a large amount of Jira data into Databricks and should import only the delta changes. What's the best approach: use the Fivetran Jira connector or develop our own Python scripts/pipeline code? Thanks.
2 weeks ago
Hi @greengil,
Have you considered Lakeflow Connect? Databricks now has a native Jira connector in Lakeflow Connect that can do what you are looking for. It's still in Beta, but it's worth evaluating.
It ingests Jira into Delta with incremental (delta) loads out of the box, supports SCD1/SCD2, handles deletes via audit logs, and runs fully managed on serverless with Unity Catalog governance. This is lower-effort and better integrated than both Fivetran and custom Python, and it directly targets your large-volume, changes-only requirement.
If you can't use the Databricks Jira connector, prefer Fivetran Jira --> Databricks over custom code for a managed, low-maintenance ELT path. Only build custom Python pipelines if you have very specific requirements that neither managed option can meet.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
a week ago
Hi @Ashwin_DSA - Thank you for the information. Appreciate it. Regarding the built-in Lakeflow Connect, I see that it will ingest all the Jira tables into Databricks. Is there a way to ingest only a subset of data? For example, instead of all issues, I want only a subset. Thanks.
a week ago
Hi @greengil,
Yes, you can restrict what Lakeflow Connect for Jira ingests, both at the table level and (partially) at the row level.
In the UI, on the Source step, you can select only the tables you care about (for example, just issues, or issues + projects) instead of all source tables. In DABs/API, only list the tables you want under objects.
The Jira connector supports filtering by Jira project/space via jira_options.include_jira_spaces (list of project keys). In the UI, this is exposed as an option to filter the data by Jira spaces or projects (you enter project keys, not names or IDs).
If you are looking for anything more granular than project/space (e.g. specific issue types, statuses, labels), that's not supported as of now. The connector ingests all matching issues for those projects/spaces, and you then filter downstream in silver/gold tables. More general row-level filtering for Jira is on the backlog but not yet available.
Refer to these pages: Jira pipeline and limitations.
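To make the table selection and project filter concrete, here is a rough sketch of what the pipeline spec could look like if you create it through the Pipelines REST API instead of the UI. The ingestion_definition / objects layout follows the general Lakeflow Connect managed-connector examples, and the jira_options block matches what the Jira docs describe; exact field names and nesting may differ slightly for the Jira connector, so treat this as a sketch and verify against the pages above.

# Sketch only: field names follow the generic managed-connector pipeline spec.
# Host, token, connection name, and catalog/schema names are placeholders.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<pat-or-oauth-token>"

pipeline_spec = {
    "name": "jira_ingestion",
    "serverless": True,
    "ingestion_definition": {
        "connection_name": "my_jira_connection",  # your Unity Catalog connection
        "objects": [
            # List only the tables you want instead of all source tables
            {"table": {"source_table": "issues",
                       "destination_catalog": "main",
                       "destination_schema": "jira_bronze"}},
            {"table": {"source_table": "projects",
                       "destination_catalog": "main",
                       "destination_schema": "jira_bronze"}},
        ],
        # Row-level restriction by Jira project/space: project KEYS, not names or IDs
        "connector_options": {
            "jira_options": {"include_jira_spaces": ["ENG", "PROD"]}
        },
    },
}

resp = requests.post(f"{HOST}/api/2.0/pipelines",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=pipeline_spec)
resp.raise_for_status()
print(resp.json())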
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Wednesday
Hi @Ashwin_DSA - I tried out Lakeflow Connect by following the instructions here: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/jira
When running the pipeline, I got the below error on each table's ingestion. I did make sure the OAuth and connector were set up correctly. Do you know why? Thanks in advance!
com.databricks.pipelines.execution.conduit.common.DataConnectorException: [SAAS_CONNECTOR_SOURCE_API_ERROR] An error occurred in the JIRA API call. Source API type: sourceApi.jira.listBoards. Error code: API_ERROR.
Try refreshing the destination table. If the issue persists, please file a ticket.
Thursday
Hi @greengil,
The errors you're seeing are two separate issues, though they probably show up at the same time.
The SAAS_CONNECTOR_SOURCE_API_ERROR on sourceApi.jira.listBoards usually means Jira itself is rejecting the listBoards call, most often because the OAuth app or connection user doesn't have enough board/project permissions. For the Lakeflow Jira connector, you need not just the base scopes like read:jira-work and read:jira-user, but also the board-related scopes such as read:board-scope.admin:jira-software and read:project:jira, and the connection user should have at least Administer Projects / ADMINISTER GLOBAL privileges in Jira for the tables you're ingesting.
If you're filtering by project, double-check that include_jira_spaces contains exact project keys (for example ENG, PROD), not project names or numeric IDs, and that the projects are active and accessible to that user.
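One quick way to rule Jira permissions in or out is to call the same APIs yourself with the OAuth token, outside Databricks. Here is a minimal sketch using the standard Atlassian OAuth 2.0 endpoints; the token is a placeholder, and the assumption is that the connector's listBoards call maps to the public Agile board API.

import requests

ACCESS_TOKEN = "<atlassian-oauth-access-token>"  # placeholder
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}", "Accept": "application/json"}

# 1. Which Jira sites (cloud IDs) can this token see at all?
sites = requests.get("https://api.atlassian.com/oauth/token/accessible-resources",
                     headers=headers)
sites.raise_for_status()
print(sites.json())  # should include your <tenant>.atlassian.net site
cloud_id = sites.json()[0]["id"]

# 2. Can it list boards? Roughly what the connector's listBoards call needs.
boards = requests.get(
    f"https://api.atlassian.com/ex/jira/{cloud_id}/rest/agile/1.0/board",
    headers=headers)
print(boards.status_code, boards.text[:500])
# A 401/403 here points at missing scopes or Jira permissions rather than Databricks.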
The other error, UnknownHostException: api.atlassian.com on issues_without_deletes, is a lower-level networking problem: the serverless cluster can't resolve or reach api.atlassian.com. That typically happens when serverless egress control or another outbound network policy is enabled but the Atlassian domains aren't allowlisted, similar to how other SaaS connectors fail with UnknownHostException when their API hosts are blocked.
If your workspace uses serverless egress/network policies, you'll need to allow outbound HTTPS to api.atlassian.com (and your <tenant>.atlassian.net Jira host), or relax the policy so those domains are reachable. Once DNS/network access to api.atlassian.com is fixed and the OAuth scopes and Jira permissions are aligned with the Jira connector reference, rerun the pipeline.
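Before touching any scopes, it's worth confirming the network path itself. A minimal check you could run in a notebook on the same serverless compute (the <tenant> host is a placeholder):

import socket
import requests

for host in ["api.atlassian.com", "<tenant>.atlassian.net"]:  # replace <tenant>
    try:
        addrs = socket.getaddrinfo(host, 443)
        print(host, "resolves to", sorted({a[4][0] for a in addrs}))
        resp = requests.get(f"https://{host}", timeout=10)
        print(host, "reachable, HTTP", resp.status_code)
    except Exception as exc:
        print(host, "FAILED:", exc)

If this fails with a resolution or connection error, the problem is the egress policy, not the connector configuration.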
If the same DataConnectorException persists, it's worth opening a Databricks support ticket with the pipeline ID and failing run ID so the team can look at backend logs, especially since the Jira connector is still in Beta.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Thursday
Hi @Ashwin_DSA -
I am trying this on the Databricks free edition. How do I set up the outbound allowlist? Once everything is working, I will do this setup in our paid version of Databricks. Or perhaps the free edition won't allow any public access?
Regarding the permissions, I have granted all the permissions documented in the articles. I also use OAuth to connect, and I logged in with my account, which has global admin privileges, so it should have all the needed permissions. Basically, the ingestion fails on every table. Maybe it's due to that api.atlassian.com issue discussed above?
By the way, where do I specify that only specific Jira projects should be ingested? During the configuration, I can select which tables to ingest, such as issues, issuetype, etc., but I don't see an option to ingest only certain projects.
Thanks!
Thursday
Hi @greengil,
The UnknownHostException for api.atlassian.com is a pure networking/DNS problem, not a permissions issue. The Jira connector talks to both your <tenant>.atlassian.net host and api.atlassian.com (for things like audit logs, deletes, etc.). If the serverless cluster can't resolve or reach api.atlassian.com, all the tables that rely on that path will fail, regardless of how perfect your OAuth scopes and Jira permissions are.
On Databricks, that kind of UnknownHostException usually means serverless egress control or some other network policy is in place and the SaaS hostname isn't on the allowlist. The outbound allowlist is configured via serverless network policies at the account level, and that's not something you can tweak from a free/Community-style workspace UI. In practice, that means in the free edition you're using now, you probably can't change the outbound allowlist yourself. In a proper paid workspace, an account admin can define a serverless network policy that allowlists api.atlassian.com and your <tenant>.atlassian.net host, which should clear the UnknownHostException for the Jira connector.
Given you've already granted the documented scopes and authenticated with a global-admin Jira account, your permissions setup is likely fine. The fact that ingestion fails for every table lines up with the api.atlassian.com DNS issue being the real blocker.
On your last question about ingesting only specific Jira projects: the current UI wizard only lets you pick which tables to ingest (issues, issue_types, etc.). Project-level filtering is exposed through the pipeline definition rather than the click-through UI. In the YAML / notebook examples, you can add:
connector_options:
  jira_options:
    include_jira_spaces:
      - KEY1
      - KEY2
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Friday
@Ashwin_DSA Thank you! Will try that on the paid version. Another question: I assume this connector will take care of deleted items, such as Jira issues, components, and custom field values? Same for custom field value or name changes? Thanks.
Saturday
Hi @greengil,
Mostly yes, but with some caveats.
For Jira issues, the connector does support deletes. It relies on the Jira audit logs API; when that's enabled and the connection user has global admin, deleted issues are tracked. With SCD2 enabled, they receive a delete timestamp. With SCD1, they're removed from the destination table on the next run.
For comments and worklogs, deletes are not handled incrementally. You only pick those up via a full refresh of the corresponding tables.
For dimension-style entities like components, projects, users, custom field definitions, etc., those tables are modelled as a full refresh on each run, so if a component or custom field is deleted or renamed, the next pipeline run will simply reflect the current state from Jira (old entries disappear, names change to the new ones). Check this page.
For custom field values per issue, the issue_field_values table is incremental (SCD1/SCD2), so changes to the value are picked up on update. With SCD2 you can also see the history of those value changes over time. Check this page.
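If you go with SCD2 and later want to inspect those delete timestamps and value histories, a rough notebook sketch looks like the following. All table, column, and history-column names here are assumptions based on the usual __START_AT/__END_AT SCD2 convention, not the connector's documented schema, so check the actual schema of your ingested tables first.

# Historical (closed-out) versions of each issue, newest first
history = spark.sql("""
    SELECT id, __START_AT, __END_AT
    FROM main.jira_bronze.issues      -- your destination catalog/schema
    WHERE __END_AT IS NOT NULL
    ORDER BY __END_AT DESC
""")
display(history)

# Issues with no current (open) row at all, i.e. deleted upstream rather than just updated
deleted = spark.sql("""
    SELECT id, MAX(__END_AT) AS last_seen_at
    FROM main.jira_bronze.issues
    GROUP BY id
    HAVING SUM(CASE WHEN __END_AT IS NULL THEN 1 ELSE 0 END) = 0
""")
display(deleted)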
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Wednesday
There are other errors, like this one:
a week ago
Hi @greengil, good question. I went through something similar recently, so sharing what I found.
My instinct was also to build it in Python, but once I dug in, the "just write a script" path hides a lot of pain:
Fivetran's Jira connector handles all of this natively: JQL-based incremental sync, webhook-based deletion capture, auto-populated ISSUE_FIELD_HISTORY tables, schema drift detection, MERGE into Delta, and it's available through Databricks Partner Connect for quick setup. There's also a free dbt package (fivetran/dbt_jira) with pre-built analytics models.
My take: I would suggest going with Fivetran unless you have a specific reason not to - high-volume cost concerns, a need for archived issues, or data residency restrictions. Custom Python makes sense for narrow use cases, but it's weeks of build plus ongoing maintenance.
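For a sense of what the custom path actually involves, here is a toy sketch of a JQL-based incremental pull using the public Jira Cloud search API. The tenant, credentials, and watermark are placeholders, and everything around it (state tracking, deletes, schema drift, the MERGE into Delta) is exactly the part you would have to build and maintain yourself.

import requests

BASE = "https://<tenant>.atlassian.net"      # placeholder
AUTH = ("you@example.com", "<api-token>")    # basic auth with a Jira API token
last_sync = "2024-01-01 00:00"               # would come from your own state/watermark table

start_at, page_size = 0, 100
while True:
    resp = requests.get(
        f"{BASE}/rest/api/3/search",
        params={
            "jql": f'updated >= "{last_sync}" ORDER BY updated ASC',
            "startAt": start_at,
            "maxResults": page_size,
            "fields": "summary,status,updated,project",
        },
        auth=AUTH,
    )
    resp.raise_for_status()
    page = resp.json()
    issues = page.get("issues", [])
    # ... land `issues` somewhere durable, then MERGE into your Delta table ...
    start_at += len(issues)
    if not issues or start_at >= page.get("total", 0):
        break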
References: I did some research and came up with this solution; please take a look, I think you will find it helpful:
Happy to dig in further if you're leaning one way.
Wednesday
I'll keep this in mind. Thanks!