Email Extraction

Sangeetha112
New Contributor

Hi, hope you are doing well. I was trying to extract a specific email attachment from Outlook and write it to a DBFS location, but something went wrong. Could you please help? Here is the code I used.

 

import imaplib
import email
import os
from email.header import decode_header
from email.utils import parseaddr
import base64

IMAP_SERVER = "outlook.office365.com"
IMAP_PORT = 993
EMAIL_ACCOUNT = "------------"
PASSWORD =

try:
    mail = imaplib.IMAP4_SSL(IMAP_SERVER, IMAP_PORT)
    mail.login(EMAIL_ACCOUNT, PASSWORD)
    mail.select("inbox")

    status, messages = mail.search(None, '(SUBJECT "API_Files")')
    email_ids = messages[0].split()

    for email_id in email_ids:
        status, msg_data = mail.fetch(email_id, "(RFC822)")
        for response_part in msg_data:
            if isinstance(response_part, tuple):
                msg = email.message_from_bytes(response_part[1])
               
                subject, encoding = decode_header(msg["Subject"])[0]
                if isinstance(subject, bytes):
                    subject = subject.decode(encoding if encoding else "utf-8")
                   
                from_ = msg.get("From")
                from_email = parseaddr(from_)[1]
               
                if msg.is_multipart():
                    for part in msg.walk():
                        content_type = part.get_content_type()
                        content_disposition = str(part.get("Content-Disposition"))
                       
                        if "attachment" in content_disposition:
                            filename = part.get_filename()
                            if filename:
                                filepath = f"dbfs:/tmp{filename}"
                                with open(filepath, "wb") as f:
                                    f.write(part.get_payload(decode=True))
                                print(f"Attachment saved to {filepath}")
                else:
                    pass
    mail.logout()
except imaplib.IMAP4.error as e:
    print(f"IMAP error: {e}")
1 REPLY

Stefan-Koch
Contributor III

If you run into issues with IMAP, consider using the Microsoft Graph API for email access. It supports Outlook mailboxes without requiring you to handle IMAP details, and it authenticates with OAuth2 tokens instead of a stored password.
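If you would rather stay with IMAP, one likely culprit in the posted snippet is the target path: `f"dbfs:/tmp{filename}"` has no separator between the directory and the file name, and Python's built-in `open()` treats `dbfs:` as an ordinary relative directory name rather than a DBFS location. On clusters where the `/dbfs` mount is available (an assumption, since it is not exposed in every cluster configuration), a local-style path is the usual workaround. Below is a minimal sketch; the helper name `dbfs_local_path` and the default `base_dir` are hypothetical, not part of the original code.

```python
import posixpath


def dbfs_local_path(filename, base_dir="/dbfs/tmp"):
    """Build a local-style path under the /dbfs mount for an attachment name.

    Hypothetical helper: base_dir assumes the /dbfs mount exists on the cluster.
    """
    # Keep only the last path component so names such as "../x" or
    # "..\\x" cannot point outside base_dir.
    safe_name = posixpath.basename(filename.replace("\\", "/"))
    if safe_name in ("", ".", ".."):
        raise ValueError(f"Unusable attachment filename: {filename!r}")
    return posixpath.join(base_dir, safe_name)


print(dbfs_local_path("report.csv"))  # prints /dbfs/tmp/report.csv
```

In the IMAP loop, `filepath = dbfs_local_path(filename)` would then replace the `dbfs:/tmp{filename}` line, and the existing `open(filepath, "wb")` call could stay as it is.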

The following is a sample Graph API script; I have not tested it:

 

pip install msal

import base64
import os

import requests
from msal import ConfidentialClientApplication

# Azure AD App Credentials
CLIENT_ID = os.getenv("CLIENT_ID")  # Client ID from Azure App Registration
CLIENT_SECRET = os.getenv("CLIENT_SECRET")  # Client Secret from Azure App
TENANT_ID = os.getenv("TENANT_ID")  # Tenant ID
EMAIL_ADDRESS = "your-email@company.com"

# Microsoft Graph API URL
GRAPH_API_ENDPOINT = "https://graph.microsoft.com/v1.0"

# Authentication (client-credentials flow, which yields an app-only token)
def get_access_token():
    app = ConfidentialClientApplication(
        CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{TENANT_ID}",
        client_credential=CLIENT_SECRET,
    )

    token_response = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" in token_response:
        return token_response["access_token"]
    else:
        raise Exception(f"Failed to get access token: {token_response}")

# Get emails with attachments
def get_emails_with_attachments():
    access_token = get_access_token()
    headers = {"Authorization": f"Bearer {access_token}"}

    # Fetch the first 10 emails that have attachments
    response = requests.get(
        f"{GRAPH_API_ENDPOINT}/users/{EMAIL_ADDRESS}/messages?$filter=hasAttachments eq true&$top=10",
        headers=headers,
    )
    response.raise_for_status()
    emails = response.json()["value"]

    for message in emails:
        print(f"Email Subject: {message['subject']}")
        email_id = message["id"]
        download_attachments(email_id, headers)

# Download attachments
def download_attachments(email_id, headers):
    # An app-only token has no signed-in user, so address the mailbox
    # through /users/{EMAIL_ADDRESS} rather than /me.
    response = requests.get(
        f"{GRAPH_API_ENDPOINT}/users/{EMAIL_ADDRESS}/messages/{email_id}/attachments",
        headers=headers,
    )
    response.raise_for_status()
    attachments = response.json()["value"]

    for attachment in attachments:
        if "contentBytes" in attachment:
            # Keep only the base name so the file always lands in /tmp
            filename = os.path.basename(attachment["name"])
            file_data = attachment["contentBytes"]
            file_path = f"/tmp/{filename}"

            # Save locally first; contentBytes is base64-encoded
            with open(file_path, "wb") as f:
                f.write(base64.b64decode(file_data))
            print(f"Saved attachment: {filename}")

            # Upload to DBFS (dbutils is predefined in Databricks notebooks)
            dbfs_path = f"dbfs:/tmp/{filename}"
            dbutils.fs.cp(f"file:{file_path}", dbfs_path)
            print(f"Uploaded to DBFS: {dbfs_path}")

# Run the download
get_emails_with_attachments()
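
The script above reads only a single page of messages. Graph returns collections in pages and, when more results exist, includes an `@odata.nextLink` URL in the response. If the mailbox holds more matching emails than one page, something like the following sketch could walk all of them. The helper name `iter_graph_items` and the `fetch_json` callable are hypothetical; the demo uses fake in-memory pages rather than real Graph responses.

```python
def iter_graph_items(first_url, fetch_json):
    """Yield every item of a paged collection by following @odata.nextLink.

    fetch_json is any callable that takes a URL and returns the parsed
    JSON body as a dict (hypothetical interface, chosen for this sketch).
    """
    url = first_url
    while url:
        page = fetch_json(url)
        yield from page.get("value", [])
        # The key is absent on the last page, which ends the loop.
        url = page.get("@odata.nextLink")


# Demo with fake pages standing in for Graph responses.
fake_pages = {
    "page1": {"value": [{"subject": "a"}, {"subject": "b"}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"subject": "c"}]},
}
subjects = [m["subject"] for m in iter_graph_items("page1", fake_pages.__getitem__)]
print(subjects)  # prints ['a', 'b', 'c']
```

With `requests`, `fetch_json` could be a small function that calls `requests.get(url, headers=headers)`, checks the status with `raise_for_status()`, and returns `response.json()`.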

 

 

Another approach, if you are on Azure, is to use Logic Apps. Have a look here: https://bakshiharsh55.medium.com/save-e-mail-attachment-to-blob-storage-utilizing-azure-logic-app-9d...
