
Using the OpenAI API in Databricks without iterating rows

Alessandro
New Contributor

 

Hi to everyone,

I have a Delta table with a column 'comment'. I would like to add a new column 'sentiment' and calculate it using the OpenAI API.

I already know how to create a Databricks endpoint to an external model and how to use it (for example, with LangChain) on a single comment or by iterating over the comments in a table.

I would like to know if there is a way to do it more efficiently using a UDF, or if it is possible to create a model in Databricks that calls the OpenAI API with a fixed prompt, then create an endpoint to that model and call it in a UDF.

Thank you in advance for the response. I hope my question is clear; if not, let me know and I will try to explain it better.

1 REPLY

Kaniz
Community Manager

Hi @Alessandro, your question is clear, and I appreciate your curiosity about optimizing the process.

Let’s explore a couple of approaches:

  1. UDF (User-Defined Function):

    • You can create a UDF in Databricks that invokes the OpenAI API for sentiment analysis on each comment in your Delta table.
    • Here’s a high-level outline:
      1. Define a Python function that takes a comment as input and returns the sentiment score.
      2. Register this function as a UDF in Databricks.
      3. Apply the UDF to the ‘comment’ column to calculate sentiment scores and populate the new ‘sentiment’ column.
    • While this approach works, a plain Python UDF still calls the API once per row, which might not be the most efficient.
  2. Model Endpoint:

    • You can create a model in Databricks that calls the OpenAI API with a fixed prompt (with the comment text inserted into it) and returns the sentiment score.
    • Steps:
      1. Define a custom model (e.g., an MLflow pyfunc wrapper) that sends the fixed prompt and the comment text to the OpenAI API. No training is required, since the OpenAI model produces the prediction.
      2. Deploy the model as an endpoint in Databricks.
      3. Create a UDF that calls this model endpoint for each comment, passing the comment text as input.
      4. Use the UDF to populate the ‘sentiment’ column.
    • This approach centralizes the prompt and the API calls behind one endpoint, and it may be more efficient than individual requests for each comment if the UDF sends comments to the endpoint in batches.
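
As a rough illustration of the UDF route, here is a minimal sketch, not a definitive implementation. It assumes a hypothetical `call_model` callable that takes a prompt string and returns the model's text reply (for example, a thin wrapper around your OpenAI client or your Databricks serving endpoint). The names `PROMPT`, `classify_batch`, and `make_sentiment_udf` are made up for this example. The idea is to let Spark hand the function a whole batch of comments through a pandas UDF and to skip repeated calls for identical comments within a batch.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical fixed prompt; adjust the wording to your use case.
PROMPT = (
    "Classify the sentiment of this comment as positive, negative, or neutral:\n"
    "{comment}"
)


def classify_batch(
    comments: Iterable[Optional[str]],
    call_model: Callable[[str], str],
) -> List[Optional[str]]:
    """Return one label per comment, calling the model once per distinct comment."""
    cache = {}
    labels: List[Optional[str]] = []
    for comment in comments:
        if comment is None:
            # Null comments are passed through without an API call.
            labels.append(None)
            continue
        if comment not in cache:
            reply = call_model(PROMPT.format(comment=comment))
            cache[comment] = reply.strip().lower()
        labels.append(cache[comment])
    return labels


def make_sentiment_udf(call_model: Callable[[str], str]):
    """Wrap classify_batch in a Series-to-Series pandas UDF.

    pandas and pyspark are imported lazily so that classify_batch can be
    tested on its own without a Spark installation.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def sentiment_udf(comments: pd.Series) -> pd.Series:
        return pd.Series(classify_batch(comments, call_model), index=comments.index)

    return sentiment_udf
```

On a cluster, this could be applied with something like `df.withColumn("sentiment", make_sentiment_udf(call_model)(df["comment"]))`, where `df` is the DataFrame read from your Delta table. Each distinct comment still costs one request, so the gain comes from Spark's parallelism and from the per-batch deduplication, not from fewer requests per unique comment.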

Remember to handle error cases (e.g., when the OpenAI API is unavailable) and consider performance implications (e.g., batch processing vs. real-time).
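
For the error-handling part, one common pattern is to retry a failed call with exponential backoff and fall back to a null sentiment if the API stays unavailable. The sketch below assumes the same hypothetical `call_model` callable (prompt in, text out); `call_with_retry` and its parameters are illustrative names, not part of any library. It catches every exception for brevity, whereas in practice you would probably catch only your client's timeout and rate-limit errors.

```python
import time
from typing import Callable, Optional


def call_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call the model, retrying with exponential backoff; return None if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return call_model(prompt)
        except Exception:
            # Broad catch is a simplification for this sketch.
            if attempt < max_attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    return None
```

Returning `None` rather than raising keeps a single failing comment from aborting the whole Spark job; the rows with a null sentiment can be reprocessed later.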

Feel free to ask for further details or clarification! 😊

 