01-13-2023 07:13 AM
I am trying to extract adjective and noun phrases from a text column in a Spark DataFrame. I have written a UDF for this and am applying it to the cleaned text column. However, I am getting an error.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType
import spacy

# Load spacy model
nlp = spacy.load("en_core_web_sm")

# Define UDF to extract key phrases
def extract_adjective_noun_key_phrases(text):
    if text is None:  # nlp() does not accept None, so return no phrases for null rows
        return []
    doc = nlp(text)
    key_phrases = []
    # Stop one token early: the last token has no right-hand neighbour
    for i in range(len(doc) - 1):
        token, nxt = doc[i], doc[i + 1]
        if (token.pos_ == "ADJ" and nxt.pos_ == "NOUN") or (token.pos_ == "NOUN" and nxt.pos_ == "ADJ"):
            key_phrases.append(token.text + " " + nxt.text)
    return key_phrases

extract_adjective_noun_key_phrases_udf = udf(extract_adjective_noun_key_phrases, ArrayType(StringType()))

# Apply UDF to text column in DataFrame
pqms = pqms.withColumn("adjective_noun_key_phrases", extract_adjective_noun_key_phrases_udf(col("cleaned_text")))

# Print resulting DataFrame
display(pqms)
The expected output is a new column in the Spark DataFrame containing the extracted phrases. Any help or suggestions would be greatly appreciated.
Thanks,
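The pairing rule in the UDF above (an adjective followed by a noun, or a noun followed by an adjective) can be exercised without Spark or spaCy installed. This is only a minimal sketch: the function name and the (text, pos) tuples are hypothetical stand-ins for spaCy tokens, not part of the original code. Pairing each token with its successor through zip also means the final token is never asked for a neighbour.

```python
from typing import List, Tuple


def adjacent_adj_noun_pairs(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join neighbouring tokens whose tags are ADJ+NOUN or NOUN+ADJ.

    `tagged` is a list of (text, pos) tuples standing in for spaCy tokens.
    """
    wanted = {("ADJ", "NOUN"), ("NOUN", "ADJ")}
    phrases = []
    # zip(tagged, tagged[1:]) yields each token with the one after it
    # and stops before the last token, which has no successor.
    for (text, pos), (next_text, next_pos) in zip(tagged, tagged[1:]):
        if (pos, next_pos) in wanted:
            phrases.append(f"{text} {next_text}")
    return phrases


sample = [("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"), ("high", "ADV")]
print(adjacent_adj_noun_pairs(sample))
```

For the sample above, only "quick fox" matches the rule. An empty list or a single token yields no phrases.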
01-13-2023 08:59 AM
Hi @Aditya Singh ,
What cluster node types and DBR version are you using? Also, are you installing spaCy manually? A ModuleNotFoundError usually indicates that the library you are importing has not been installed, or was not installed correctly. You could try DBR 11.3 LTS ML, which comes with spaCy pre-installed.
01-13-2023 09:22 AM
Hi LandanG, thanks for your quick response. I am using DBR 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12). I am not sure what "cluster node types" means. I am trying to install spaCy manually using:
import sys
!{sys.executable} -m pip install spacy
Is there any other way to install spaCy? I don't have access to install libraries directly on clusters from the PyPI or Maven repositories.
Thanks,
01-13-2023 09:34 AM
@Aditya Singh
Could you try installing it like
%pip install spacy
instead? This will be a notebook-scoped library and you can run it in a notebook cell. Hopefully this works.
Thanks,
01-13-2023 10:28 AM
Thanks for your suggestion, LandanG. I can now install spaCy as a notebook-scoped library, and I can see it listed when I run %pip freeze. However, when I import it with
import spacy
it now throws a new error: ModuleNotFoundError: No module named 'spacy'.
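When a package shows up in the pip listing but the import still fails, one possible cause is that pip installed into a different Python environment than the one running the notebook. A small sketch for checking this from a notebook cell, using only the standard library (the helper name module_available is hypothetical):

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(name) is not None


# The interpreter running this cell; compare it with the one pip reports using.
print("interpreter:", sys.executable)
print("spacy importable:", module_available("spacy"))
```

If the second line prints False, the interpreter shown on the first line cannot see the package, whatever the pip listing says.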
01-13-2023 08:50 PM
@Aditya Singh
Go to Compute, click the cluster you need, click the Libraries tab, and select PyPI.
Enter a PyPI package name. To install a specific version of a library, use this format:
<library>==<version>. For example, spacy==3.4.4.
01-13-2023 11:44 PM
Only an init script will work here.
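As a rough illustration of what such an init script might contain, here is a sketch that builds the script text in Python. The helper build_init_script is hypothetical, the pip path is an assumption about where the cluster's Python environment lives, and the DBFS path in the final comment is a placeholder, so all of these should be checked against your own workspace before use.

```python
from typing import List


def build_init_script(packages: List[str]) -> str:
    """Return the text of a bash script that pip-installs the given packages."""
    lines = ["#!/bin/bash", "set -e"]
    for package in packages:
        # Assumption: the cluster's Python environment exposes pip at this path.
        lines.append(f"/databricks/python/bin/pip install {package}")
    return "\n".join(lines) + "\n"


script = build_init_script(["spacy==3.4.4"])
print(script)

# On Databricks, the text could then be written somewhere the cluster can read
# at startup, for example (placeholder path, requires a notebook session):
# dbutils.fs.put("dbfs:/<your-path>/install_spacy.sh", script, True)
```

The script would then need to be registered as an init script in the cluster configuration and the cluster restarted for it to take effect.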