Hi @Samy Syed,
The Spark NLP library and all the pretrained models/pipelines can be used entirely offline, without Internet access. Suppose you are behind a proxy or a firewall with no access to the Maven repository (to download packages) and/or no access to S3 (to download models and pipelines automatically). In that case, you can simply follow these instructions to use Spark NLP offline without any limitations:
- Instead of using the Maven package, you need to load the Fat JAR.
- Instead of using PretrainedPipeline for pretrained pipelines or the .pretrained() function to download pretrained models, you must manually download your pipeline/model from Models Hub, extract it, and load it.
Example of a SparkSession with the Fat JAR to use Spark NLP offline:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars", "/tmp/spark-nlp-assembly-4.3.0.jar") \
    .getOrCreate()
- You can download the provided Fat JARs from each release's notes; please pay attention to pick the one that suits your environment, depending on the device (CPU/GPU) and Apache Spark™ version (3.x).
- If you are running locally, you can load the Fat JAR from your local file system; however, in a cluster setup, you need to put the Fat JAR on a distributed file system such as HDFS, DBFS, S3, etc. (e.g., hdfs:///tmp/spark-nlp-assembly-4.3.0.jar)
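As a side note, the Fat JAR file names follow a simple pattern, so a small helper can assemble the expected artifact name for your setup. The naming convention below is inferred from the release assets (CPU vs. GPU builds), so treat it as an assumption and double-check against the actual release notes:

```python
def fat_jar_name(version: str, device: str = "cpu") -> str:
    """Return the expected Spark NLP Fat JAR file name.

    Assumed release-asset naming convention:
      CPU: spark-nlp-assembly-<version>.jar
      GPU: spark-nlp-gpu-assembly-<version>.jar
    """
    if device == "gpu":
        return f"spark-nlp-gpu-assembly-{version}.jar"
    return f"spark-nlp-assembly-{version}.jar"

print(fat_jar_name("4.3.0"))         # spark-nlp-assembly-4.3.0.jar
print(fat_jar_name("4.3.0", "gpu"))  # spark-nlp-gpu-assembly-4.3.0.jar
```

The returned name can then be appended to a local path (or an hdfs:/// prefix in a cluster) and passed to the spark.jars config shown above.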
Example of using pretrained Models and Pipelines offline:
# instead of using pretrained() for online:
# french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang="fr")
# you download this model, extract it, and use .load
french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")\
.setInputCols("document", "token")\
.setOutputCol("pos")
# example for pipelines
# instead of using PretrainedPipeline
# pipeline = PretrainedPipeline('explain_document_dl', lang='en')
# you download this pipeline, extract it, and use PipelineModel
PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
- Since you are manually downloading and loading models/pipelines, Spark NLP is not downloading the most recent and compatible models/pipelines for you; choosing the proper model/pipeline for your environment is up to you.
- If you are running locally, you can load the model/pipeline from your local file system; however, in a cluster setup, you need to put the model/pipeline on a distributed file system such as HDFS, DBFS, S3, etc. (e.g., hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/)