spaCy, a natural language processing library

Explosion AI has announced the release of a new version of the free library spaCy, which implements natural language processing (NLP) algorithms. In practice, the project can be used to build autoresponders, bots, text classifiers, and various dialog systems that determine the meaning of phrases.

The library is designed to provide a stable API that is not tied to the algorithms used and is ready for use in real products. It uses the latest advances in NLP and the most efficient algorithms available to process information.

If a more efficient algorithm appears, the library switches to it, but this transition does not affect the API or the applications built on it.

Another feature of spaCy is an architecture designed to process complete documents, without a separate preprocessing step that splits the document into phrases. Models are offered in two versions: one for maximum performance and one for maximum accuracy.
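
As a rough illustration of document-level processing, the sketch below passes a whole text to a pipeline and only then reads the sentences back from the resulting document. It assumes spaCy v3 is installed and uses a blank English pipeline with the rule-based `sentencizer` component so that no trained model has to be downloaded; a trained pipeline would normally take sentence boundaries from its own components instead. The sample text is made up for the example.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# Rule-based sentence boundary detection (splits on sentence-final punctuation).
nlp.add_pipe("sentencizer")

# The whole text is handed over at once; it is not split beforehand.
text = "spaCy processes whole documents. Sentences are found inside the pipeline."
doc = nlp(text)

# Sentences are views (spans) over the same document object.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```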

The main features of spaCy:

  • Support for around 60 languages.
  • Pretrained models available for different languages and applications.
  • Multi-task learning with pretrained transformers such as BERT (Bidirectional Encoder Representations from Transformers).
  • Support for pretrained word vectors and embeddings.
  • High performance.
  • Production-ready training system.
  • Linguistically motivated tokenization.
  • Ready-to-use components for named entity recognition, part-of-speech tagging, text classification, labelled dependency parsing, sentence segmentation, morphological analysis, lemmatization, entity linking, etc.
  • Support for extending functionality with custom components and attributes.
  • Support for creating your own models based on PyTorch, TensorFlow and other frameworks.
  • Built-in visualizers for syntax and named entity recognition (NER).
  • Simple process of packaging and deploying models and managing workflow.
  • High accuracy.
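
Two of the features above, linguistically motivated tokenization and custom attributes, can be sketched in a few lines. This is a minimal example assuming spaCy v3 and a blank English pipeline; the attribute name `is_shouting` and the sample sentence are hypothetical and exist only for illustration.

```python
import spacy
from spacy.tokens import Token

# Blank English pipeline: only the rule-based tokenizer, no trained model.
nlp = spacy.blank("en")

# Hypothetical custom attribute, computed on access and exposed as token._.is_shouting.
# force=True lets the snippet be re-run without a "already registered" error.
Token.set_extension(
    "is_shouting",
    getter=lambda token: token.is_alpha and token.is_upper,
    force=True,
)

doc = nlp("Don't pay $5 for NLP tools.")

# The tokenizer applies language-specific rules, e.g. for contractions
# and currency symbols, rather than splitting on whitespace alone.
tokens = [token.text for token in doc]
shouting = [token.text for token in doc if token._.is_shouting]

print(tokens)
print(shouting)
```

Custom attributes live under the `._` namespace, which keeps them apart from the attributes built into the library.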

The library is written in Python, with parts in Cython, a Python extension that allows C functions to be called directly.

The project code is distributed under the MIT license. Language models are ready for 58 languages.

About the new version, spaCy 3.0

spaCy 3.0 stands out for its new trained model families for 18 languages and 59 trained pipelines in total, including 5 new transformer-based pipelines.

The model is offered in three versions (16 MB; 41 MB with 20 thousand vectors; and 491 MB with 500 thousand vectors). It is optimized to run on the CPU and includes the tok2vec, morphologizer, parser, senter, ner, attribute_ruler, and lemmatizer components.

The developers comment: "We have been working on spaCy v3.0 for over a year, and almost two years if you count all the work done on Thinc. Our main goal with the release is to make it easier to bring your own models into spaCy, especially state-of-the-art models like transformers. You can write models powering spaCy components in frameworks like PyTorch or TensorFlow, using our awesome new configuration system to describe all your settings. And since modern NLP workflows often consist of multiple steps, there is a new workflow system to help you keep your work organized."
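
To give an idea of what the configuration system describes, the sketch below inspects the configuration that spaCy v3 keeps for a pipeline object. It assumes a blank English pipeline with a single `sentencizer` component; a real training configuration has many more sections, so this shows only the general shape, not a complete setup.

```python
import spacy

# Blank English pipeline with one rule-based component.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The pipeline exposes its settings as a nested, dictionary-like config.
config = nlp.config

lang = config["nlp"]["lang"]                # language code of the pipeline
pipeline_names = config["nlp"]["pipeline"]  # component names, in order
sentencizer_factory = config["components"]["sentencizer"]["factory"]

# The same settings serialized as text, with sections such as [nlp].
config_text = config.to_str()

print(lang, pipeline_names, sentencizer_factory)
print(config_text)
```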

Other important innovations that stand out from the new version:

  • New workflow for training models.
  • New configuration system.
  • Support for transformer-based pipeline models, suitable for multi-task learning.
  • The ability to connect your own models using various machine learning frameworks, such as PyTorch, TensorFlow, and MXNet.
  • Project support to manage all stages of workflows, from preprocessing to model deployment.
  • Support for integration with Data Version Control (DVC), Streamlit, Weights & Biases and Ray packages.
  • New built-in components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
  • New API to create your own components.
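
The component API from the list above can be sketched as follows, assuming spaCy v3: a function is registered under a name with the `Language.component` decorator and then added to a pipeline by that name. The component name `token_counter` and what it stores are hypothetical, chosen only to keep the example small.

```python
import spacy
from spacy.language import Language


# Register a stateless component under a string name (hypothetical name).
@Language.component("token_counter")
def token_counter(doc):
    # A component receives a Doc, may modify it, and must return it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# In v3, components are added by their registered name, not as function objects.
nlp.add_pipe("token_counter", last=True)

doc = nlp("A short text. Two sentences.")

print(nlp.pipe_names)
print(doc.user_data["n_tokens"])
```

Referring to components by name is what lets a pipeline be described in a configuration file and rebuilt later.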

Finally, if you are interested in learning more about this new version or about spaCy, you can check the details at the following link.

