What is NLP and how we use it to detect malicious emails
Updated at: Jul 13, 2020
Natural language processing (NLP) is a field of artificial intelligence and machine learning that deals with the ability of a computer or machine to understand and analyze human language. Depending on the application, the machine can even generate or create human language.
To put it another way, NLP is a field of study that involves different algorithms and concepts that allow a machine to make decisions based on information and interactions from human language.
When we say human language, we’re talking about speech, writing and signs.
To illustrate, natural language processing is behind several activities in our daily lives. For example, it's through NLP that you can communicate with a virtual assistant, such as Siri (from Apple), Alexa (from Amazon), and Google Assistant.
Other NLP applications include automatic text correction, content translation, chatbots on websites, conversion of sign language to text, speech recognition, and even identification of malicious emails, such as spam and phishing.
This last point, in fact, says a lot about our work here at Gatefy. One of the main NLP algorithms we use to detect malicious emails is BERT. We'll talk more about it later in this article.
Components of natural language
As you can already see, NLP is applied in different areas and technologies. Each area makes different use of NLP, drawing on seven components that form the basis of a natural language: phonetics, phonology, morphology, lexicon, syntax, semantics, and pragmatics.
In short, phonetics and phonology are about sound and its acoustic properties. Morphology concerns the structure of words. Lexicon and syntax are related to the use and structure of words and phrases.
Finally, semantics and pragmatics analyze the meaning and context of sentences, paragraphs, and texts.
Most used algorithms in NLP
There are different techniques and algorithms used in NLP. We will briefly explain some of them. Then we’ll focus on BERT and the role it plays in detecting malicious emails.
1. Bag of Words
Bag of Words is a technique used to vectorize a text. That is, it represents the text by counting how many times each word occurs, without regard to word order.
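The word counting described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in any particular product; the function names `tokenize` and `bag_of_words` and the two sample emails are assumptions made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct word seen, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample emails, invented for illustration.
emails = [
    "Win a free prize now",
    "Free free offer, claim your prize",
]
vocabulary, vectors = bag_of_words(emails)
for email, vector in zip(emails, vectors):
    print(email, "->", vector)
```

Each position in a vector corresponds to one word of the vocabulary, so the second email has a 2 in the position for "free". Word order is discarded, which is what gives the technique its name.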
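2. TF-IDF
TF-IDF (term frequency-inverse document frequency) is a weighting scheme that takes into account both how often a word appears in a text and how many texts in the collection contain it. Words that are frequent in one text but rare across the collection receive a high weight, while words that appear in nearly every text receive a weight close to zero.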
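As a rough sketch of this weighting, the snippet below uses one common textbook formulation: term frequency is the word's share of the document, and inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the word. Libraries often use smoothed variants, so the exact numbers here are an assumption of this example, as are the function name `tf_idf` and the sample documents.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-tokenized sample documents.
docs = [
    ["hello", "free", "prize"],
    ["hello", "meeting", "notes"],
    ["hello", "free", "offer"],
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In this toy collection "hello" appears in every document, so its weight is zero, while "prize" appears in only one document and gets the highest weight in the first document.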
3. Stemming
Stemming is a cruder, rule-based technique used for text normalization, often as a preprocessing step for classification. It reduces words to their root by removing affixes (prefixes, infixes, and suffixes).
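To give a feel for how rule-based stemming works, here is a deliberately naive sketch that strips only a handful of English suffixes. The suffix list and the `naive_stem` function are hypothetical and far simpler than established stemmers such as the Porter algorithm; prefixes and infixes are not handled at all.

```python
# Hypothetical, deliberately tiny suffix list; checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

words = ["clicked", "clicking", "clicks", "links", "bus"]
stems = [naive_stem(w) for w in words]
print(stems)
```

The three forms of "click" collapse to the same root, and the minimum-length check keeps a short word like "bus" from being cut down to "bu". Rules this simple will still mangle many words, which is why stemming is considered the cruder option.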
4. Lemmatization
Lemmatization is a technique used to convert words into their base form (lemma) and to group different forms of the same term. It’s also used for text normalization.
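The grouping idea can be illustrated with a toy dictionary lookup. Real lemmatizers rely on large lexicons and part-of-speech information; the small `LEMMA_TABLE` below is hand-written for this example, and the function names are assumptions.

```python
# Hypothetical hand-written lookup; real lemmatizers use full lexicons.
LEMMA_TABLE = {
    "is": "be", "are": "be", "was": "be", "were": "be",
    "sent": "send", "sending": "send", "sends": "send",
    "mice": "mouse", "better": "good",
}

def lemmatize(word):
    """Return the lemma for a word, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)

def group_by_lemma(words):
    """Group the different surface forms under their shared lemma."""
    groups = {}
    for word in words:
        groups.setdefault(lemmatize(word), []).append(word)
    return groups

groups = group_by_lemma(["sent", "Sending", "sends", "was", "are", "invoice"])
print(groups)
```

Unlike suffix stripping, the lookup handles irregular forms such as "was" and "sent", because it maps each word to a real dictionary entry instead of cutting characters off the end.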
5. BERT
BERT is a high-performance language model used to understand and analyze a text based on the context of the words.
Using NLP and BERT to detect malicious emails, such as spam and phishing
As we mentioned, BERT (Bidirectional Encoder Representations from Transformers) is an algorithm in the NLP field that has the ability to analyze and learn relationships between words based on a context. In NLP, this mechanism is called attention.
Another great advantage of BERT over many earlier language models is that it was designed to analyze texts in both directions. That is, it interprets each word using the words to its left and to its right at the same time. This property is called bidirectionality.
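The two ideas above, attention and bidirectionality, can be sketched with a single simplified self-attention step in plain Python. This is an illustration only: BERT itself uses learned query, key, and value projections, multiple attention heads, and many stacked layers, none of which appear here. The two-dimensional embeddings are invented for the example, and the function names are assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """One simplified scaled dot-product self-attention step.

    Every token scores every token in the sequence, to its left and to its
    right, and its output is the weighted average of all token vectors.
    """
    dim = len(vectors[0])
    all_weights, outputs = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for three tokens.
tokens = ["verify", "your", "account"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each row of weights shows how much one token draws on every other token. The middle token "your" receives non-zero weight from both its left neighbor and its right neighbor, which is the bidirectional behavior described above.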
The combination of attention and bidirectionality mechanisms allows some systems based on BERT to be extremely efficient in identifying and classifying texts. And this is where Gatefy's evolution in detecting malicious emails comes in.
We've adopted the BERT model as one of the main mechanisms of our artificial intelligence system.
This way, our system is able to analyze and understand the message's context and then accurately determine whether the message is legitimate or malicious, such as part of a spam or phishing campaign.
Combining BERT with other types of algorithms results in a more efficient and faster artificial intelligence system.
In other words, we’re talking about email security and better message management. Gatefy's email protection system allows you to have visibility and control over emails, minimizing the risk of data breaches and infections.
Your team will not waste time handling unwanted messages, nor risk being more exposed to threats that could compromise the entire company.
To sum it up, it's important to be clear that Gatefy's artificial intelligence system is always learning. Over time, the solution becomes smarter and more accurate, improving its own performance and results.
It's also important to keep in mind that human language is highly complex, which is why we use different techniques to handle different challenges and types of cyber threats.