Artificial Intelligence: Language
Natural Language Processing (NLP) spans the tasks in which an AI takes human language as input. The following are a few examples of such tasks:
- automatic summarization, where the AI is given text as input and produces a summary of that text as output.
- information extraction, where the AI is given a corpus of text and extracts structured data as output.
- language identification, where the AI is given text and returns the language of the text as output.
- machine translation, where the AI is given text in a source language and outputs its translation in a target language.
- named entity recognition, where the AI is given text and extracts the names of the entities in the text (for example, names of companies).
- speech recognition, where the AI is given speech and produces the same words as text.
- text classification, where the AI is given text and assigns it to a category (for example, spam or not spam).
- word sense disambiguation, where the AI must choose the right meaning of a word that has multiple meanings (e.g., “bank” means both a financial institution and the ground on the side of a river).
Syntax vs. Semantics
In the realm of machine learning and artificial intelligence, particularly when dealing with language, syntax refers to the rules governing the structure and order of words in a sentence. It's the grammar that underpins human language, and AI systems strive to understand and replicate it to process and generate human-like text.
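To make the idea of syntactic rules concrete, here is a minimal sketch using NLTK's context-free grammar tools (this assumes the nltk package is available; the toy grammar and sentence are invented for illustration):

```python
import nltk

# A toy context-free grammar: a hand-written fragment of English syntax.
# S = sentence, NP = noun phrase, VP = verb phrase, Det = determiner.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N  -> 'fox' | 'dog'
    V  -> 'chased'
""")

# The parser accepts a sentence only if the grammar's rules can produce it.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the fox chased the dog".split()):
    print(tree)
# (S (NP (Det the) (N fox)) (VP (V chased) (NP (Det the) (N dog))))
```

A parse tree like this captures structure only; nothing in the grammar says what a fox or a dog is, which is where semantics comes in.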
While syntax focuses on the structure of language, semantics deals with the meaning and interpretation of language. It's the bridge between the words on a page and the concepts they represent. The two can come apart: Chomsky's famous sentence “Colorless green ideas sleep furiously” is syntactically well formed yet semantically nonsensical.
Despite the challenges of capturing meaning, advancements in AI and ML, particularly in deep learning, have significantly improved our ability to process and understand language semantically. By combining syntactic and semantic analysis, AI systems can become more intelligent and interact with humans in a more natural way.
Strategies
What is an n-gram?
An n-gram is a contiguous sequence of n items from a sample of text. In a character n-gram the items are characters, and in a word n-gram the items are words. A unigram, bigram, and trigram are sequences of one, two, and three items, respectively. In the following sentence, the first three word trigrams are “It does not,” “does not do,” and “not do well.”
"It does not do well to dwell on dreams and forget to live." ― Albus Dumbledore
n-grams are useful for text processing. While the AI may never have seen the whole sentence before, it has likely seen parts of it, like “forget to live.” Since some words occur together more often than others, the AI can also predict the next word with some probability.
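Extracting n-grams takes only a few lines of plain Python; the `ngrams` helper below is a hypothetical name chosen for this sketch, not a standard-library function:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "It does not do well to dwell on dreams and forget to live."
words = sentence.rstrip(".").split()  # naive word tokenization

print(ngrams(words, 3)[:3])
# [('It', 'does', 'not'), ('does', 'not', 'do'), ('not', 'do', 'well')]
```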
What is Tokenization?
At its core, tokenization is the process of breaking down a piece of text into smaller, meaningful units called tokens. These tokens can be words, sentences, or even subwords (like prefixes or suffixes) depending on the specific task and the language being processed.
Word Tokenization
- Goal: To break down text into individual words.
- Process:
  - Identifying Word Boundaries: This is typically done by looking for spaces, punctuation marks, or other delimiters that separate words.
  - Creating Tokens: Each word is treated as a separate token.
Example:
- Text: "The quick brown fox jumps over the lazy dog."
- Word Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
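A naive word tokenizer can be sketched with Python's standard re module; real tokenizers handle contractions, hyphens, and punctuation much more carefully:

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# Treat maximal runs of letters as words, implicitly dropping punctuation.
word_tokens = re.findall(r"[A-Za-z]+", text)
print(word_tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```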
Sentence Tokenization
- Goal: To divide text into individual sentences.
- Process:
  - Identifying Sentence Boundaries: This is typically done by looking for punctuation marks like periods (.), question marks (?), exclamation points (!), and sometimes semicolons (;).
  - Creating Tokens: Each sentence is treated as a separate token.
Example:
- Text: "The quick brown fox jumps over the lazy dog. Did you see that?"
- Sentence Tokens: ["The quick brown fox jumps over the lazy dog.", "Did you see that?"]
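A sentence tokenizer can likewise be sketched with a single regular expression that splits after sentence-final punctuation; note that abbreviations such as “Dr.” or “e.g.” would fool this simple rule, which is why production tokenizers use more sophisticated methods:

```python
import re

text = "The quick brown fox jumps over the lazy dog. Did you see that?"

# Split after ., ?, or ! whenever it is followed by whitespace.
sentence_tokens = re.split(r"(?<=[.?!])\s+", text)
print(sentence_tokens)
# ['The quick brown fox jumps over the lazy dog.', 'Did you see that?']
```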
Why Tokenization Matters in Machine Learning
Tokenization is a crucial step in many NLP tasks because it allows machines to process and understand human language. Here are some reasons why:
- Feature Extraction: Tokenization helps create features (like word frequencies or word embeddings) that can be used to train machine learning models (see the sketch after this list).
- Text Analysis: Tokenization enables tasks like sentiment analysis, text summarization, and machine translation by breaking down text into manageable units.
- Language Modeling: Tokenization helps language models understand the structure and grammar of a language by analyzing the sequence of words and sentences.
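As a rough illustration of the feature-extraction point, here is a minimal bag-of-words sketch in plain Python; the two toy documents are invented for the example:

```python
from collections import Counter

docs = [
    "the quick brown fox",
    "the lazy dog",
]

# Bag-of-words features: map each tokenized document to word-frequency counts.
features = [Counter(doc.split()) for doc in docs]
print(features[0])
# Counter({'the': 1, 'quick': 1, 'brown': 1, 'fox': 1})
```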