MidSem - 1

  1. Describe in brief the need to study NLP. Also explain two applications of NLP in brief.

    Natural Language Processing (NLP) is an interdisciplinary field that focuses on making computers understand, interpret, and generate natural language text or speech. The need to study NLP arises from the fact that humans primarily communicate through language, and language processing is crucial for several applications such as information retrieval, text classification, machine translation, sentiment analysis, and speech recognition. Here are two examples of NLP applications:

    1. Chatbots: Chatbots are computer programs that interact with users through natural language text or speech. They use NLP techniques to analyze user input, understand their intent, and provide appropriate responses. Chatbots are widely used in customer service, e-commerce, and healthcare, among other industries. They can help improve customer experience, automate tasks, and reduce costs.
    2. Sentiment analysis: Sentiment analysis is the process of determining the emotional tone of a piece of text. NLP techniques can be used to analyze social media posts, customer reviews, and other forms of text data to determine the sentiment (positive, negative, or neutral) expressed. Sentiment analysis has several applications, including brand monitoring, market research, and customer feedback analysis.

    In summary, the study of NLP is essential for building intelligent systems that can understand, interpret, and generate natural language text or speech. NLP has several applications, including chatbots, sentiment analysis, machine translation, and speech recognition, among others.

  2. Explain five phases of NLP in brief.

    The five phases of Natural Language Processing (NLP) are as follows:

    1. Lexical Analysis: This is the first phase of NLP and involves breaking down the input text into words or tokens, also known as tokenization. It involves identifying the base form of each word (lemmatization) and determining the part of speech (POS) of each word (part-of-speech tagging). The output of this phase is a list of words or tokens with their associated POS tags.
    2. Syntactic Analysis: In this phase, the grammatical structure of the input text is analyzed, also known as parsing. It involves identifying the relationship between words and constructing a parse tree that represents the syntactic structure of the sentence. This phase helps in identifying the subject, predicate, and object of a sentence, among other things.
    3. Semantic Analysis: Once the syntactic structure of the sentence is determined, the next phase is to extract the meaning of the sentence. Semantic analysis involves understanding the meaning of individual words and their relationships with each other in the sentence. It involves identifying entities, relationships, and concepts mentioned in the text.
    4. Discourse Analysis: This phase involves analyzing the context of the text beyond a single sentence. It involves analyzing the relationships between sentences and paragraphs to determine the overall meaning of the text. Discourse analysis helps in identifying the overall theme and intent of the text.
    5. Pragmatic Analysis: The final phase of NLP is pragmatic analysis, which involves understanding the context in which the text is being used. It involves analyzing the social, cultural, and situational context of the text. Pragmatic analysis helps in understanding the intended meaning of the text and identifying any implied meanings or cultural references.

    In summary, the five phases of NLP are lexical analysis, syntactic analysis, semantic analysis, discourse analysis, and pragmatic analysis. These phases are crucial in building intelligent systems that can understand and generate natural language text or speech.
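
    The first phases can be made concrete with an off-the-shelf NLP library. The sketch below uses spaCy (an assumed tool, not prescribed by the notes) to show tokenization, lemmatization, and POS tagging (lexical analysis), dependency parsing (syntactic analysis), and named-entity extraction as a first step toward semantic analysis; discourse and pragmatic analysis need context beyond a single sentence and are not shown.

    ```python
    # A minimal sketch (assumes spaCy and its "en_core_web_sm" model are installed).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple opened a new office in Mumbai in 2023.")

    # Lexical analysis: tokens, lemmas, and part-of-speech tags
    for token in doc:
        print(token.text, token.lemma_, token.pos_)

    # Syntactic analysis: dependency relation of each word to its head
    for token in doc:
        print(token.text, "->", token.dep_, "->", token.head.text)

    # Toward semantic analysis: named entities found in the text
    for ent in doc.ents:
        print(ent.text, ent.label_)
    ```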

  3. What are the various ambiguities that are encountered in NLP? Explain with examples.

    Ambiguity is a common challenge in Natural Language Processing (NLP): a word, phrase, or sentence can have more than one possible interpretation, and the system must decide which one is intended. Here are the main types of ambiguity encountered in NLP:

    1. Lexical Ambiguity: Lexical ambiguity arises when a word has multiple meanings, depending on the context in which it is used. For example, the word "bank" can refer to a financial institution or the edge of a river. In the sentence "I went to the bank," the word "bank" is ambiguous on its own, and its intended meaning can only be determined from the surrounding context (for example, "I deposited money at the bank" points to the financial sense).
    2. Syntactic Ambiguity: Syntactic ambiguity arises when the structure of a sentence can be parsed in multiple ways. For example, the sentence "I saw her duck" can be interpreted as "I watched her lower her head" (where "duck" is a verb) or "I saw the duck that belongs to her" (where "duck" is a noun). The ambiguity arises because "duck" can function as a noun or a verb, leading to two different parse trees for the same sentence.
    3. Semantic Ambiguity: Semantic ambiguity arises when a sentence as a whole has multiple interpretations even after its structure is analyzed. For example, the sentence "He saw her with a telescope" can be interpreted in two ways: "He used a telescope to see her" or "He saw her while she was carrying a telescope." The ambiguity arises because the phrase "with a telescope" can be associated either with the act of seeing or with the person being seen.
    4. Anaphoric Ambiguity: Anaphoric ambiguity arises when a pronoun refers back to a previously mentioned noun or phrase, but it is unclear which one it refers to. For example, in "Mary called Jane to talk about her grades. She was happy.", the pronouns "her" and "she" can each refer to either Mary or Jane, leading to ambiguity.
    5. Referential Ambiguity: Referential ambiguity arises when a noun phrase can refer to more than one entity in the situation being described. For example, in the sentence "The man saw the boy with the telescope", if the context contains several men or several boys, it is unclear which particular man or boy is being referred to.

    In summary, ambiguity is a common challenge in NLP, and it can arise due to lexical, syntactic, semantic, anaphoric, or referential reasons. To overcome these ambiguities, NLP systems employ various techniques such as machine learning, rule-based approaches, and statistical models.

  4. Define types and tokens. How many types and tokens are there in the following sentence? "They picnicked by the pool, then they lay back on the grass and looked at the stars"

    In Natural Language Processing (NLP), tokens refer to individual words or units of text that are separated by whitespace or punctuation marks. Types, on the other hand, refer to unique words or distinct forms that exist in a text.

    For example, in the sentence "The cat in the hat", there are five tokens ("The", "cat", "in", "the", "hat") but only four types ("the", "cat", "in", "hat"), because "The" and "the" are two tokens of the same type (ignoring case).

    In the sentence "They picnicked by the pool, then they lay back on the grass and looked at the stars", there are 14 tokens and 12 types. Here is a breakdown of the types and tokens:

    Types: they, picnicked, by, the, pool, then, lay, back, on, grass, and, looked, at, stars.

    Tokens: They, picnicked, by, the, pool, then, they, lay, back, on, the, grass, and, looked, at, the, stars.

    Note that the word "the" appears twice in the sentence, so it is counted as two tokens but only one type. Therefore, the sentence has 14 tokens and 12 types.
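
    These counts can be verified with a few lines of code. The sketch below is just a check (not part of the required answer); it uses a simple regular-expression tokenizer that drops punctuation and lower-cases the text.

    ```python
    # Count tokens and types for the exam sentence, ignoring punctuation and case.
    import re

    sentence = ("They picnicked by the pool, then they lay back "
                "on the grass and looked at the stars")

    tokens = re.findall(r"[a-z]+", sentence.lower())  # word tokens only
    types = set(tokens)                               # distinct word forms

    print(len(tokens))  # 17 tokens
    print(len(types))   # 14 types
    ```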

  5. Explain the notion of the n-gram language model. Describe how Maximum Likelihood estimate of a trigram language model is computed. Why do we need smoothing (in language modelling)?

    The n-gram language model is a statistical language model used in Natural Language Processing (NLP) to predict the probability of a sequence of words. It estimates the probability of the next word in a sequence given the previous n-1 words. An n-gram is a sequence of n words from a given text, where n can be any positive integer. For example, in the sentence "I love to eat pizza," a trigram would be "I love to" or "to eat pizza".

    The Maximum Likelihood estimate of a trigram language model is computed by counting the occurrences of each trigram in a training corpus and dividing that count by the count of its bigram prefix (the first two words of the trigram). The probability of a trigram can be computed using the following formula:

    P(w3 | w1w2) = count(w1w2w3) / count(w1w2)

    where w1, w2, and w3 are the three words in the trigram. The probability of a trigram is calculated by dividing the count of that trigram by the count of the bigram (w1w2) that precedes it in the training corpus.

    However, the problem with Maximum Likelihood estimation is that it tends to assign zero probability to unseen n-grams, which results in poor performance in language modeling. This problem is known as the "zero-frequency problem". To solve this problem, we need smoothing techniques to distribute some of the probability mass from the observed n-grams to unseen n-grams. Smoothing helps to improve the performance of a language model by assigning non-zero probabilities to unseen n-grams. Common smoothing techniques used in language modeling include Laplace smoothing, Kneser-Ney smoothing, and Good-Turing smoothing.
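
    A toy implementation makes the MLE formula and the effect of smoothing concrete. The sketch below is purely illustrative (the tiny corpus and function names are made up); it computes the MLE trigram probability and an add-one (Laplace) smoothed version.

    ```python
    # Trigram MLE and Laplace smoothing on a toy corpus (illustrative only).
    from collections import Counter

    corpus = "i love to eat pizza i love to eat pasta".split()

    trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    vocab = set(corpus)

    def p_mle(w1, w2, w3):
        """Maximum Likelihood estimate: count(w1 w2 w3) / count(w1 w2)."""
        if bigram_counts[(w1, w2)] == 0:
            return 0.0
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

    def p_laplace(w1, w2, w3):
        """Add-one smoothing: (count + 1) / (bigram count + |V|)."""
        return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + len(vocab))

    print(p_mle("love", "to", "eat"))        # 1.0  (seen trigram)
    print(p_mle("love", "to", "drink"))      # 0.0  (unseen: the zero-frequency problem)
    print(p_laplace("love", "to", "drink"))  # small but non-zero after smoothing
    ```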


MidSem 2

  1. Write four applications of N-gram language model. Explain each one in brief.

    Here are four applications of the N-gram language model:

    1. Speech Recognition: The N-gram language model is used in speech recognition systems to identify the most likely sequence of words given an audio input. This is done by computing the probability of each possible sequence of words using the N-gram language model and selecting the sequence with the highest probability as the recognized speech output.
    2. Machine Translation: N-gram language models are used in machine translation systems to predict the most likely sequence of words in a target language given a sentence in the source language. This is done by computing the probability of each possible target language sentence using the N-gram language model and selecting the sentence with the highest probability as the translation output.
    3. Text Prediction: The N-gram language model is used in text prediction systems to suggest the most likely next word in a sequence of words. This is done by computing the probability of each possible next word using the N-gram language model and suggesting the word with the highest probability as the predicted next word.
    4. Spelling Correction: N-gram language models are used in spelling correction systems to suggest the most likely corrected spelling of a misspelled word. This is done by computing the probability of each possible corrected spelling of the word using the N-gram language model and suggesting the spelling with the highest probability as the corrected spelling.

    In all of these applications, the N-gram language model plays a crucial role in predicting the most likely sequence of words given a context, which is essential for accurate and effective natural language processing.

  2. What is morphological computation? Explain the concept of free and bound morphemes.

    Morphological computation (computational morphology) refers to processing the internal structure of words by computer: analysing a surface word form into its root and affixes, or generating word forms from a root. It is based on the idea that the internal structure of words carries important information that aids language processing, such as inflectional markers, derivational affixes, and root forms, and it underlies tasks like stemming, lemmatization, and morphological parsing.

    In linguistics, a morpheme is the smallest unit of meaning in a language. It is the basic unit that is combined with other morphemes to create words. Morphemes can be either free or bound.

    A free morpheme is a morpheme that can stand alone as a word and has its own meaning. For example, "book", "cat", and "house" are all free morphemes because they can be used as words on their own.

    A bound morpheme is a morpheme that cannot stand alone as a word and is always attached to another morpheme. Bound morphemes can be prefixes, suffixes, or infixes, and they typically modify the meaning of the morpheme they are attached to. For example, in the word "unhappy", "un-" is a bound morpheme (a prefix), and "happy" is the free morpheme it attaches to.

    The concept of free and bound morphemes is important in morphological analysis, which is the study of the internal structure of words. By identifying the free and bound morphemes in a word, linguists and computational linguists can gain insight into the meaning and structure of the language being studied. In addition, this knowledge can aid in natural language processing tasks such as text classification, information retrieval, and machine translation.
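
    As a very rough illustration of morphological analysis, the toy sketch below splits a word into a prefix, stem, and suffix using a small hard-coded affix list; real systems use full rule sets or finite-state transducers and handle spelling changes, which this sketch does not.

    ```python
    # Toy morpheme splitter (illustrative only; the affix lists are made up and tiny).
    PREFIXES = ["un", "re", "dis"]
    SUFFIXES = ["ness", "ing", "ed", "s"]

    def analyse(word):
        morphemes = []
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                morphemes.append(p + "-")   # bound morpheme (prefix)
                word = word[len(p):]
                break
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                morphemes.append(word[:-len(s)])  # (approximate) free morpheme / stem
                morphemes.append("-" + s)         # bound morpheme (suffix)
                return morphemes
        morphemes.append(word)                    # free morpheme on its own
        return morphemes

    print(analyse("unhappiness"))  # ['un-', 'happi', '-ness'] (no spelling rules applied)
    print(analyse("books"))        # ['book', '-s']
    print(analyse("cat"))          # ['cat']
    ```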

  3. What is Named Entity Recognition in NLP? Explain with examples the ambiguities that can occur in NER.

    Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and extracting named entities from text, such as names of people, organizations, locations, dates, and numerical expressions. The goal of NER is to automatically identify and classify these entities in a text, which can be useful for many NLP applications such as text classification, sentiment analysis, and information extraction.

    However, there are several ambiguities that can occur in NER, including:

    1. Overlapping (nested) Entities: Sometimes one named entity is contained inside another. For example, in the sentence "She studies at the University of Washington", the organization "University of Washington" contains the location "Washington", and the system must decide which span and which label to assign. This can create ambiguity in identifying the entity boundaries.
    2. Multi-word Entities: Named entities can also be composed of multiple words. For example, consider the entity "New York City" in the sentence "I am traveling to New York City next week." In this case, the entity is composed of multiple words, and it can be difficult to identify the entity boundaries correctly.
    3. Ambiguous Words: Some words can belong to more than one entity type. For example, the word "Paris" can refer to the city (a location) or to a person's name; without enough context (compare "I visited Paris last summer" with "Paris gave an interview yesterday"), it can be difficult to assign the correct entity type.
    4. Unknown Entities: Named entities can also be unknown or unfamiliar, which can make it difficult to identify them correctly. For example, consider the sentence "I went to the store to buy some Kleenex." Kleenex is a brand name, and it may not be present in the NER model's dictionary of named entities.

    To overcome these ambiguities, NER models often use machine learning techniques and rule-based systems to identify named entities accurately. They may also use additional features such as context, part-of-speech tagging, and syntactic parsing to disambiguate named entities in a text.
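
    The sketch below shows how NER looks in practice using spaCy (an assumed, commonly used library; any NER toolkit would do). It prints each entity span found in the sentence together with a label such as PERSON, ORG, or GPE (geopolitical entity).

    ```python
    # Minimal NER example (assumes spaCy and "en_core_web_sm" are installed).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("John works at Microsoft and lives in Seattle.")

    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Typically: John -> PERSON, Microsoft -> ORG, Seattle -> GPE
    ```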

  4. Explain the concept of Vector Semantics. Use the raw shortened co-occurrence table given below to calculate cosine similarity and show which of the words cherry or digital is closer in meaning to information.

    |             | pie | data | computer |
    | ----------- | --- | ---- | -------- |
    | cherry      | 442 | 8    | 2        |
    | digital     | 5   | 1683 | 1670     |
    | information | 5   | 3982 | 3325     |

    Vector Semantics is a concept in Natural Language Processing (NLP) that represents words and their meanings as high-dimensional vectors in a vector space. In this approach, words are represented as vectors, and their meanings are captured by the distributional patterns of the words they co-occur with in a large corpus of text. The idea is that words that have similar meanings will have similar vector representations and will be close to each other in the vector space.

    To calculate the cosine similarity between two words in the vector space, we need to represent them as vectors and then calculate the cosine of the angle between them. The cosine similarity measures the degree of similarity between two vectors, where a value of 1 indicates that the vectors are identical, and a value of 0 indicates that they are orthogonal (completely dissimilar).

    Using the table given, we can represent the words as vectors by taking the values in each row as the components of the corresponding vector. For example, the vector representation of "information" would be [5, 3982, 3325]. We can then calculate the cosine similarity between this vector and the vectors for "cherry" and "digital" to determine which word is closer in meaning to "information".

    The word vectors taken from the rows of the table are: cherry = [442, 8, 2], digital = [5, 1683, 1670], information = [5, 3982, 3325].

    The cosine similarity between "information" and "cherry" is calculated as follows:

    dot product = 442×5 + 8×3982 + 2×3325 = 2210 + 31856 + 6650 = 40716

    ‖cherry‖ = sqrt(442^2 + 8^2 + 2^2) = sqrt(195432) ≈ 442.077

    ‖information‖ = sqrt(5^2 + 3982^2 + 3325^2) = sqrt(26911974) ≈ 5187.675

    cosine(information, cherry) = 40716 / (442.077 × 5187.675) ≈ 0.0178

    The cosine similarity between "information" and "digital" is calculated in the same way:

    dot product = 5×5 + 1683×3982 + 1670×3325 = 25 + 6701706 + 5552750 = 12254481

    ‖digital‖ = sqrt(5^2 + 1683^2 + 1670^2) = sqrt(5621414) ≈ 2370.952

    cosine(information, digital) = 12254481 / (2370.952 × 5187.675) ≈ 0.9963

    Since 0.9963 is far greater than 0.0178, "digital" is closer in meaning to "information" than "cherry" is.
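
    The arithmetic above can be double-checked with a short NumPy snippet:

    ```python
    # Verify the cosine similarities computed by hand above.
    import numpy as np

    cherry = np.array([442, 8, 2])
    digital = np.array([5, 1683, 1670])
    information = np.array([5, 3982, 3325])

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(information, cherry))   # ≈ 0.0178
    print(cosine(information, digital))  # ≈ 0.9963
    ```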


    Unit 3

    Important Topics:

    Bag of words model

    The Bag of Words model is a common technique used in natural language processing (NLP) to represent text as numerical features that can be used in machine learning algorithms. It involves breaking up a piece of text into its individual words (or tokens), and counting the frequency of each word to create a numerical representation of the text.

    Here's an example of how the Bag of Words model works:

    Suppose we have the following two sentences:

    1. "The dog sat on the mat."
    2. "The dog played with the ball."

    To create a Bag of Words representation of these sentences, we would first break them up into individual words and remove any stop words (common words such as "the", "on", or "with" that don't add much meaning):

    1. dog, sat, mat
    2. dog, played, ball

    Next, we create a list of all the unique words in the sentences (our "vocabulary"): dog, sat, mat, played, ball.

    Finally, we count the frequency of each word in each sentence and create a numerical representation of the text:

    |            | dog | sat | mat | played | ball |
    | ---------- | --- | --- | --- | ------ | ---- |
    | Sentence 1 | 1   | 1   | 1   | 0      | 0    |
    | Sentence 2 | 1   | 0   | 0   | 1      | 1    |

    In this representation, each row represents a sentence, and each column represents a unique word in the vocabulary. The numbers in the cells represent the frequency of each word in each sentence.

    The Bag of Words model is a simple but powerful technique that can be used in a variety of NLP tasks, such as sentiment analysis, document classification, and text generation. However, it has some limitations, such as the fact that it ignores word order and context, which can be important in some applications.
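
    The whole pipeline above can be reproduced with scikit-learn's CountVectorizer (one possible tool, not the only way); the sentences are the illustrative ones used above.

    ```python
    # Bag of Words with scikit-learn (illustrative sketch).
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "The dog sat on the mat.",
        "The dog played with the ball.",
    ]

    vectorizer = CountVectorizer(stop_words="english")  # drops common stop words
    X = vectorizer.fit_transform(sentences)

    print(vectorizer.get_feature_names_out())  # the vocabulary (column order)
    print(X.toarray())                         # one row of word counts per sentence
    ```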

    Skip-gram model

    The Skip-gram model is another common technique used in natural language processing (NLP) to represent words as numerical vectors that can be used in machine learning algorithms. Unlike the Bag of Words model, which represents each word as a count of its frequency in a piece of text, the Skip-gram model considers the context in which a word appears to learn a more nuanced representation.

    Here's how the Skip-gram model works:

    1. We start with a large corpus of text (e.g. a collection of documents, a Wikipedia dump, etc.).
    2. For each word in the corpus, we define a "window" of surrounding words (e.g. the 5 words to the left and right of the target word).
    3. We create a training set of (word, context) pairs, where the word is the target word and the context is one of the words in the window.
    4. We train a neural network to predict the context given the target word (or vice versa). The weights of the neural network are used as the numerical representation of each word.

    Here's an example of how the Skip-gram model works:

    Suppose we have the following sentence: "the quick brown fox jumps"

    We choose a window size of 2, so each word is paired with up to two words to its left and two words to its right. This gives us the following (target, context) training pairs:

    (the, quick), (the, brown), (quick, the), (quick, brown), (quick, fox), (brown, the), (brown, quick), (brown, fox), (brown, jumps), (fox, quick), (fox, brown), (fox, jumps), (jumps, brown), (jumps, fox)
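
    A small function makes the pair-generation step explicit; the sketch below (illustrative only) reproduces the pairs listed above for the same sentence and window size.

    ```python
    # Generate (target, context) training pairs for the Skip-gram model.
    def skipgram_pairs(tokens, window=2):
        pairs = []
        for i, target in enumerate(tokens):
            start = max(0, i - window)              # up to `window` words to the left
            end = min(len(tokens), i + window + 1)  # up to `window` words to the right
            for j in range(start, end):
                if j != i:
                    pairs.append((target, tokens[j]))
        return pairs

    tokens = "the quick brown fox jumps".split()
    for target, context in skipgram_pairs(tokens, window=2):
        print(target, context)
    # e.g. the quick / the brown / quick the / quick brown / quick fox / ...
    ```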