LLMs – Embeddings 01



This content originally appeared on DEV Community and was authored by Sakshi

Word Embeddings

An assignment of words to numbers is called a word embedding.

Consider an example: suppose we have a 2D plane with word labels placed at various coordinates. Apple, banana, and mango sit close to one another; carrot, radish, and potato form another nearby cluster; and laptop, mobile, and TV sit far away from both the fruit cluster and the vegetable cluster. Now, if we have to place “orange”, around which coordinates should it go?

It should be placed with the fruits, near banana, mango, and apple.

Words that are similar should correspond to points that are close by.

Words that are different should correspond to points that are far away.
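
To make this concrete, here is a minimal sketch in Python with made-up 2D coordinates: every word gets a point, and a new word is placed by finding its nearest neighbours. All the numbers below are invented purely for illustration.

```python
# A toy 2D word embedding with made-up coordinates, purely for illustration.
embedding = {
    "apple":  (1.0, 1.0), "banana": (1.2, 0.9), "mango":  (0.9, 1.3),
    "carrot": (5.0, 5.0), "radish": (5.2, 4.8), "potato": (4.9, 5.3),
    "laptop": (9.0, 1.0), "mobile": (9.3, 0.8), "TV":     (8.8, 1.2),
}

def nearest_word(point):
    """Return the word whose coordinates are closest to the given point."""
    return min(
        embedding,
        key=lambda w: (embedding[w][0] - point[0]) ** 2
                      + (embedding[w][1] - point[1]) ** 2,
    )

# If "orange" lands at (1.1, 1.1), its nearest neighbour is a fruit:
print(nearest_word((1.1, 1.1)))  # -> apple
```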

There is something more to these word embeddings: they don’t only capture word similarity, they also capture other properties of the language.


Since each feature is one new axis, or coordinate, a good embedding must have many more than two coordinates assigned to every word.

One of the XYZ embeddings, for example, has 5024 coordinates associated with each word. These rows of 5024 (or however many) coordinates are called vectors, so we often talk about the vector corresponding to a word, and refer to each of the numbers inside a vector as a coordinate.

You can think of a vector as a list of coordinates, such as [3, 4] in two dimensions; a collection of word vectors then looks like [[3, 4], [5, 8], [7, 9], [12, 24]].

Some of these coordinates may represent important properties of the word, such as age, gender, size. Some may represent combinations of properties. But some others may represent obscure properties that a human may not be able to understand. But all in all, a word embedding can be seen as a good way to translate human language (words) into computer language (numbers), so that we can start training machine learning models with these numbers.
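
As a toy illustration of properties beyond similarity, here is the classic king/queen analogy with hand-picked two-coordinate vectors. Real embeddings learn this kind of structure across thousands of coordinates, and the dimensions are rarely this clean.

```python
# Hand-picked 2D vectors: coordinate 0 ~ "royalty", coordinate 1 ~ "gender".
# Real embedding coordinates are learned, not hand-labelled like this.
king  = [9,  9]
queen = [9, -9]
man   = [1,  9]
woman = [1, -9]

# The famous word-analogy arithmetic: king - man + woman = queen
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)           # [9, -9]
print(result == queen)  # True, in this contrived example
```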

Sentence Embeddings

A sentence embedding is just like a word embedding, except it associates every sentence with a vector full of numbers satisfying similar properties as a word embedding.

For instance, similar sentences are assigned to similar vectors, different sentences are assigned to different vectors, and most importantly, each of the coordinates of the vector identifies some (whether clear or obscure) property of the sentence.

Head over to the Cohere Playground, try providing some inputs (a few similar sentences, a few different ones), and hit the RUN button.
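
If you prefer code to the playground UI, here is a minimal sketch assuming the Cohere Python SDK (pip install cohere). The placeholder API key, the model name, and the input_type parameter are assumptions based on Cohere’s v3 embed models, so check the current docs before running it.

```python
# Minimal sketch, assuming the Cohere Python SDK and a v3 embed model.
# The API key, model name, and input_type are assumptions -- verify them
# against Cohere's current documentation.
import cohere

co = cohere.Client("YOUR_API_KEY")

texts = [
    "Hello, how are you?",
    "Bonjour, comment ça va?",  # French translation of the first sentence
    "I like trains.",           # something unrelated
]

response = co.embed(
    texts=texts,
    model="embed-multilingual-v3.0",
    input_type="search_document",
)

for text, vector in zip(texts, response.embeddings):
    print(f"{text!r} -> {len(vector)} coordinates, first three: {vector[:3]}")
```

Similar sentences (the English greeting and its French translation) should come back with similar vectors, while the unrelated sentence should not.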

There are some challenges associated with this approach; let us look at what they are.

Similarity Between Sentences

For large language models, it is crucial to know when two words, or two sentences, are similar or different.

We can measure similarity and dissimilarity between all kinds of things: movies, car models, countries, anything.

But for that, you need to brush up on your engineering math.

All through school and college I literally waited for the day I would finally get rid of math, but it seems I will never escape this subject. So here is a [link](https://www.datacamp.com/tutorial/demystifying-mathematics-concepts-deep-learning) to study the math concepts used in DL.

PLUS, also check this link to learn about cosine similarity and dissimilarity.

A word embedding is an assignment of a list of numbers (a vector) to every word, in a way that semantic properties of the word translate into mathematical properties of the numbers. What do we mean by this? For example, two similar words will have similar vectors, and two different words will have different vectors. But most importantly, each entry in the vector corresponding to a word keeps track of some property of the word. Some of these properties can be understandable to humans, such as age, size, gender, etc., but some others could potentially only be understood by the computer. Either way, we can benefit from these embeddings for many useful tasks.

Sentence embeddings are even more powerful, as they assign a vector of numbers to each sentence, in a way that these numbers also carry important properties of the sentence.

In this way, the sentence “Hello, how are you?” and its corresponding French translation, “Bonjour, comment ça va?” will be assigned very similar numbers, as they have the same semantic meaning.

Method 1 – DOT PRODUCT

Heard this before? MMMMM…MATHS. High school maths.

THANK ME LATER

The dot product, also known as the scalar product or inner product, is a mathematical operation that takes two equal-length sequences of numbers (usually vectors) and returns a single number.

FULL TUTORIAL for those who did not study this concept well in school

How do we decide how similar or dissimilar one movie is to another?

Notice that if two movies are similar, then they must have similar action scores and similar comedy scores. So if we multiply the two action scores, then multiply the two comedy scores, and add them, this number would be high if the scores match.

Dot product for the pair [You’ve got mail, Taken] = 0×7 + 5×0 = 0
Dot product for the pair [Rush Hour, Rush Hour 2] = 6×7 + 5×4 = 62
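
Here is a minimal sketch of that calculation in Python, using the made-up [action, comedy] scores from the example above:

```python
# Each movie is a made-up [action, comedy] score vector from the example.
movies = {
    "You've got mail": [0, 5],
    "Taken":           [7, 0],
    "Rush Hour":       [6, 5],
    "Rush Hour 2":     [7, 4],
}

def dot(u, v):
    """Multiply matching coordinates and add the results."""
    return sum(a * b for a, b in zip(u, v))

print(dot(movies["You've got mail"], movies["Taken"]))  # 0  -> very different
print(dot(movies["Rush Hour"], movies["Rush Hour 2"]))  # 62 -> very similar
```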

ANOTHER MEASURE: COSINE SIMILARITY 😈

Another measure of similarity between sentences (and words) is to look at the angle between them. For example, let’s plot the movie embedding in the plane, where the horizontal axis represents the action score and the vertical axis represents the comedy score, so each movie becomes a point in the plane.

Check the links shared above for cosine similarity and dissimilarity.

We want a measure of similarity that is high for sentences that are close to each other, and low for sentences that are far away from each other. Distance does the exact opposite.

Let’s look at the angle between the rays from the origin (the point with coordinates [0,0]) and each sentence. Notice that this angle is small if the points are close to each other, and large if the points are far away from each other. Now we need the help of another function, the cosine. The cosine of angles close to zero is close to 1, and as the angle grows, the cosine decreases. This is exactly what we need. Therefore, we define the cosine similarity as the cosine of the angle formed by the two rays going from the origin to the two sentences.

Cosine similarity quantifies how similar two vectors are via the cosine of the angle between them; cosine distance measures their dissimilarity as one minus that same cosine.
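
A minimal sketch of both quantities in plain Python, reusing the movie vectors from the dot-product example:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Dissimilarity: 0 for identical directions, 1 for perpendicular ones."""
    return 1 - cosine_similarity(u, v)

print(cosine_similarity([6, 5], [7, 4]))  # Rush Hour vs Rush Hour 2 -> ~0.98
print(cosine_similarity([0, 5], [7, 0]))  # You've got mail vs Taken -> 0.0
print(cosine_distance([0, 5], [7, 0]))    # -> 1.0
```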
.
.
.
.
.
Thanks for reading till here…the second blog in this series is coming soon 🙂

