LLMs and ChatGPT

Some notes on Recurrent Neural Networks (RNNs): a neural network that maintains a high-dimensional hidden state. When a new observation arrives, it updates that hidden state.
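
A minimal sketch of that update, assuming PyTorch and toy dimensions (the weight names and sizes here are illustrative, not from the notes):

```python
import torch

# Elman-style RNN update (illustrative): h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
hidden_dim, input_dim = 256, 64
W_xh = torch.randn(hidden_dim, input_dim) * 0.01   # input-to-hidden weights
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.01  # hidden-to-hidden weights
b = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Fold one new observation x_t into the high-dimensional hidden state."""
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = torch.zeros(hidden_dim)               # initial hidden state
for x_t in torch.randn(10, input_dim):    # a stream of 10 observations
    h = rnn_step(x_t, h)                  # the state is updated observation by observation
```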

In machine learning there is a lot of unity in the principles applied across different data modalities: we use the same neural network machinery, gradient-based training, and the Adam optimizer to update the weights. For RNNs we add a few extra tools to reduce the variance of the gradients. For example, we use CNNs for image learning and Transformers for NLP problems. Years back in NLP there was a different architecture for every tiny problem.
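
A rough sketch of that shared recipe, assuming PyTorch; the model here is a stand-in that could be swapped for a CNN or a Transformer without touching the rest of the loop:

```python
import torch
import torch.nn as nn

# Illustrative only: the training loop is the same whatever the architecture is.
# Swap `model` for a CNN (images) or a Transformer (text); the recipe stays the same.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)               # Adam updates the weights
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)                # dummy batch
y = torch.randint(0, 10, (16,))        # dummy labels

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                    # gradients via backpropagation
    optimizer.step()                   # Adam uses the gradients to update the weights
```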

Question: Where does vision stop and language begin?

  1. One proposed direction is to develop Reinforcement Learning techniques that help supervised learning perform better.
  2. Another area of active research is spike-timing-dependent plasticity (STDP). STDP has been shown to work as a learning algorithm for feed-forward artificial neural networks in pattern recognition. A general approach, borrowed from the core biological principle, is to apply a window function (Δw) to each synapse in the network: the window function increases the weight (and therefore the strength of the connection) of a synapse when the parent neuron fires just before the child neuron, and decreases it otherwise (see the sketch after this list).
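
A minimal sketch of such a window function, assuming an exponential STDP rule; the constants (A_PLUS, A_MINUS, TAU) and spike times are made-up illustrative values:

```python
import numpy as np

# Illustrative STDP window: dt = t_post - t_pre (ms).
# If the presynaptic (parent) neuron fires just before the postsynaptic (child) neuron
# (dt > 0), the synapse is strengthened; otherwise it is weakened.
A_PLUS, A_MINUS, TAU = 0.01, 0.012, 20.0   # assumed constants (arbitrary units / ms)

def stdp_window(dt_ms: float) -> float:
    """Weight change Δw as a function of the pre/post spike-time difference."""
    if dt_ms > 0:
        return A_PLUS * np.exp(-dt_ms / TAU)    # potentiation: pre fires before post
    else:
        return -A_MINUS * np.exp(dt_ms / TAU)   # depression: post fires before pre

# Example: apply the window to every synapse given the latest spike times.
pre_spikes = np.array([10.0, 12.0, 30.0])       # ms, parent neurons
post_spike = 15.0                               # ms, child neuron
weights = np.full(3, 0.5)
weights += np.array([stdp_window(post_spike - t) for t in pre_spikes])
```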

With deep learning we are looking at a static problem: there is a fixed probability distribution over the data, and we fit a model to that distribution.

Backpropagation is a useful algorithm and will not go away, because it solves the problem of finding a neural circuit subject to some constraints.
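
A toy illustration of that idea, assuming PyTorch: backpropagation searches for the weights of a small "circuit" that satisfies a constraint (here, computing XOR of its two inputs):

```python
import torch

# Constraint: the circuit's outputs should match XOR of the inputs.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

net = torch.nn.Sequential(torch.nn.Linear(2, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.05)

for step in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(net(X), y)  # how far the circuit is from the constraint
    loss.backward()                                 # backprop assigns blame to every weight
    opt.step()                                      # nudge the circuit toward satisfying the constraint
```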

For natural language modelling it has been shown empirically that very large datasets work: the model learns to predict the next word, at first through broad strokes and surface-level patterns. Once the language model becomes large enough, it understands characters, spacing, punctuation and words, and finally it learns the semantics and the facts.
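
A minimal sketch of the underlying objective, next-token prediction with a cross-entropy loss, assuming PyTorch and a toy character vocabulary (the text and model sizes are made up):

```python
import torch
import torch.nn as nn

# Toy language-modelling objective: each token is trained to predict the next one.
text = "hello world"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

emb = nn.Embedding(len(vocab), 16)
head = nn.Linear(16, len(vocab))
opt = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=1e-2)

inputs, targets = ids[:-1], ids[1:]     # token t is used to predict token t+1
for step in range(200):
    opt.zero_grad()
    logits = head(emb(inputs))          # (seq_len, vocab) scores for the next character
    loss = nn.functional.cross_entropy(logits, targets)
    loss.backward()
    opt.step()
```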

The Transformer is the most important advance in neural networks. It is a combination of multiple ideas, of which attention is the key one. The Transformer is designed so that it runs really well on fast GPUs. It is not recurrent, so it is shallower (less deep) and much easier to optimize.
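
A minimal sketch of that attention ingredient, scaled dot-product attention, assuming PyTorch (shapes are toy values). Note that it processes all positions of a sequence in parallel, with no recurrence, which is part of why it maps so well onto GPUs:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V — the key idea in the Transformer."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # how much each position attends to the others
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Toy usage: a batch of 2 sequences, 5 tokens each, 8-dimensional head.
q = k = v = torch.randn(2, 5, 8)
out = scaled_dot_product_attention(q, k, v)   # shape (2, 5, 8)
```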

After Transformers, to build AGI, research is going on in self-play and active learning.

GANs don't have a single mathematical cost function that they optimize by gradient descent. Instead there is a game between two networks, and training tries to find the equilibrium of that game.
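
A toy sketch of that game, assuming PyTorch: a generator and a discriminator optimize opposing objectives, so there is no single shared cost being minimized (the "real" data here is an assumed shifted Gaussian):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))   # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, 2) + 3.0                 # assumed "real" data: a shifted Gaussian
    fake = G(torch.randn(32, 8))

    # Discriminator move: tell real from fake.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    d_loss.backward()
    opt_d.step()

    # Generator move: fool the discriminator.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```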

Another example of deep learning models without a fixed cost function is reinforcement learning with self-play and surprise-driven actions.


Double Descent:

When we make a neural network larger it gets better, which runs contrary to classical statistical ideas. But there is a phenomenon called the double descent bump, described below.

Double descent occurs for all practical deep learning systems. Take a neural network and slowly increase its size while keeping the dataset size fixed. If you keep increasing the network size and don't do early stopping, performance first improves and then gets worse. The point where the model is at its worst is precisely the point at which it reaches zero training error (zero training loss); as you make it larger still, it starts to get better again. This is counter-intuitive, because we expect deep learning performance to improve monotonically with size.

The intuition is as follows:

"When we have a large data and a small model then small model is not sensitive to randomness/uncertainty in the training dataset. As the model gets large it achieves zero training error at approximately the point with the smallest norm in that subspace. At the point the dimensionality of the training data is equal to the dimensionality of the neural network model (one-to-one correspondence or degrees of freedom of dataset is same as degrees of freedom of model) at that point random fluctuation in the data worsens the performance (i.e. small changes in the data leads to noticeable changes in the model). But this double descent bump can be removed by regularization and early stopping."

If we have more data than parameters, or more parameters than data, the model will be insensitive to random changes in the dataset.
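
A rough numerical illustration of the bump, assuming NumPy and a toy random-features regression: np.linalg.lstsq returns the minimum-norm solution when there are more features than samples, and the test error typically spikes where the number of features matches the number of training samples (40 here), then falls again:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_in = 40, 1000, 10
w_true = rng.normal(size=d_in)

X_train = rng.normal(size=(n_train, d_in))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)   # noisy labels
X_test = rng.normal(size=(n_test, d_in))
y_test = X_test @ w_true

for n_features in [5, 10, 20, 40, 80, 160, 640]:
    # Random ReLU features: the "model size" we sweep around n_train = 40.
    W = rng.normal(size=(d_in, n_features))
    phi_train = np.maximum(X_train @ W, 0.0)
    phi_test = np.maximum(X_test @ W, 0.0)
    # lstsq returns the minimum-norm least-squares solution when underdetermined.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_features:4d} features -> test MSE {test_mse:.2f}")
```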

Overfitting: when the model is very sensitive to small, random, unimportant details of the training dataset.

Early stopping: we train the model while monitoring its validation performance, and at the point where the validation performance starts to get worse, we stop training (i.e. we decide the model is good enough).
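
A minimal early-stopping sketch on toy data, assuming PyTorch; the patience value and dataset are made up:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 10)
y = X @ torch.randn(10, 1) + 0.3 * torch.randn(200, 1)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_train), y_train)
    loss.backward()
    opt.step()

    with torch.no_grad():
        val_loss = nn.functional.mse_loss(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}  # keep best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # validation has stopped improving: stop training
            break

model.load_state_dict(best_state)    # roll back to the best checkpoint
```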


ChatGPT:

ChatGPT has become a watershed moment for organizations because all companies are inherently language-based companies. Whether it is text, video, audio or financial records, everything can be described as tokens which can be fed to large language models.

A good example of this came from training a language model on Amazon reviews: after a large amount of training, the model became an excellent classifier of sentiment. From just predicting the next word (token) in a sentence, the model started to understand the semantics of the sentence and could tell whether a review was positive or negative.

With the advancement of AI, the likeness of a particular person can exist as a separate bot, and that person will get a say, a cut, and licensing opportunities for the use of their likeness.