Most Popular
1500 questions
22
votes
2 answers
How to define states in reinforcement learning?
I am studying reinforcement learning and the variants of it. I am starting to get an understanding of how the algorithms work and how they apply to an MDP.
What I don't understand is the process of defining the states of the MDP. In most examples…

Andy
22
votes
1 answer
How does the (decoder-only) transformer architecture work?
How does the (decoder-only) transformer architecture, which is used in impressive models such as GPT-4, work?

Robin van Hoorn
22
votes
2 answers
Why would you implement the position-wise feed-forward network of the transformer with convolution layers?
The Transformer model introduced in "Attention is all you need" by Vaswani et al. incorporates a so-called position-wise feed-forward network (FFN):
In addition to attention sub-layers, each of the layers in our encoder
and decoder contains a…

Eli Korvigo
22
votes
1 answer
Has the Lovelace Test 2.0 been successfully used in an academic setting?
In October 2014, Dr. Mark Riedl published an approach to testing AI intelligence, called the "Lovelace Test 2.0", after being inspired by the original Lovelace Test (published in 2001). Mark believed that the original Lovelace Test would be…

Left SE On 10_6_19
22
votes
3 answers
Why doesn't Q-learning converge when using function approximation?
The tabular Q-learning algorithm is guaranteed to find the optimal $Q$ function, $Q^*$, provided the following conditions (the Robbins-Monro conditions) regarding the learning rate are satisfied:
$\sum_{t} \alpha_t(s, a) = \infty$
$\sum_{t}…

nbro
21
votes
5 answers
Why does Batch Normalization work?
Adding BatchNorm layers improves training time and makes the whole deep model more stable. That's an experimental fact that is widely used in machine learning practice.
My question is - why does it work?
The original (2015) paper motivated the…

Kostya
21
votes
3 answers
Is a dystopian surveillance state computationally possible?
This isn't really a conspiracy theory question. It is more of an inquiry into global computational power and data storage logistics.
Most recording instruments such as cameras and microphones are typically voluntary opt-in devices, in that,…

Harrison Tran
21
votes
2 answers
What is the difference between First-Visit Monte-Carlo and Every-Visit Monte-Carlo Policy Evaluation?
I came across these two algorithms, but I cannot understand the difference between them, either in terms of implementation or intuitively.
So, what difference does the second point in each of the slides refer to?
user9947
20
votes
1 answer
Why do you not see dropout layers on reinforcement learning examples?
I've been looking at reinforcement learning, and specifically playing around with creating my own environments to use with the OpenAI Gym AI. I am using agents from the stable_baselines project to test with it.
One thing I've noticed in virtually…

Matt Hamilton
20
votes
4 answers
Why do we need floats for using neural networks?
Is it possible to make a neural network that uses only integers by scaling the input and output of each function to [-INT_MAX, INT_MAX]? Are there any drawbacks?

elimohl
20
votes
3 answers
How are Artificial Neural Networks and the Biological Neural Networks similar and different?
I've heard multiple times that "Neural Networks are the best approximation we have to model the human brain", and I think it is commonly known that Neural Networks are modelled after our brain.
I strongly suspect that this model has been simplified,…

Andreas Storvik Strauman
20
votes
3 answers
How can we process the data from both the true distribution and the generator?
I'm struggling to understand the GAN loss function as provided in Understanding Generative Adversarial Networks (a blog post written by Daniel Seita).
In the standard cross-entropy loss, we have an output that has been run through a sigmoid function…

tryingtolearn
20
votes
2 answers
How do neural networks play chess?
I have been spending a few days trying to wrap my head around how and why neural networks are used to play chess.
Although I know very little about how the game of chess works, I can understand the following idea. Theoretically, we could make a…

stats_noob
20
votes
2 answers
Why does GPT-2 Exclude the Transformer Encoder?
After looking into transformers, BERT, and GPT-2, from what I understand, GPT-2 essentially uses only the decoder part of the original transformer architecture and uses masked self-attention that can only look at prior tokens.
Why does GPT-2 not…

Athena Wisdom
20
votes
2 answers
What is the "Hello World" problem of Reinforcement Learning?
As we all know, "Hello World" is usually the first program that any programmer learns/implements in any language/framework.
Aurélien Géron mentioned in his book that MNIST is often called the Hello World of Machine Learning. Is there any "Hello…

Arpit-Gole