Why are all the states 0?
Oversimplified, there are three main components to any quantum circuit: the input, the quantum function, and the output. QML research will usually fall into two buckets. In the first, we take a fixed quantum function and feed it different inputs, and see how the output varies. In the second, we take a fixed input and play around with the quantum function and see how the output varies. So when you see the circuit initialized with all 0's, you can think of the input state as the 'controlled variable' and the quantum function as the 'independent variable.'
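As a rough illustration of the second bucket, here is a minimal single-qubit sketch in plain NumPy. The input is always |0>, and only the quantum function changes. The choice of a single RY rotation and the names `ry` and `output_probs` are assumptions made for this example, not part of any particular QML library.

```python
import numpy as np

def ry(angle):
    """Matrix of a single-qubit RY rotation by `angle` radians."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]])

# Fixed input: the qubit always starts in |0>.
zero = np.array([1.0, 0.0])

def output_probs(theta):
    """Z-basis outcome probabilities after applying RY(theta) to |0>."""
    state = ry(theta) @ zero
    return state ** 2  # amplitudes are real here, so squaring is enough

# Same input, three different "quantum functions":
p_identity = output_probs(0.0)        # all weight on outcome 0
p_half = output_probs(np.pi / 2)      # roughly even split between 0 and 1
p_flip = output_probs(np.pi)          # essentially all weight on outcome 1
```

Because the input never changes, any difference between the three distributions comes from the rotation angle alone.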
What are we measuring?
On each training iteration, we measure each qubit's state in the chosen measurement basis (usually the computational, or Z, basis). We then interpret the combined output of all qubits as a classical piece of information. Here, this output is the QNN's 'prediction' associated with the given (feature-mapped) input data.
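To make the measurement step concrete, here is a small NumPy sketch that samples Z-basis bitstrings from a statevector and turns them into a single class label. The parity-based majority vote is only one possible readout rule, chosen here for illustration; `sample_bitstrings` and `parity_prediction` are hypothetical helper names.

```python
import numpy as np

def sample_bitstrings(state, shots, rng):
    """Sample Z-basis measurement outcomes from a statevector."""
    probs = np.abs(state) ** 2
    probs = probs / probs.sum()          # guard against rounding error
    n_qubits = int(np.log2(len(state)))
    outcomes = rng.choice(len(state), size=shots, p=probs)
    return [format(o, f"0{n_qubits}b") for o in outcomes]

def parity_prediction(bitstrings):
    """Majority vote over the parity of each measured bitstring.

    This readout rule is an assumption for the example; other
    mappings from bitstrings to labels are equally valid.
    """
    parities = [b.count("1") % 2 for b in bitstrings]
    return int(sum(parities) > len(parities) / 2)

rng = np.random.default_rng(0)

# Two-qubit Bell state (|00> + |11>)/sqrt(2): only '00' and '11' can occur,
# so every shot has even parity.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
shots = sample_bitstrings(bell, 100, rng)
prediction = parity_prediction(shots)
```

Each shot yields one bitstring. The prediction only emerges once many shots are combined.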
What exactly are we optimizing?
The body of a QNN consists of a feature-map component and a variational component. The former is where we 'encode' the input data. The latter is where we store our trained parameters, which often take the form of rotation angles for X, Y, and/or Z rotations. On each training iteration, we evaluate the circuit many times for each input data vector, all using the same variational parameters. We then construct a distribution from the circuit's outputs and use a chosen cost function to determine how well those parameters represent the input/output relationship for this circuit design. At the end of the training iteration, the parameters are adjusted (trained) in the direction that reduces this cost function. This can be done through gradient descent (as you mentioned), SPSA, or many other classical algorithms.
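The training loop described above can be sketched for a toy one-qubit model. Everything in it is an assumption made for illustration: the feature map is a single RY(x), the variational part is a single RY(theta), the cost is mean squared error on the Z expectation value, and the gradient comes from the parameter-shift rule. For simplicity it uses exact expectation values rather than estimating them from repeated shots as real hardware would.

```python
import numpy as np

def ry(angle):
    """Matrix of a single-qubit RY rotation by `angle` radians."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]])

def expect_z(x, theta):
    """<Z> after feature map RY(x) and variational gate RY(theta) on |0>."""
    state = ry(theta) @ ry(x) @ np.array([1.0, 0.0])
    return state[0] ** 2 - state[1] ** 2

def cost(xs, ys, theta):
    """Mean squared error between circuit outputs and target labels."""
    return np.mean([(expect_z(x, theta) - y) ** 2 for x, y in zip(xs, ys)])

def cost_gradient(xs, ys, theta):
    """Gradient of the cost, using the parameter-shift rule for d<Z>/dtheta."""
    grads = []
    for x, y in zip(xs, ys):
        shift = np.pi / 2
        d_expect = 0.5 * (expect_z(x, theta + shift) - expect_z(x, theta - shift))
        grads.append(2 * (expect_z(x, theta) - y) * d_expect)
    return np.mean(grads)

# Toy data set: labels generated by the same circuit with theta = 0.7.
xs = np.array([0.0, 0.5, 1.0])
ys = np.array([expect_z(x, 0.7) for x in xs])

theta = 0.0                      # initial variational parameter
learning_rate = 0.2
initial_cost = cost(xs, ys, theta)
for _ in range(200):             # one parameter update per training iteration
    theta -= learning_rate * cost_gradient(xs, ys, theta)
final_cost = cost(xs, ys, theta)
```

In this toy setup the loop should recover a value of `theta` close to 0.7. Swapping `cost_gradient` for an SPSA-style estimate would change only the update step, not the overall structure.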