I believe the answer the author has in mind is something more along the lines of binary long division, which can be implemented "efficiently" with shifts, subtractions and comparisons in $O(W)$ steps, where $W$ is the machine word length in bits.
Note that bit shifts, which could be seen as a way to reintroduce multiplication through the back door, are not actually needed to obtain an algorithm linear in $W$: they can be replaced by iterated doubling of the divisor,
const int W = 8*sizeof(int);
int p2w[W];     // p2w[w] will hold p * 2^w, where p is the divisor
p2w[0] = p;
for (int w = 0; w < W-1; ++w)
    p2w[w+1] = p2w[w] + p2w[w]; // doubling by self-addition, no shift
which generates all the necessary elements of the sequence $\{2^w p\}$ for $0 \le w < W$, also in linear time, at the cost of $O(W)$ extra memory.
We can then compute the quotient $q=a/p$ using no more than $O(W)$ operations as follows:
int q=0;
for (int w = W-1; w >= 0; --w) {
    q += q; // shift the quotient left, again by self-addition
    // binary long division: bit w of the quotient is 1
    // iff p * 2^w still fits in the remainder
    if (a >= p2w[w]) {  // >= rather than >, so exact multiples divide correctly
        a -= p2w[w];
        ++q;
    }
}
This analysis ignores the complexity and propagation delay of the digital circuits implemented in hardware, and assumes that addition and the other instructions run in constant time. A more sophisticated analysis could take the combined algorithmic and circuit complexity into account by counting the total number of gates and the cycles needed to execute the solution, similar in spirit to the PDP (power-delay product) used in electronic engineering.