I have no experience of serious expert system development, but I've played with some related algorithms.
Bayes' theorem is interesting for when the probabilities you're given are the wrong way round - you know the probability of x given y when what you need is the probability of y given x, basically.
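To make that concrete, here's a toy sketch in Python. The numbers are invented purely for illustration - the classic rare-condition-and-imperfect-test setup, nothing from a real application:

def bayes(p_x_given_y, p_y, p_x):
    # P(y|x) = P(x|y) * P(y) / P(x)
    return p_x_given_y * p_y / p_x

# Invented numbers: a test for a rare condition.
p_cond = 0.01                 # P(y): prior probability of the condition
p_pos_given_cond = 0.95       # P(x|y): chance of a positive if you have it
p_pos_given_healthy = 0.05    # P(x|not y): false positive rate

# P(x) by the law of total probability.
p_pos = p_pos_given_cond * p_cond + p_pos_given_healthy * (1 - p_cond)

print(bayes(p_pos_given_cond, p_cond, p_pos))  # about 0.16

The information you start with (a "95% accurate" test) is the conditional the wrong way round; once you flip it, a positive result only means about a 16% chance of actually having the condition.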
ID3 is an easy-to-understand way of deciding which question to ask first, so you can turn a huge table of facts into a decision tree. Strictly, I don't think ID3 can cope with an exclusive-or kind of decision, but it's not that hard to adapt.
Basically, it uses "entropy" calculations - for each question, you take the amount of information you'd gain from each possible answer and weight it by the probability of that answer (the probability feeds into both the information calculation and the weight). In an either-or choice, a very unlikely answer gives you a lot of information - if you get that answer. Once you weight the answers, the questions that give the most information on average tend to be the ones with balanced answer probabilities. One issue is that a question with a lot of (well-balanced) answers has more entropy than one with only a few possible answers. That actually works well for my multiple dispatch thing, but it might mean a real expert system would tend to ask awkward questions first rather than simple ones.
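Here's a minimal sketch of that calculation in Python - the function names are my own, not from any library. It also demonstrates the exclusive-or problem I mentioned: on XOR data, neither question has any information gain on its own, so a greedy gain-based choice can't see that asking both is worthwhile.

import math
from collections import Counter

def entropy(outcomes):
    # Shannon entropy in bits: expected information from seeing one outcome.
    counts = Counter(outcomes)
    total = len(outcomes)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, question, target):
    # Entropy of the target minus the weighted average entropy left after
    # hearing the answer - the answer's probability supplies both the
    # weight and (inside entropy) the information calculation.
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for answer in {row[question] for row in rows}:
        subset = [row[target] for row in rows if row[question] == answer]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return base - remainder

# The exclusive-or case: each question alone tells you nothing.
xor_rows = [
    {"a": 0, "b": 0, "out": 0},
    {"a": 0, "b": 1, "out": 1},
    {"a": 1, "b": 0, "out": 1},
    {"a": 1, "b": 1, "out": 0},
]
print(information_gain(xor_rows, "a", "out"))  # 0.0
print(information_gain(xor_rows, "b", "out"))  # 0.0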
I use a variation on this theme in a code generator utility that handles multiple dispatch. Yes, it's overkill, but I'd already written the code out of interest, so it made sense to use it.
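For a flavour of the idea, here's a toy Python sketch - the names are invented and it's nothing like the tool's actual generated code, which also picks the question order using the entropy calculation above:

class Asteroid: pass
class Ship: pass

def hit_asteroid_ship(a, b): return "asteroid hits ship"
def hit_ship_ship(a, b):     return "ships collide"

# The "table of facts": (run-time type of arg 0, run-time type of arg 1)
# maps straight to a conclusion - the implementation to call.
DISPATCH = {
    (Asteroid, Ship): hit_asteroid_ship,
    (Ship, Ship): hit_ship_ship,
}

def collide(a, b):
    # In effect this asks "what run-time type is parameter 0?" then
    # "what run-time type is parameter 1?" - a tiny decision tree.
    return DISPATCH[(type(a), type(b))](a, b)

print(collide(Asteroid(), Ship()))  # asteroid hits ship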
The thing about multiple dispatch functions, though, is that there are only a few "what run-time type is that parameter?" questions to consider. In real life, ID3 is supposed to be a bit slow, so there are plenty of alternatives. The classic successor is called C4.5, I think. I never spent the time to understand it, though.
ID3 and similar decision-tree-building algorithms are often called "rule induction" algorithms. I think I first saw ID3 in an ancient issue of PC World, where the example was identifying a coin by its properties (round or polygon, silver or bronze, etc.).
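Roughly what that looks like as a complete ID3 run, with invented coin data (don't trust it for real coins) and a self-contained repeat of the entropy code:

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((n / len(labels)) * math.log2(n / len(labels))
                for n in counts.values())

def gain(rows, attr):
    # Information gained about the coin by asking one question.
    base = entropy([r["coin"] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r["coin"] for r in rows if r[attr] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return base - remainder

def build_tree(rows, attrs):
    # Core of ID3: pick the highest-gain question, split, recurse.
    labels = [r["coin"] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                 # one coin left: a conclusion
    best = max(attrs, key=lambda a: gain(rows, a))
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, [a for a in attrs if a != best])
    return (best, branches)

coins = [  # invented data, not the magazine's actual table
    {"shape": "round",   "colour": "bronze", "size": "small", "coin": "1p"},
    {"shape": "round",   "colour": "bronze", "size": "large", "coin": "2p"},
    {"shape": "round",   "colour": "silver", "size": "small", "coin": "5p"},
    {"shape": "round",   "colour": "silver", "size": "large", "coin": "10p"},
    {"shape": "polygon", "colour": "silver", "size": "small", "coin": "20p"},
    {"shape": "polygon", "colour": "silver", "size": "large", "coin": "50p"},
]
print(build_tree(coins, ["shape", "colour", "size"]))

On this table it asks about size first, because the even 3/3 split gives the best weighted gain - a nice little illustration of the "balanced answers win" point above.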
Of course, algorithms and snippets of probability theory don't add up to much in themselves. Even if you know the basic algorithms, that seems to be only the first small step towards understanding how to apply them. For example, my multiple dispatch thing is very neat - the table of questions (parameters), answers (run-time types) and conclusions (which implementation) is fully defined by the rules of the tool. I don't have to worry about things like asking too many questions and thereby parroting specific training examples rather than learning principles - overfitting, in other words. I'm very glad I don't have that subjective issue to worry about.