After going over the results of some testing done by others (thanks realCool and FlyingCow), I'm quite confident that the spell power coefficients we use are correct. Yes, there are some discrepancies between the tooltips and actual in-game behavior, but I appear to have caught the last of those when I found the correct Shadow Bolt coefficient the other day.
I think a more likely reason for the discrepancies between our results and GC's internal numbers is simply the complexity of our action lists and conditionals, plus extraneous factors like the specific gear, trinkets, glyphs, and talents used by our profiles, when we pop Bloodlust, and so on. To explore what sort of effect these things might have, I made extremely simplified versions of the three pre-raid profiles: no glyphs, no trinket usage, no potion usage, no Bloodlust, no Doomguard, etc. I also simplified the action lists and removed any tricks that squeeze out extra DPS or synchronize cooldowns with dots. Finally, I made each profile use Grimoire of Supremacy for a nice static pet contribution.
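For anyone who wants to try the same kind of stripping on their own profiles, here is a rough sketch of the idea in Python. It is not how I actually built the simplified profiles (those were edited by hand). It assumes a profile is plain text with one `key=value` option per line, and the prefixes in `DROP_PREFIXES` and the sample profile are purely hypothetical placeholders, so adjust them to whatever your profiles really contain.

```python
# Sketch only: drop unwanted option lines from a text profile.
# Assumption: one "key=value" option per line. The prefixes below are
# hypothetical examples, not a verified list of real profile options.
DROP_PREFIXES = ("glyphs=", "trinket1=", "trinket2=")


def simplify_profile(text, drop_prefixes=DROP_PREFIXES):
    """Return the profile text without lines that start with a dropped prefix."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip any option we want removed from the simplified profile.
        if any(stripped.startswith(prefix) for prefix in drop_prefixes):
            continue
        kept.append(line)
    return "\n".join(kept)


# Hypothetical sample profile, for illustration only.
sample = "warlock=Example\nglyphs=a/b\ntrinket1=foo\nactions=shadow_bolt"
simplified = simplify_profile(sample)
```

This only removes whole option lines. Simplifying the action list itself (dropping conditionals, cooldown syncing and so on) still has to be done by hand.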
Basically, these profiles run through a perfectly naive version of each spec's basic rotation, and I've removed any external influences that might provide synergies GC's internal tools don't take into account. The end result is very interesting:
These results perfectly match the relative differences between the three specs that GC described in his post. That leads to my theory for why their internal numbers don't match ours: I think their tools simplify the situation too much. It's only a theory, of course. I have no idea what sort of tools they have at their disposal or how they use them, but it's a possibility worth keeping in mind.