Just something to go with the tracking via stacks, sometimes I notice that I have a bugged stack of ore which, when prospected, doesn't pop up the loot window. Usually I have to break down the stack and remake it to fix the problem.
More specifically, given a bugged stack of ore, when you split it one of the resulting daughter stacks will be bugged (B), and the other will be non-bugged (N). I believe it's always the "new" stack (i.e. new inventory slot) that is non-bugged.
If you then remake the original stack by adding N back to B, you still get a bugged stack.
If you remake it by adding B to N, you get a non-bugged stack.
This makes perfect sense if the bug stays with the stack ID. When you split, you create a new unbugged stack (with new ID). Adding that one back to B just updates the counter on B and doesn't get rid of the bug. However, if you add B to N, it updates the counter on N, and you get an unbugged full stack back again.
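To make the stack-ID theory concrete, here is a toy Python sketch. Everything in it (the Stack class, the bugged flag, the split and merge helpers) is invented for illustration and says nothing about how the real item database works. It only shows that if the flag lives on the destination stack's ID, the direction of the merge decides whether the bug survives.

```python
import itertools

_next_id = itertools.count(1)  # hypothetical stack ID generator


class Stack:
    """Toy item stack: the 'bugged' flag belongs to the stack ID, not to the ore."""

    def __init__(self, count, bugged=False):
        self.stack_id = next(_next_id)
        self.count = count
        self.bugged = bugged


def split(stack, amount):
    """Move `amount` items into a brand-new stack (new ID, assumed never bugged)."""
    if not 0 < amount < stack.count:
        raise ValueError("amount must be between 1 and count - 1")
    stack.count -= amount
    return Stack(amount)


def merge(source, dest):
    """Add `source` onto `dest`: only dest's counter changes, its ID and flag stay."""
    dest.count += source.count
    source.count = 0
    return dest


b = Stack(20, bugged=True)   # the bugged stack B
n = split(b, 10)             # the new stack N
print(merge(n, b).bugged)    # N added to B: prints True

b = Stack(20, bugged=True)
n = split(b, 10)
print(merge(b, n).bugged)    # B added to N: prints False
```

Under those assumptions the two merge directions reproduce exactly the behaviour described above.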
Just to confirm, this is definitely not the case anymore. During the first few days of TBC, Eastern Hellfire Peninsula would crash and drop people out regularly, whilst Western Hellfire Peninsula would remain completely stable - we advised all guildmates to move to that part of the zone to avoid the mass crashing.
I suspect you are both actually correct here... A lot of my friends and coworkers were in the original closed F&F for wow. At the time I worked on the software engineering team at a large computer company, so a significant number of my coworkers were very curious about how the servers were architected, and had the chance to see a lot of failure modes that were fixed before the game was launched.
One of them was initially convinced different zones were on different servers due to glitches he sometimes saw when he was kiting mobs between them. Later on, when he saw continent-wide crashes, he concluded that zones were probably separate processes using IPC between each other on the same piece of hardware.
Engineering it like that would make a lot of sense. It solves certain scalability issues. It also allows for a degree of fault isolation. It has certain development advantages as well (devs may be able to get away with smaller machines if they only load the specific zone servers they are working on).
This is also consistent with some behaviour we have seen in live. We know continents can crash as a whole. On the other hand, I remember back when the honor system was first rolled out I started hitting heavy lag the moment I would walk into Hillsbrad. So clearly there is some level of per zone distribution going on.
In the case of outlands, it would not shock me if HFP is actually 2 or 3 zones on the backend.
I don't see how that explains model hacks, however. For example, back on one PTR, Hunters were able to fly by model-swapping their trap models with staircases. They'd walk up their traps, lay another one, and keep walking. Also, another example would be the client-side model hack that allowed people to skip all the way to C'thun.
There was another model that people replaced to walk through walls but I can't remember what it was at the moment.
I see two possibilities for walking through walls.
1. You actually hack your client to allow you to walk through anything, like the good ol' days of "Doom."
2. (Probably the case in AQ40 and Alterac Valley) The developer who built the terrain thought he painted a solid wall, but accidentally left a small gap. A gnome legitimately fits through this gap, no exploit involved... it's just a mistake.
I would expect fixing this to require changing the zone's map. Since they didn't fix it, I'm guessing something about Blizzard's patching process makes this more difficult to send to us than it sounds.
I've walked through a meeting stone before it loaded on clientside for me. I sat inside it for a bit after it had loaded before I proceeded to walk out of it. It has happened other times as well, and with other objects I believe, so some of it certainly seems to be client side.
Collision detection with doodads appears to be mostly client-side. On beta, the Draenei level ~15 quest that required you to go to the top of an island and destroy the statue of Azshara was bugged. No one could get close enough to click on the statue, even when standing on top of it. Clearly somewhere the "use range" (or whatever) was set far too low. I replaced the model of that statue (it is in the Blackfathom Deeps section of the MPQ, go figure) with a model of a small plant. I proceeded to reload the game, walk to the middle of where the statue would be, and gear-click the plant.
No, it was definitely server related. There was a line you could draw roughly down the middle of the map, and once you crossed that line, loot lag and server lag would immediately vanish. It didn't matter how many people were in your particular vicinity at the time; there was a very marked East / West HFP divide.
Well that definitely hints in that direction. Let me try an alternate explanation out for size, though. We know that players are only allowed to see a limited amount of the world at a time. (You have a horizon beyond which you cannot see/detect/interact with anything.) It's possible that the database queries and server code are written with this in mind.
What this means is that if you have, say, 100 (or more) users in close proximity to you, then actions which require consistency propagation to the 100 (or so) players nearby will cause lag and locking issues for these people. People far away, even though they're in the same server instance, will not experience this behavior because they're not sharing state with these other players and are not subject to their DB requests being locked as frequently (or having to propagate data back and forth between server and client about relative state and so forth.)
This fits neatly, actually, with your example. In the early days of the expansion, Eastern HFP is where all the early quests take you. Most people don't generally head to western HFP until you've finished a number of quests (assuming typical questing behavior) and most of the activity in Western HFP isn't available to you until you've completed quests to the east (and/or gained a level or two.) This even explains the "line" phenomenon. You're crossing the consistency horizon formed by the aggregate block of people in the east.
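A rough Python sketch of that "consistency horizon" idea, purely as an illustration. The radius, the player names and the coordinates are all made up, and nothing here is known about the real server code. The point is only that the cost of an action scales with how many players sit inside your horizon, not with how many are in the zone.

```python
import math

HORIZON = 100.0  # assumed visibility radius in world units (made-up number)


def recipients(actor, players, horizon=HORIZON):
    """Return the names of players who must be told about something `actor` did."""
    ax, ay = players[actor]
    return sorted(
        name
        for name, (x, y) in players.items()
        if name != actor and math.hypot(x - ax, y - ay) <= horizon
    )


# Hypothetical positions: a crowd in the east, one player far to the west.
players = {
    "east_1": (0.0, 0.0),
    "east_2": (30.0, 40.0),
    "east_3": (60.0, 70.0),
    "west_1": (-500.0, 0.0),
}

print(recipients("east_1", players))  # ['east_2', 'east_3']
print(recipients("west_1", players))  # []
```

In this toy model the western player shares no state with the eastern crowd, so walking west past the edge of the crowd's horizon would look like crossing a line.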
Indeed, this is even supported to some degree by in-game behavior. When one teleports or hearths to a location that is on the same continent one is already on there is no zoning behavior. But there IS a bit of lag as your client syncs up with the world around you. You can even test this by running to (say) the BG Battlemasters in Org and then casting Teleport: Orgrimmar. It's doubtful Orgrimmar is split among servers, but you still get the brief sync-up lag as you appear at the port-in spot.
Of course, it's disingenuous to talk of a single server. Each zone and area will be run on multiple servers simultaneously, and what could have happened was that the zone was split in two for the purposes of server architecture, with the western half put on a different subset of servers from the eastern part in order to ease the strain on the first day. We know, since flying was introduced, that there is no technical reason for having only a few points of entry between zones; it's just a design choice. So there's no reason that HFP couldn't be run on separate server sub-clusters in order to improve performance.
While it's a plausible theory, I just don't believe it's the case. If it was, I'd expect zoning into instances to be a much less heavyweight operation. If they have a way to do a lightweight, in real time transfer of session information from one server to another as you run around the world (and expect near real-time response from the server while you do so) then why would zoning from one instance to another take as long as it does? (and I consider EK, Kalimdor, and such as just mega-instances, which I suspect is precisely what they are)
Unfortunately, this is one of those areas where without insider information you can't answer the question one way or the other - it's just inference based on failure modes and typical behavior. More than likely neither theory is completely correct -- I wouldn't be surprised if Blizzard is doing something more clever still than anything we've postulated here thus far.
Last edited by Kerruul : 02/22/07 at 3:33 PM.
Reason: De-mangling and clarifications.
There's a good discussion of the Halo 2 network model and how they moved from a synchronous LAN model with a master machine to an Internet/WAN model where each machine has a view of what's happening that might not necessarily be entirely the same as another.
There's an interesting bit relating how the machines decide which information to send to each other, e.g. stuff in front of you is more important than behind, a grenade exploding is more important than shrapnel marks on a wall, etc. The WoW servers probably make the same kinds of decisions when sending stuff to clients.
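Here is a small Python sketch of that kind of prioritisation, as a guess at the general shape rather than anything taken from Halo 2 or WoW. The event weights, the facing bonus and the distance falloff are all invented numbers. Each pending event gets a score, and only the top few fit into the per-tick budget.

```python
import heapq
import math

# Made-up importance weights per event type.
EVENT_WEIGHT = {"grenade_explosion": 10.0, "player_move": 5.0, "wall_decal": 0.5}


def priority(viewer_pos, viewer_facing, event):
    """Higher score = send sooner. `viewer_facing` is a unit vector (fx, fy)."""
    dx = event["pos"][0] - viewer_pos[0]
    dy = event["pos"][1] - viewer_pos[1]
    dist = math.hypot(dx, dy) or 1e-9
    # Cosine of the angle between facing and the direction to the event:
    # 1 when directly in front, -1 when directly behind.
    cos_angle = (dx * viewer_facing[0] + dy * viewer_facing[1]) / dist
    facing_factor = 1.0 + 0.5 * cos_angle  # 1.5 in front, 0.5 behind
    return EVENT_WEIGHT[event["kind"]] * facing_factor / (1.0 + dist)


def pick_updates(viewer_pos, viewer_facing, events, budget):
    """Return the `budget` highest-priority events for this viewer."""
    return heapq.nlargest(
        budget, events, key=lambda e: priority(viewer_pos, viewer_facing, e)
    )
```

With a viewer at the origin facing along +x, an explosion ten units in front outranks the same explosion ten units behind, and both outrank a decal on a nearby wall.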
On the "what do they do in weekly maintenance" question, I've heard sysadmins in the past advocate rebooting all your machines every week, regardless of whether you think they need it. It restores everything to a known state, flushes any memory leaks (I seriously doubt WoW has any major leaks, they'd be screwed by anything other than a tiny leak) and generally clears out cobwebs. There's been plenty of times in my career where we've deployed some new piece of software, restarted a server in doing so, and discovered some kind of config change that hadn't taken effect yet on that server. Result: new software looks like it punked out, when in fact someone had just left a time bomb in the system. Weekly reboots mitigate this somewhat.
Realizing these aspects of server structure makes it all the more amusing when the General forums whine about "fix my server" or "upgrade my slow realm"... It's a lot more involved than sitting down at one keyboard and pressing Control-Alt-Delete. Maintenance on "a realm" involves checking the software configuration and wiring for 4-5 boxes... It's no wonder upgrading the hardware last year required 24-hour downtimes - a swarm of technicians had to reorganize something like 40 machines per hosting site, ensuring that each one was configured for the correct function and wired to its sister boxes!
I have no way to say how reliable this information is, but a SAN guy I know claimed each realm essentially used 24 servers. I just don't see that configuration working with the amount of realms they now support.
On the "what do they do in weekly maintenance" question, I've heard sysadmins in the past advocate rebooting all your machines every week, regardless of whether you think they need it. It restores everything to a known state, flushes any memory leaks (I seriously doubt WoW has any major leaks, they'd be screwed by anything other than a tiny leak) and generally clears out cobwebs. There's been plenty of times in my career where we've deployed some new piece of software, restarted a server in doing so, and discovered some kind of config change that hadn't taken effect yet on that server. Result: new software looks like it punked out, when in fact someone had just left a time bomb in the system. Weekly reboots mitigate this somewhat.
Back when I ran AIX machines in an enterprise environment I never advocated shutting them down. It all depends on the applications and the OS you run. I did however schedule my Server 2000 boxes for weekly reboots.
The WoW Tuesday maintenance schedule is connected to the original issues the game had two years ago. Random crashes, massive lag, loot lag and a variety of other issues grew between reboots. Now that it seems most systems scale better with load and those old pesky problems are resolved, the systems may not always need the restart.
I've moved past daily heads down tech work, so YMMV
Storage capacity isn't the issue, that is a "relatively simple" problem (It's not really simple at all but it is a problem that has been solved and documented thoroughly, see google).
IO and raw processing power are the commodities that will be in short supply. You can address this in a variety of different ways. From the sheer number of machines blizzard has purchased (and the relatively modest specs) they are probably using some form of clustering or other means of virtualization.
Without knowing details about their server software implementation it is pretty much impossible to say how many actual machines make up each server.
For reference though, a couple of years ago I did some work for the company hosting Horizon's 2 beta servers. They were using the traditional idea of "zone" servers that EQ used way back. Each geographic area was associated with specific machines, all of which shared access to a couple of load-balanced SQL servers. They had about 6 racks filled floor to ceiling with 1/2U boxes and a few larger SQL machines. That comes out to about 100 machines for 2 servers. Granted, that setup is rather wasteful as you are unable to redistribute resources, but it was also serving far fewer people than your typical WoW server.
While it's a plausible theory, I just don't believe it's the case. If it was, I'd expect zoning into instances to be a much less heavyweight operation. If they have a way to do a lightweight, in real time transfer of session information from one server to another as you run around the world (and expect near real-time response from the server while you do so) then why would zoning from one instance to another take as long as it does? (and I consider EK, Kalimdor, and such as just mega-instances, which I suspect is precisely what they are)
It seems to me that 99% of the work that needs to be done is client side. You have to load terrain and objects that have absolutely no overlap with where you currently are. You aren't gradually moving to a new area like you do out in the world, you are just suddenly in a new place. Then the server has to tell you where everything is, what everything is, and you have to load it all.
Server side, I would wager there isn't much difference from hellfire->zangarmarsh vs hellfire->ramparts.
That wouldn't explain why when one continent crashes, another is left untouched. Furthermore, what happens when you hearth from Cenarion Hold in Silithus to Orgrimmar? Does your computer lag a bit in loading the massive data in Org? Yep. Do you get a loading screen? Nope. Something deeper is going on server-side that is transferring your character from one process or one machine to another.
The server knows next (probably) to nothing about the layout of the world, where there are stairs and where there isn't.
Generally that's to be expected. But scenarios like mobs that don't aggro because you're standing a floor above them (but still within aggro range Z-axis wise), LOS pulling and other abilities that are LOSable do heavily indicate that the server does have a (very very) stripped geometry view of the world to do certain LOS checks against. The only other way would be having part of the mob AI handled by a client, which seems like a very bad idea and would bring up yet more sync issues to deal with.
Just as an example of why the multiple-servers-per-continent theory may make sense: way back in the days of EQ, the designers once gave an interview discussing the architecture of the game. They were very adamant that even *one* of EQ's old zones couldn't be run on a single server - each zone, for example the Temple of Veeshan, was spread across multiple servers in a small cluster. When they needed to reduce load on that zone, they moved the zone onto a set of servers with zones that were much lower in population.
Whilst server technology has obviously improved greatly since then, so has the level of detail and information being stored and manipulated at high speed on it. I really doubt that there is just one server for each continent. That would go against the previous solutions from other companies, and it would also mean either that the crash rate for servers would be astronomically higher than it is now, or that Blizzard owns the world's most reliable servers ever, as there's no way you could keep a single server going for weeks at a time without it keeling over at some point. But one server dropping out of the cluster that, say, runs Kalimdor would be a lot easier to manage, and would just cause slight amounts of lag across all zones as the other servers took up the slack.
But as everyone says, this really is just fun conjecture right now. All I would say is that, regarding the WHFP & EHFP issue, it didn't seem to have anything to do with the population around you. I played on the very first day, and although many areas of eastern HFP were extremely crowded, the very northern area, near the swamp and the terrorfiends, was remarkably empty - everyone was doing the Spinebreaker Hold quests and killing fel orcs. However, despite there being lots of wide open space, plenty of spawned mobs and really a very empty environment, the zone still lagged. But as soon as you crossed that line down the middle of the zone, it vanished - even though there were more people in the west of the zone than there were in the north at the time. (Judging, obviously, from what I could see on screen and what everyone I knew was doing at the time.) Also, the lag around Spinebreaker Hold, which must have had the vast majority of players in it on the first day, was exactly the same as in the much more deserted swamp area to the north, despite massive differences in population.
Server side, I would wager there isn't much difference from hellfire->zangarmarsh vs hellfire->ramparts.
There are several indications that a subzone to subzone transfer (the one where you see the new area name in the middle of your screen all of a sudden) is MUCH lighter than an instance transfer. (I'm assuming subzones do share either memory or some storage, but I have absolutely no doubt that Blizzard can split a continent's _sub_zones across multiple physical machines also.)
What happens when you zone from one instance to another is that your character and all his currently equipped slots and buffs are written to the database (a very, very heavy operation), and then read out again by the target instance server. You can usually observe this behaviour just before a realm's DB provider dies. Zoning with the zeppelin can take ages.
The most definite indicator supporting this assumed process is the "save weapon enchants during zoning" change that happened sometime before 2.0 and that they mysteriously had to reverse in an emergency patch, giving the reason that they had underestimated the impact on their hardware. Simply handing a small weapon enchant parameter directly from instance server to instance server would be much easier than adjusting your database structure to carry such new information. (Altering tables for minor patches on big databases isn't as trivial as it might appear to mysql users.)
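A toy Python sketch of the process assumed above, with a plain dict standing in for the database. The schema (SAVED_FIELDS), the field names and the character are all hypothetical. It shows why anything outside the saved schema, such as a temporary weapon enchant, would be lost on zoning unless the table itself is extended.

```python
import json

# Hypothetical saved schema; not Blizzard's actual table layout.
SAVED_FIELDS = ("name", "gear", "buffs")

database = {}  # stand-in for the realm's character database


def leave_instance(character):
    """Source server writes only the schema fields before releasing the character."""
    row = {field: character[field] for field in SAVED_FIELDS}
    database[character["name"]] = json.dumps(row)


def enter_instance(name):
    """Target server rebuilds the session from the database row alone."""
    return json.loads(database[name])


hero = {
    "name": "Example",
    "gear": {"main_hand": "Staff"},
    "buffs": ["Arcane Intellect"],
    "temp_weapon_enchant": "Wizard Oil",  # lives only in the source server's memory
}

leave_instance(hero)
restored = enter_instance("Example")
print("temp_weapon_enchant" in restored)  # False
```

Keeping the enchant across the transfer in this model means adding a field to SAVED_FIELDS, which is the schema change (and extra write load) speculated about above.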
Last edited by spinal : 02/23/07 at 6:10 AM.
Reason: spelling
Generally that's to be expected. But scenarios like mobs that don't aggro because you're standing a floor above them (but still within aggro range Z-axis wise), LOS pulling and other abilities that are LOSable do heavily indicate that the server does have a (very very) stripped geometry view of the world to do certain LOS checks against. The only other way would be having part of the mob AI handled by a client, which seems like a very bad idea and would bring up yet more sync issues to deal with.
Don't forget a floor is not an object. An object is something you can click, or a plant, something extra that they added. These are all purely client side. Generally if you can interact with it, it's an object; if it's there to fill out something, it's an object. Stuff like trees, carts, player models etc. are all objects. Only mobs/players are really of any concern to the server. Everything else is most likely client side.
Standing above a mob on a floor is fine, it will not allow the mob to aggro through the floor, just as it (should) not aggro through the wall. It's also possible that the floor is an object with a "can't aggro through this" flag.
As far as Outland being down while the rest is working fine, this to me suggests there are 4 distinct, separate server clusters: Outland, the 2 old world continents and the instance servers. Any one of those 4 can be "down" without affecting the remaining 3. When a player passes between them, it's a much bigger move, hence the loading screen. A player moving around Outland remains in the same cluster, so he never has to go to a loading screen; the data is just passed from server to server seamlessly. However, as soon as he enters an instance, the data has to be sent to another cluster.
I have designed maps for several games. Even well known ones =). The way it usually worked is we drew the ground first. Once that is done you add your objects, the stuff that brings it to life, trees etc. Then you add your collision boxes, so if you hit one, you get stopped/crash. In our games, as far as I am aware, the server (even in multiplayer) was not aware of every collision box. It was 90% done client side.
More complicated games with dedicated, industry-run servers (mostly MMOs) would definitely have the server be somewhat aware of the collision boxes, as it would be very easy to exploit if it was all done client side. However, no way is it aware of everything. I assume the staircase bug with hunters was an easy fix that just required that specific stair model to be handled by the server and not the client anymore.
*edit*
When I say we drew the ground, this is the base level of everything. Sometimes the ground would also be a castle, for example. Textures etc. make them seem to be different things, but it's purely the textures making it look that way. According to the game they are the same immovable substance. I would be surprised if WoW is not the same way. From a performance point of view, the fewer objects you use, the better it looks. That's why we hate trees! They are too complex to be part of the ground, bastard bastard trees, and they need millions of polys to look good!
Yea, that's what I was implying with a very stripped version of the world. The "level" the server has in memory does really only have to be a few collision boxes like the floor and a few boxes where noteworthy areas are, such as LOS doorways or corners. That's why the AQ model changing shortcut did work probably, the collision boxes are only polled by the AI module and maybe the spellcast module. Constant position verification by every single player position update would be too costly.
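To show how cheap such a stripped-down check could be, here is a Python sketch of a line-of-sight test against a handful of axis-aligned boxes, using the standard slab method. This is only an assumption about what a "few collision boxes" model might look like. The box coordinates and function names are invented, and nothing is known about what the real server does.

```python
def segment_hits_box(p0, p1, box_min, box_max):
    """Slab test: does the segment p0 -> p1 pass through the axis-aligned box?"""
    t_enter, t_exit = 0.0, 1.0
    for a, b, lo, hi in zip(p0, p1, box_min, box_max):
        d = b - a
        if d == 0.0:
            if a < lo or a > hi:
                return False  # parallel to this slab and outside it
            continue
        t0, t1 = (lo - a) / d, (hi - a) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
        if t_enter > t_exit:
            return False  # the slab intervals do not overlap
    return True


def has_line_of_sight(caster, target, blockers):
    """True if no blocking box lies between caster and target."""
    return not any(segment_hits_box(caster, target, lo, hi) for lo, hi in blockers)


# Hypothetical upper floor: a thin slab between z=9 and z=10.
floor = ((-50.0, -50.0, 9.0), (50.0, 50.0, 10.0))

print(has_line_of_sight((0.0, 0.0, 15.0), (0.0, 0.0, 0.0), [floor]))   # False
print(has_line_of_sight((0.0, 0.0, 15.0), (5.0, 0.0, 15.0), [floor]))  # True
```

A check like this only needs to run when the AI or a spell cast asks for it, which fits the idea that positions are not verified on every movement update.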
In fact one could go even further now and assume that the NPC A.I. is not actually part of the server but is in fact separate software acting as (scripted, local) clients from the server's perspective, official Blizzard-made bots so to speak. That would relieve the actual instance server of the memory requirement that having levels loaded could bring. It would also explain why most mobs have similar spell mechanics to players (example: an interrupt silencing a certain spell family), and why they seem to have no hardware performance issue putting in more and more mind-controlling NPCs (Blackheart anyone?).
So, enough of me speculating.
Generally that's to be expected. But scenarios like mobs that don't aggro because you're standing a floor above them (but still within aggro range Z-axis wise), LOS pulling and other abilities that are LOSable do heavily indicate that the server does have a (very very) stripped geometry view of the world to do certain LOS checks against. The only other way would be having part of the mob AI handled by a client, which seems like a very bad idea and would bring up yet more sync issues to deal with.
The LoS issue is quite interesting actually. I have no idea how they do it
To me it sounds like a lot of processing done on the server for nothing, having a "raw" version of the world in memory all the time and doing a lot of LoS checks. Take for example Chromaggus: ~30-35 people (minus the MT and some healers that are always out of LoS) running in and out of LoS, and it all has to be done very fast, so it puts a lot of stress on the server. Maybe that was partly why BWL was sometimes such a pain lagwise, though.
The other solution would be that the client handles most of the LoS issues. Most of the time you are NOT out of LoS, just temporarily while fighting. It could be that while it is in combat the client sends "I'm in LoS to this, this, this and those." Of course this opens up all sorts of exploits; if Blizzard took this route I'm sure they added some sanity checks on the server.
Honestly I don't know which route they did, maybe it's a combination of both.
I guess one way to find out would be to take your average caster and target a mob behind a mob. Begin casting a spell and see if there's any lag between the time you start casting and the time you get the "Target is not in line of sight." error message. If there is, it might mean the check is done server side. But it could also be that the client itself visually "lags".
Yea, that's what I was implying with a very stripped version of the world. The "level" the server has in memory does really only have to be a few collision boxes like the floor and a few boxes where noteworthy areas are, such as LOS doorways or corners. That's why the AQ model changing shortcut did work probably, the collision boxes are only polled by the AI module and maybe the spellcast module. Constant position verification by every single player position update would be too costly.
In fact one could go even further now and assume that the NPC A.I. is not actually part of the server but is in fact separate software acting as (scripted, local) clients from the server's perspective, official Blizzard-made bots so to speak. That would relieve the actual instance server of the memory requirement that having levels loaded could bring. It would also explain why most mobs have similar spell mechanics to players (example: an interrupt silencing a certain spell family), and why they seem to have no hardware performance issue putting in more and more mind-controlling NPCs (Blackheart anyone?).
So, enough of me speculating.
I think they'd pretty much have to do what you describe here. And those "client" AI machines probably even process stuff like "does boss AE not hit player X due to LOS". And I could see player power success checks being offloaded as well. We already know that what we see reported by the client is really an approximation, which is why we sometimes see in our combat log that our heal landed but the player died anyway... I could easily see that being a case where one server said "yep it should hit" but the other said "he ran out of hit points", with the conundrum being caused by a bit of back end latency.
And all of that really should be handled outside of the main "server" anyway... the main core server is probably just a giant message router after all. And its performance is most likely being judged as transactions per nanosecond while LOS and other more intensive tasks are probably more on the order of transactions per millisecond.
Range checks appear to be done both clientside and serverside. If you're out of range enough that the UI indicates it, you can't even start casting a spell. However, there have been a lot of times with Mind Flay where my UI indicates that the target is in range, and I can start casting it, but like with Dispel it stops casting and cancels the GCD after .2 seconds or so. LoS appears to be server-side only, as spells always start casting and then cancel, which nothing that is known to be client side does. I suppose the way to test would be to kill a model clientside, and see if you can cast through it. I suspect it's calculated serverside on every spell cast, though.
Range is almost definitely done both server- and clientside. You can have a gold loot bag only to be told "You are out of range to loot that corpse."
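The double check can be sketched in a few lines of Python. This is a guess at the general pattern, not the actual protocol: the function names, the position dictionaries and the 30-yard range are all assumptions. The client tests its own (possibly stale) view first, and the server repeats the test against its authoritative positions.

```python
import math


def in_range(a, b, max_range):
    """Plain distance check between two (x, y) positions."""
    return math.dist(a, b) <= max_range


def try_cast(client_view, server_view, max_range):
    """Return what the player would see; each view maps 'caster'/'target' to (x, y)."""
    if not in_range(client_view["caster"], client_view["target"], max_range):
        return "blocked by UI"  # client refuses, nothing is sent to the server
    if not in_range(server_view["caster"], server_view["target"], max_range):
        return "started, then cancelled"  # server rejects after the cast began
    return "cast succeeds"


# The client still thinks the target is 29 yards away; the server has it at 31.
client_view = {"caster": (0.0, 0.0), "target": (29.0, 0.0)}
server_view = {"caster": (0.0, 0.0), "target": (31.0, 0.0)}

print(try_cast(client_view, server_view, 30.0))  # started, then cancelled
```

The middle outcome is the one that matches the Mind Flay and loot-bag examples: the client says yes, the server says no a moment later.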
I think it is highly probable that a separate cluster controls each of the three major zones. Whether or not there is a need for a front-end box that handles communications among these will have to remain conjecture. I personally don't know why this would be necessary, and in my mind it adds a level of complexity that serves little purpose.
In a crude hypothetical example, let's say that a character on Kalimdor (cluster "K") is hearthing back to Outland (cluster "O"). Our interim communications server is "C". A request could be made directly from "K" to "O" for a transfer of session for this character. Since the request has already been validated, it should never be denied. The cluster is capable of its own load balancing. So why should there be a box in the middle, creating a useless step? There is no need that I can imagine to go K-C-O rather than just K-O. If "K" cannot handle the requests from "O" to switch clusters, it certainly can't handle the actual player sessions.
With all of that being said, and with my "Wall of Text" still needing some +crit gear, I defer to anyone with more knowledge than myself on server clustering to explain where the above assumptions might be wrong. At this particular time, I am very interested in learning whatever I can on the trials and tribulations of MMORPG deployment.
And surely there is a shorter, more elegant acronym than MMORPG?