The authentication server architecture seems to have caused them a lot of bother. It seems like they have only one Authentication server per Continent, which causes problems when large numbers of realms are restarted at once and the Authentication server becomes a bottleneck. I suspect they now have some kind of distributed architecture for the authentication server, as this has improved in recent times, but the login spikes are sometimes still huge enough to cause problems.
Blizzard originally went live with JAAS, a Java-based auth service. Sadly I can't find the original reference for that info. It didn't scale wonderfully and I believe they're now on their own custom solution.
The problem is pretty hard. "Given a million logged in North American users on hundreds of realms spread over multiple geographical locations, ensure no-one can log in more than once, expired users can't log in, etc etc". Again, fun stuff to work on.
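To make the two rules quoted above concrete (no double logins, no expired accounts), here is a minimal single-process sketch. Everything in it is an assumption for illustration: the `AuthRegistry` class, its field names, and the in-memory dictionaries are hypothetical, and nothing here reflects what Blizzard actually runs. The hard part of the real problem is doing this across many machines and sites, which this toy ignores.

```python
import time


class AuthRegistry:
    """Toy in-memory stand-in for a region-wide authentication service (hypothetical)."""

    def __init__(self):
        self.expiry = {}    # account -> subscription expiry, in epoch seconds
        self.sessions = {}  # account -> realm the account is currently logged in to

    def login(self, account, realm, now=None):
        now = time.time() if now is None else now
        # Unknown accounts have no expiry entry and are treated as expired.
        if self.expiry.get(account, 0) <= now:
            return "expired"
        # One live session per account, regardless of realm.
        if account in self.sessions:
            return "already_logged_in"
        self.sessions[account] = realm
        return "ok"

    def logout(self, account):
        self.sessions.pop(account, None)
```

With one shared registry the checks are trivial; the bottleneck described earlier in the thread shows up when every realm restart sends its whole population through `login` at once.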
I personally think this is the more interesting aspect of their operations to learn about. Which server platform/processor/router brand etc. doesn't really say too much without an understanding of exactly how many servers comprise a realm and how they are partitioned.
Someone mentioned Chat as being separated from continent servers... Which makes me think that perhaps there is some sort of "Master" machine (or process on one of the others, at least) which controls Chat, the Character Selection screen, and the item database. This one would keep track of which continent your character is supposed to be on and when you zone to another. It would also help explain why you're able to stay connected when other continents crash. [NOTE: this is only an educated guess based on observed behavior over the years.]
Realizing these aspects of server structure makes it all the more amusing when the General forums whine about "fix my server" or "upgrade my slow realm"... It's a lot more involved than sitting down at one keyboard and pressing Control-Alt-Delete. Maintenance on "a realm" involves checking the software configuration and wiring for 4-5 boxes... It's no wonder upgrading the hardware last year required 24-hour downtimes - a swarm of technicians had to reorganize something like 40 machines per hosting site, ensuring that each one was configured for the correct function and wired to its sister boxes!
Back to hardware specifics, the processor types and OS don't seem nearly as interesting as wondering what sort of RAM and storage requirements it really takes. Keep in mind that the server doesn't need ANY graphics processing whatsoever, so these figures are probably strikingly small compared to what professional servers are capable of having in them. (I don't know where to start guessing, though.)
All a server really needs to keep in memory is a standard data structure listing characters, mobs, resources, and the coordinates of each. The item database is... well, just an ordinary database. I am oversimplifying here - of course the server has to track interactions between players and mobs, but the math happening inside the CPU is a lot simpler than the sexy 3D graphics at home make it seem.
My educated guess on what the application actually "does" is pretty simple. There's a back-end database which contains your character data, as well as all the other data in the world: mob types, individual mob data, items, saved raid instances, etc. Connecting to that is a "simple" (relatively speaking) application which tracks character and mob movement. It interfaces directly to the clients which run on our desktops.
Figuring out exactly which interactions are server-side and which are client-side I believe is mostly a guessing game, though there are some predictable actions that you can assign to each one. Anything which affects the entire world (i.e. things which multiple people can interact with) is server-side. Anything that you control or manipulate is triggered from the client.
For instance: Two characters ride up to an herb spawn, and dismount, one gets the tap, the other doesn't. What happens? The server reports to the client that there's an herb at x,y on the map. Your herb radar pops up on your client based on this information which is sent to you (once you're in a certain radius.) The other player gets the same information. Network latency and system load determine who gets the "blip" first. You both run towards it -- this is pure client control. Does the server need to know you're moving? Not really. When you both get there, and dismount, you click the node. Client tells server, "hey, player is at x,y trying to pick up the weed." Server determines if you're in range of it, sends client "OK" and you begin harvesting. Based again on network latency which we all know and love, when the actual event is processed by the server, and sent to the client, is variable. Your bar fills up, client tells server "done picking the weed" and whoever got their "finished" packet back to the server first gets the herbs.
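The herb race described above boils down to a range check when the gather starts and a first-come claim when it finishes. A rough sketch of that guess follows; the `HerbNode` class, the `gather_range` value, and the method names are all made up for illustration and are not taken from any real server code.

```python
import math


class HerbNode:
    """Hypothetical server-side state for one herb spawn."""

    def __init__(self, x, y, gather_range=5.0):
        self.x, self.y = x, y
        self.gather_range = gather_range  # assumed interaction radius
        self.looted_by = None

    def start_gather(self, player, px, py):
        # Server answers "OK" only if the node is still there and the player is in range.
        if self.looted_by is not None:
            return False
        return math.hypot(px - self.x, py - self.y) <= self.gather_range

    def finish_gather(self, player):
        # Whoever gets the "finished" packet processed first wins the herbs.
        if self.looted_by is None:
            self.looted_by = player
            return True
        return False
```

Both players can pass `start_gather`; only the order in which the server handles the two `finish_gather` calls decides the winner, which is where latency comes in.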
What this all really means is that system load in different parts of the systems affect our experience in different ways. Mobs warping around is usually a network latency problem, or "server" load. (I use "server" to mean the app that interfaces with our desktop clients on the front end, and the database on the back end.) I imagine some component of mob movement is server and database related. If the server can't update positions to the client often enough, they warp. My guess about how much interaction takes place for mob movement is that the server sends a position, a direction, and a speed every once in a while to the client, based on mobs in some arbitrary range. It then lets the client render the movement of that mob until it gets a newer update on where it should be going. This is what causes warping when there is network latency, or server lag. Your client thinks a mob is in a certain spot, but it's really not. So you either aggro it from where you don't expect, or you can't enter combat because you get out of range errors.
Loot lag is a database problem. Most likely, the server process made a call to update your character data based on clicking an item in a loot window, and the success response was not forthcoming. Why does it wait? Well, you want to make SURE that when the user feedback indicates the item is looted (the window closes, or some other visual indication) that it ACTUALLY has been looted and the database is updated and saved. Otherwise, you might have lost items. ("HAI GUYZ I LOOT EPIX AND ITZ GONE NOW!!1?!?")
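The mob movement guess above (server sends position, direction and speed now and then, client fills in the gaps) is usually called dead reckoning. Here is a small sketch of the client side under that assumption. The `MobUpdate` fields and both function names are hypothetical, and the real protocol may look nothing like this.

```python
import math
from dataclasses import dataclass


@dataclass
class MobUpdate:
    """One hypothetical server update for a mob."""
    x: float
    y: float
    heading: float  # direction of travel, in radians
    speed: float    # units per second
    t: float        # server timestamp of this update, in seconds


def extrapolate(update, now):
    """Where the client draws the mob at time `now`, given only the last update."""
    dt = now - update.t
    return (update.x + math.cos(update.heading) * update.speed * dt,
            update.y + math.sin(update.heading) * update.speed * dt)


def warp_distance(old, new):
    """How far the mob visibly jumps when a newer update replaces the old one."""
    px, py = extrapolate(old, new.t)
    return math.hypot(new.x - px, new.y - py)
```

If updates arrive often, `warp_distance` stays near zero and the mob glides. If an update is delayed and the mob turned in the meantime, the client's guess drifts away from the server's position and the correction shows up as a warp.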
A lot of this seems pretty obvious to me because I work with applications like this every day, just not nearly on this scale. I hope it's given what I can assume to be a more educated audience than the WoW forums crowd at least a little insight, though it really is only an educated guess after all.
I personally think this is the more interesting aspect of their operations to learn about. Which server platform/processor/router brand etc. doesn't really say too much without an understanding of exactly how many servers comprise a realm and how they are partitioned.
As is often the case, the more interesting thing is the harder thing to suss out. There are some things we can say with some confidence based on behavior when things go wrong:
- Instances and battlegrounds are on a separate (set of) server(s) from the major zones in a battlegroup. It's possible that there are numerous machines acting as instance/battleground servers in a battlegroup (this probably depends on population/demand).
- Authentication is a separate service, probably battlegroup or region-wide. It is probably some sort of Kerberos-like token passing system.
- Chat is also a separate service (though what hardware hosts it is hard to say)
- Each major zone (EK, Kalimdor, Outland) is a separate instance server process, and possibly separate hardware.
- It does not appear that they do failover clustering for instances, including Kalimdor, et al. (i.e. if they did, you could have a situation where half your friends in Kalimdor would get dropped but you'd stay online with no significant issue while you're questing in, say, Tanaris. I've never observed this behavior during server crashes.)
- Battlegroups are probably some sort of super-cluster. If they don't share database/storage resources they can probably cross-connect to some degree. (This would be necessary for x-realm BGs, and is probably one of the significant technical issues they had to hurdle to make that feature possible.)
I'm not sure what else you can say with any degree of confidence about the deployment architecture (short of having insider info).
(I.E. if they did you could have situation where half your friends in Kalimdor would get dropped but you'd stay online with no significant issue while you're questing in, say, Tanaris. I've never observed this behavior during server crashes.)
During the first week or two of TBC we did actually see this happening quite a lot on Blackrock. Outland, as expected, was horribly overcrowded and regularly dropping people. Some of the time it was Outland as a whole. A lot of the time, however, it was just Hellfire Peninsula, with those of us who had escaped to Terokkar (especially) or Zangarmarsh (to some degree) able to watch others fall offline on a regular basis.
A couple of interesting questions that have had me thinking:
1) Does anyone have any idea exactly what Blizzard does during the maintenance day? Sure, it's a chance to install new hardware and reboot machines, but I don't believe new hardware happens every week. Is there the possibility that, e.g., WoW leaks memory and needs regular restarts?
2) A while ago in Silithus there were resurrection bugs, but only for that one zone. Given, as we all seem to agree, that Kalimdor is one "instance" and one server, what situation would arise that would affect resurrection in just one zone?
On a related note to question (2) above, the concept of different servers for different instances explained why item buffs like poisons were removed upon zoning in. As character state is essentially "copied" from one machine to another when zoning in, the protocol that did this initially missed out a category of data (item buffs). The fix they put in was inefficient in terms of network traffic, which caused all the bugs during its first two weeks. I think it's interesting when you start abstracting back like this, as you get a small handle on the HUGE volumes of data that must fly back and forth when zoning in, if a small, badly coded addition could cause so many high-traffic issues.
Last edited by Magunsson : 02/21/07 at 10:29 PM.
Reason: additional text
My Ice Block beats your wall of text. I carry on reading.
Originally Posted by Fex
Two characters ride up to an herb spawn.. What happens? The server reports to the client that there's an herb at x,y on the map. ... You both run towards it -- this is pure client control. Does the server need to know you're moving? Not really. ... Client tells server, "hey, player is at x,y trying to pick up the weed."
I don't believe this is correct. One of the fundamental concepts behind distributed programming like this is: do not trust the client!
The WoW server has to assume that the client has been compromised, and that any data it sends is suspect. If the client says "my player is at [X, Y]" the server cannot assume that the client calculated this legitimately from local player movement. It has to assume the possibility that the user is hacking and just making up data.
So the client machine will send back "my player moves forward at run speed" and the server will map the position of your character. The client will possibly also track the position of your character, to avoid network traffic, but this will be a "best guess" that the client presents to the player. The true position will only be held by the server. This explains the "Jumps" in position that happen during latency.
Other than that, your description is pretty good. As you say, it's really a very, very large state tracking system. I like your suggestion about how the mob movement works, and I think that sounds pretty accurate. Similar to my comments above, you suggest that the "true" mob position is held by the server, but that the client makes best guesses based on what it knows about mob velocity and pathing, aggro radius etc. to save on network traffic.
I don't see how that explains model hacks, however. For example, back on one PTR, Hunters were able to fly by model-swapping their trap models with staircases. They'd walk up their traps, lay another one, and keep walking. Also, another example would be the client-side model hack that allowed people to skip all the way to C'thun.
There was another model that people replaced to walk through walls but I can't remember what it was at the moment.
Programmers can be lazy: I know, 'cause I am one! Having said that, I think a more likely cause is that it's just more efficient to put some trust in the client (even though, as Mr Koster says, the client is the enemy). Not having to ask the server if you're OK to move every time you do it probably helps with that low network requirement.
As it sometimes (always) is with programming, it's about making the trade-off that works best for your situation. In this case: trust in the client vs server power, network capacity, etc.
The client is definitely trusted to an extent as far as movement is concerned.
There also seems to be a threshold of how far you have to move before your client decides it is important to tell the server (as rogues sadly experience daily in PvP).
They added some sanity checks after a lot of people started location hacking to warp to bosses in instances and insta cap flags in WSG. If you try most of that stuff now you immediately get disconnected.
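A sanity check of the kind mentioned above could be as simple as comparing the distance a client claims to have covered against the fastest it could plausibly move. The sketch below assumes exactly that and nothing more; `MAX_SPEED`, `TOLERANCE` and `plausible_move` are invented names and numbers, not anything known about the real checks.

```python
import math

MAX_SPEED = 14.0  # assumed top movement speed, in units per second
TOLERANCE = 1.5   # assumed slack factor to absorb latency jitter


def plausible_move(prev, new):
    """Check a client-reported move; prev and new are (x, y, timestamp) tuples."""
    (x0, y0, t0), (x1, y1, t1) = prev, new
    dt = t1 - t0
    if dt <= 0:
        # Reports that go backwards in time (or repeat a timestamp) are rejected.
        return False
    return math.hypot(x1 - x0, y1 - y0) <= MAX_SPEED * TOLERANCE * dt
```

The server still trusts the client for the exact path, so this costs almost nothing per report, but a teleport across an instance fails the check and the connection can be dropped.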
1) Does anyone have any idea exactly what Blizzard does during the maintenance day? Sure, it's a chance to install new hardware and reboot machines, but I don't believe new hardware happens every week. Is there the possibility that, e.g., WoW leaks memory and needs regular restarts?
I would guess a fair amount of that time is spent replacing failed hardware, rebuilding drives, deploying OS/application patches/fixes. With the sheer number of drives involved you better believe there is stuff to replace every week.
2) A while ago in Silithus there were resurrection bugs, but only for that one zone. Given, as we all seem to agree, that Kalimdor is one "instance" and one server, what situation would arise that would affect resurrection in just one zone?
Server in the sense that is being thrown around here does not necessarily mean a single logical entity.
I think what would be of interest as well is how the server software makes optimisations so that data transfer is kept to a minimal level. Inherent with any software system that has a large number of participants is the constant problem of having too much information being calculated, persisted (to the database) or transferred.
One behaviour that I find interesting is mob visibility. Go on to a server when there are no people around and you can see mobs for a fair distance. Go on to a server that's full and your mob viewing distance is quite limited in comparison. Another interesting behaviour is that on a busy continent it takes the server a while to process your location and update mob visibility. You see an extreme version of this when the server halts for a while as you are changing locations and all you can see is the mobs at your previous location. Then when the server comes to again, it all updates and you find yourself amongst a bunch of mobs.
No doubt the software is designed to keep your view of the world extremely limited as to cut down on the traffic of mob locations to the client, but more so internally. I'm sure my client and my net connection could handle 1000s more mobs, but internally cross checking all the actions between all the mobs and players would be a problem of n squared complexity, which would blow out the computation time by a very large degree.
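One common way to dodge the n squared cross-check is to bucket everything into grid cells and only compare things in the same or neighbouring cells. This is a generic technique, not something the thread confirms WoW uses. `CELL`, `build_grid` and `nearby` are hypothetical names, and the cell size is an arbitrary assumption.

```python
from collections import defaultdict
from itertools import product

CELL = 50.0  # assumed cell size, chosen to match an assumed visibility range


def build_grid(entities):
    """Bucket entities (name -> (x, y)) by the grid cell they stand in."""
    grid = defaultdict(list)
    for name, (x, y) in entities.items():
        grid[(int(x // CELL), int(y // CELL))].append(name)
    return grid


def nearby(grid, x, y):
    """Candidates in the 3x3 block of cells around (x, y); not an exact range test."""
    cx, cy = int(x // CELL), int(y // CELL)
    found = []
    for dx, dy in product((-1, 0, 1), repeat=2):
        found.extend(grid.get((cx + dx, cy + dy), []))
    return found
```

Each player or mob is then checked against a handful of neighbours instead of the whole continent. A smaller cell means fewer checks per entity, which would fit the observation that viewing distance shrinks on a full server.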
With regards to player location, this is definitely computed client side, but also double checked for abuse on the server side. Hence why location/speed hacks worked.
I don't believe this is correct. One of the fundamental concepts behind distributed programming like this is: do not trust the client!
There is a trade-off between how much you can trust a client and the computational load on the servers. It's generally not feasible to have everything of interest done by the server. Certainly not in the case of something that needs to run in real time over a 56k modem. How to divide tasks between the client and server is where lots of fun, interesting stuff comes up in distributed systems.
No doubt the software is designed to keep your view of the world extremely limited as to cut down on the traffic of mob locations to the client, but more so internally. I'm sure my client and my net connection could handle 1000s more mobs, but internally cross checking all the actions between all the mobs and players would be a problem of n squared complexity, which would blow out the computation time by a very large degree.
Not to mention that doing so would let you create ShowWoW.
There is a trade-off between how much you can trust a client and the computational load on the servers. It's generally not feasible to have everything of interest done by the server. Certainly not in the case of something that needs to run in real time over a 56k modem. How to divide tasks between the client and server is where lots of fun, interesting stuff comes up in distributed systems.
Indeed. Anyone who played Ultima Online in its early version will remember how bad movement was, because you basically had a server ping/pong for every step. It literally was "(your latency+server processing time+server latency)" for a single step. All online games since then (even UO got updated) have used a client buffer for player movement and only some minor sanity checks to avoid exploitation.
The WoW client verifies a lot of actions client-side. I'd estimate that over 80% of everything you do is handled by your client and never reaches the server. This is required in an environment where you can get over 200 players in the same spot and the server has to synchronize all the actions to everyone.
As for the herbalism example above, it doesn't matter if multiple people click the plant at the same time. What matters is the result, which in this case is that only one person gets to open the plant.
This is also true for creating items, they're almost all handled in a transaction (more precisely an SQL transaction), so duplicates are very rare. It's only the occasional fuckup from a programmer that allows people to duplicate stuff. This is also why you often see item lag (loot lag, creation lag, etc.) on congested servers, but movements and actions respond just fine.
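To show why a transaction makes duplicates rare, here is a small sketch using Python's built-in `sqlite3` module. The table names, columns and the `loot` function are invented for the example, and nothing is known here about the real schema or database product. The point is only that removing the item from the corpse and adding it to the bag either both happen or neither does.

```python
import sqlite3


def setup():
    """Create a throwaway in-memory database with one lootable item (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE corpse_loot (item_id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, owner TEXT, name TEXT)")
    db.execute("INSERT INTO corpse_loot VALUES (1, 'epic sword')")
    db.commit()
    return db


def loot(db, owner, item_id):
    """Move one item from the corpse to a player's inventory as a single transaction."""
    try:
        with db:  # commits if the block succeeds, rolls back if it raises
            row = db.execute(
                "SELECT name FROM corpse_loot WHERE item_id = ?", (item_id,)
            ).fetchone()
            if row is None:
                return False  # someone else already looted it
            db.execute("DELETE FROM corpse_loot WHERE item_id = ?", (item_id,))
            db.execute(
                "INSERT INTO inventory VALUES (?, ?, ?)", (item_id, owner, row[0])
            )
        return True
    except sqlite3.Error:
        return False
```

The client only hears "looted" after `loot` returns, so a slow or congested database shows up as loot lag while movement, which never touches this path, keeps responding.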
One thing I notice is that when there's loot lag going on, you can increment stack size just fine, but as soon as you have to add a new item to your inventory, the lag strikes.
Say I'm mining, for example, and get an adamantite ore, an eternium ore and a gem. The ores go into my pack just fine, as it's just incrementing the stack, but I might have 30 seconds' lag when I loot the gem. Next hit on the vein, say my adamantite stack is full and it needs to start another one - again I'll get the 30 seconds' lag.
Presumably this is because incrementing a stack and taking up a new inventory slot involve different database transactions?
One thing I notice is that when there's loot lag going on, you can increment stack size just fine, but as soon as you have to add a new item to your inventory, the lag strikes.
Say I'm mining, for example, and get an adamantite ore, an eternium ore and a gem. The ores go into my pack just fine, as it's just incrementing the stack, but I might have 30 seconds' lag when I loot the gem. Next hit on the vein, say my adamantite stack is full and it needs to start another one - again I'll get the 30 seconds' lag.
Presumably this is because incrementing a stack and taking up a new inventory slot involve different database transactions?
Could also be that for adding to a stack, you change a single data occurrence (number of said item). When you add something new to your inventory, you must add multiple pieces of data (item ref, number, etc.), as well as get your client to agree it's a known item on its side and then proceed to find the proper icon/tooltip to display and such.
I don't see how that explains model hacks, however. For example, back on one PTR, Hunters were able to fly by model-swapping their trap models with staircases. They'd walk up their traps, lay another one, and keep walking. Also, another example would be the client-side model hack that allowed people to skip all the way to C'thun.
There was another model that people replaced to walk through walls but I can't remember what it was at the moment.
Collision detection pretty much has to be done client side, otherwise you could maybe have 10 connected clients before the server was unusable because of the huge load.
The server shouldn't trust the client, but sometimes it has to because of limitations, and those limitations are always abused. Who doesn't remember wall hacks, for example? They work because the client knows the map, it just doesn't show things behind that wall. But if a program modifies what the client shows, you can all of a sudden see through walls and get a huge advantage.
In a perfect world the server would send the client "Right, this is what you see. Floor there, wall here. A big machine gun over there on the floor." and not tell the client about the hordes of hostile troopers around the corner just waiting for you to get close to the machine gun. No server can however keep up with that load, not even in 10-20 man CS/Quake/UT games, let alone several thousand in MMORPGs.
After that slight derail: collision detection exploits (like the hunter trap) work because collision detection has to be done client side. The server probably knows next to nothing about the layout of the world, where there are stairs and where there aren't.
I remember hearing all items had unique ids, which was their guard against duping items. You could track an item anywhere, and if there were 2 or more items with the same item id, you know something went wrong.
How far do you think this goes though? With your stacking example, you'd have to add another item id to the stack, and therefore it would require as much work as creating another stack (I think). Unless there is some special stuff going on with inventory slot checking, but I'm not sure about that.
So the client machine will send back "my player moves forward at run speed" and the server will map the position of your character. The client will possibly also track the position of your character, to avoid network traffic, but this will be a "best guess" that the client presents to the player. The true position will only be held by the server. This explains the "Jumps" in position that happen during latency.
I think there's a mix between this and my guess. Like someone else said in the thread, the server has to trust the client for some things or the server load would be exceptionally high. Some sort of sanity check is probably done at some point if there is an initial "I'm moving now" report from the client, but it's not done constantly.
(I.E. if they did you could have situation where half your friends in Kalimdor would get dropped but you'd stay online with no significant issue while you're questing in, say, Tanaris. I've never observed this behavior during server crashes.)
Just to confirm, this is definitely not the case anymore. During the first few days of TBC, Eastern Hellfire Peninsula would crash and drop people out regularly, whilst Western Hellfire Peninsula would remain completely stable - we advised all guildmates to move to that part of the zone to avoid the mass crashing.
Just to confirm, this is definitely not the case anymore. During the first few days of TBC, Eastern Hellfire Peninsula would crash and drop people out regularly, whilst Western Hellfire Peninsula would remain completely stable - we advised all guildmates to move to that part of the zone to avoid the mass crashing.
This could be just differences in population, ie, Client side problems (too much data, can't respond fast enough, server drops you) rather than Server side problems.
I remember hearing all items had unique ids, which was their guard against duping items. You could track an item anywhere, and if there were 2 or more items with the same item id, you know something went wrong.
How far do you think this goes though? With your stacking example, you'd have to add another item id to the stack, and therefore it would require as much work as creating another stack (I think). Unless there is some special stuff going on with inventory slot checking, but I'm not sure about that.
It could be that each stack has a unique id, not each item. This means that a dupe that simply increases the size of a stack could work, but I've never heard of one that does that. Even if you moved the duped items into legal stacks, there'd still be logs of multiple stacks with the same id existing. Stack IDs vs. Item IDs would also explain the significant lag when splitting stacks compared to just moving stacks to a different bag slot. There are definitely things currently tracked by stack, too -- if you have a stack of items with a duration of 1 day and a stack with 2 days, and merge them, you'll have one stack with 2 day duration. Resplit the stacks, and they'll both have 2 day duration -- indicating duration is stored on the stack, not the item.
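The stack-ID theory above can be sketched as a record where the id and the duration live on the stack rather than on each item. All names here (`Stack`, `new_stack`, `merge`, `split`) are hypothetical. Taking the longer duration on merge is just one rule that reproduces the 1 day plus 2 day example; the real rule could equally be "keep the target stack's duration".

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # hypothetical global id generator for stacks


@dataclass
class Stack:
    item: str
    qty: int
    duration_days: int  # stored once per stack, not per item
    stack_id: int = 0


def new_stack(item, qty, duration_days):
    return Stack(item, qty, duration_days, next(_ids))


def merge(a, b):
    # Assumption: the merged stack keeps the longer of the two durations.
    return new_stack(a.item, a.qty + b.qty, max(a.duration_days, b.duration_days))


def split(s, qty):
    # Both halves inherit the stack's duration; each half gets a fresh id.
    return (new_stack(s.item, s.qty - qty, s.duration_days),
            new_stack(s.item, qty, s.duration_days))
```

Under this model, bumping `qty` on an existing stack touches one field, while a split or a new stack has to mint and store a new record, which would be consistent with the lag differences people describe in the thread.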
Just something to go with the tracking via stacks, sometimes I notice that I have a bugged stack of ore which, when prospected, doesn't pop up the loot window. Usually I have to break down the stack and remake it to fix the problem.
Mind you, I'm no programmer, so I'm only relating in-game experiences.
This could be just differences in population, ie, Client side problems (too much data, can't respond fast enough, server drops you) rather than Server side problems.
No, it was definitely server related. There was a line you could draw roughly down the middle of the map, and once you crossed that line, loot lag and server lag would immediately vanish. It didn't matter how many people were in your particular vicinity at the time, there was a very marked East / West HFP divide.
Of course, it's disingenuous to talk of a single server. Each zone and area will be running on multiple servers simultaneously, and what could have happened was that the zone was split in two for the purposes of server architecture, with the western half put on a different subset of servers from the eastern half, in order to ease the strain on the first day. We know, since flying was introduced, that there is no technical reason for there being only a few points of entry between zones; it's just a design issue. So there's no reason that HFP couldn't be running on separate server sub-clusters in order to improve performance.