The Tenth Annual Game Design Think Tank
Project Horseshoe 2015
Group Report:
Ranking and Rating Systems
Participants:
Ian Schreiber, Rochester Institute of Technology  

Brief statement of the problem:

Identifying standard and best practices for rating and ranking systems in games

A brief statement of the solutions:

The group first examined many common ranking and rating systems as currently used in popular games and sports, then identified the design purposes behind various mechanics within these systems. The group also examined weaknesses of current systems and suggested ways to improve them. There is no single “best practice”; rather, ideal ranking/rating system design depends on the nature of a particular game, the design goals for such systems, and how those goals support the game’s core experience.

Expanded problem statement:

“Ranking” and “Rating” are often used interchangeably, but they are different concepts.

  • Rating = numerical approximation of player skill / probability of outcome predictor
  • Ranking = ordinal position of a player within a pecking order (leaderboard, ladder, etc.)

In short, rating is absolute, while ranking is relative. If the number one player in the world dies, everyone else gains +1 rank but does not change their rating.

An incomplete review of rating systems for Chess:

  • Harkness system: designed for Chess tournaments. The average rating of the competition in a tournament is calculated; a player who scores 50% in the tournament takes that average as their new rating, and a player who scores above 50% gains +10 points for each percentage point above 50. Obviously has some mathematical problems (e.g. someone rated 501 points below the mean can enter a tournament, lose all their games, and still go up in rating).
  • Elo system: Elo : Ratings :: C++ : Programming Languages. Most modern systems are Elo or else derived directly from Elo. Elo was initially designed to fix some of the inaccuracies in Harkness.
    • For a tournament: new rating = old rating + K*(W - W_e), where K = 10, W = actual tournament score, and W_e = expected tournament score.
    • For a single game: R_new = R_old + (K/2)*(W - L + (R_opp - R_player)/C), where W = wins, L = losses, K = 32, C = 200.
    • Expected win probability of A = 1 / [1 + 10^((R_B - R_A)/400)] = Q_A / (Q_A + Q_B), where Q_A = 10^(R_A/400).
    • New rating = old rating + K*(actual score - expected score), with K = 16 for master players and 32 for beginners.
    • Using the right K is important: K is basically volatility, i.e. how quickly a rating responds to new results. (A minimal sketch of the K-based update appears after this list.)
  • Glicko: a modification of Elo that effectively sets K based on rating reliability (the more games a player has played recently, the lower the uncertainty and the less volatile the rating)
  • Kyu/Dan system in Go: highly similar to Elo
  • Problems with Elo and its variants:
    • Assumes fair matchmaking/pairing (not true in games where players can select their own opponents)
    • Assumes a bell curve distribution of player skill (not true at the n00b level, even for Chess)
    • Assumes games are based entirely on skill such that an absolute beginner will never beat an expert, with an asymptote of 0 (not true for games with luck components – Garry Kasparov never lost a Chess game because he didn’t draw a Pawn until turn 10)
    • Doesn’t explicitly model a Draw (tournaments often treat a draw as half a win, but that may not be valid in a game where players can force a draw)
    • Only models 1v1 games (extra steps must be taken to extend to multiplayer co-op, teams, or free-for-all)
    • Still widely used, mainly because it’s what everyone is familiar with
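
As a concrete illustration of the K-based update above, here is a minimal Python sketch, assuming the standard 400-point logistic expected-score curve and a flat K = 32 (both the K and the example ratings are illustrative, not prescriptive):

    def elo_expected(r_a, r_b):
        """Expected score for player A against player B, between 0 and 1."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, score_a, k=32):
        """Player A's new rating; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
        return r_a + k * (score_a - elo_expected(r_a, r_b))

    # Example: a 1400-rated player upsets a 1600-rated player.
    print(elo_expected(1400, 1600))   # ~0.24 expected score for the underdog
    print(elo_update(1400, 1600, 1))  # ~1424.3, a large gain for a surprising win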

Bridge, Tennis, Magic: the Gathering:

  • Masterpoints: players get points for each tournament they play in. Points may be of different types, based on the level of the tournament (local vs. national vs. international). Ratings are based on how many points you collect over a lifetime (sometimes with a minimum of N points of the higher-level types required for the higher ratings).
  • M:tG started out this way (“DCI Convocation Points”), then switched to Elo, then switched back to a Masterpoint system (“Planeswalker Points”). With PP, you get points just for showing up to a tournament, in addition to points for winning; clearly an incentive to participate.
  • Obvious problem: this rewards experience and persistence more than skill. One player might be really great, another might be mediocre but travel to more tournaments, and they would have the same rating, which makes it useless as a predictor of game outcomes. It also ignores partnerships (how much of a win do you “own” vs. your partner, vs. the synergy of the partnership itself?).

TrueSkill™:

  • A player’s skill is given as an average (Mu, the mean) and an uncertainty (Sigma, the standard deviation), characterized as a normal distribution N(Mu, Sigma).
  • Mu always increases for a win and decreases for a loss, by an amount that depends both on the player’s AND the opponent’s Sigma (an uncertain player moves up or down a lot against a certain one, while the certain one moves very little against the uncertain one) and on how “surprising” the result is (larger changes for upsets between players with a large rating difference). (A brief sketch using an open-source implementation appears after this list.)
  • For FFA games, it ranks players by final outcome (so if there are 8 players, and I got a higher score than 5 of them and a lower score than 2, then I’m ranked #3 in that game). In that sense, you “beat” anyone with a lower score, “lose to” anyone with a higher score, and “draw with” anyone with the same score (or a sufficiently similar score; the threshold is defined differently depending on the game). Then you calculate the Elo-like difference between you and each opponent individually, and update as if you’d played 7 separate games, basically.
  • For team games, assume team skill is sum of skill of players on that team. Then it calculates the ratings changes for each team, then distributes those points based on individual players’ Sigma (uncertainties).
    • CS:GO added a “contribution score” (if you were near combat, you were helpful; if nothing else, you’re a bullet sponge) and distributed points based on your helpfulness.
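
As a rough illustration of these Mu/Sigma dynamics, here is a minimal sketch using the third-party open-source trueskill Python package (not Microsoft’s own implementation); the package defaults of Mu = 25 and Sigma = 25/3, and the example ratings, are assumptions for illustration:

    from trueskill import Rating, rate_1vs1, rate  # pip install trueskill

    newcomer = Rating()                    # mu=25.0, sigma~8.33: very uncertain
    veteran = Rating(mu=30.0, sigma=1.5)   # well-established, low uncertainty

    # The uncertain newcomer beats the certain veteran: the newcomer's Mu
    # jumps a lot, the veteran's Mu barely moves, and both Sigmas shrink.
    newcomer, veteran = rate_1vs1(newcomer, veteran)

    # 2v2 team example: team skill is modeled from its members' ratings, and
    # ranks=[0, 1] means the first team finished ahead of the second.
    (a1, a2), (b1, b2) = rate([(Rating(), Rating()), (Rating(), Rating())],
                              ranks=[0, 1])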

MOBAs, Hearthstone:

  • Matchmaking uses a secret internal rating (MMR), which seems similar to Elo. It tries to find an opponent with the most similar MMR, and if a suitable opponent can’t be found in a short time, it extends the acceptable window as a function of matchmaking time (see the sketch after this list).
  • LoL: interestingly, your MMR and rank change based on wins/losses compared to the average MMR of the opposing team, with no apparent consideration of the MMR of your teammates (!)
  • DOTA2 has separate MMRs for solo, party, and team play (ranked and unranked are also separate), plus an uncertainty (Sigma) variable. The difference between “party” and “team” is unclear, but it generally seems to account for playing with high-skilled players and for team synergy/coordination. I *think* “party” means the player’s friends, so they will tend to play together more frequently BUT will have higher skill disparity. (If four players in a party are matched with one player from the solo queue, that one player will have an MMR similar to the average of the other four.)
    • Matchmaking also takes experience (# games played) and other things into account.
    • MMR is actually displayed here in some cases.
    • “According to Valve, player opinions of the MMR system are highly correlated with their recent win rates” [ http://blog.dota2.com/2013/12/matchmaking/ ]
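
A minimal sketch of that widening-window behavior (the queue representation, the starting window of 50 MMR, and the widening rate are all assumptions for illustration):

    import time

    def find_match(my_mmr, queue, start_window=50, widen_per_sec=25, timeout=60):
        """Poll `queue` (a list of (player_id, mmr) tuples), accepting the
        closest opponent inside a window that widens with waiting time."""
        start = time.monotonic()
        while time.monotonic() - start < timeout:
            window = start_window + widen_per_sec * (time.monotonic() - start)
            candidates = [(pid, mmr) for pid, mmr in queue
                          if abs(mmr - my_mmr) <= window]
            if candidates:
                # Prefer the closest MMR among currently acceptable candidates.
                return min(candidates, key=lambda c: abs(c[1] - my_mmr))
            time.sleep(1)  # no one close enough yet; wait and widen
        return None  # give up, or hand off to a fallback policy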

Considerations for all rating systems:

  • What is being measured here? Measuring a combination of player activity, win/loss record, and other things can muddy and confuse the result.
  • How does luck-factor fit in? How can you design a system that measures player skill when a good player can lose due to bad luck?
  • Free-for-all multiplayer games: is your performance a function of the average of your opponents? Arithmetic or geometric mean? Or is it dominated by one strong (or one weak) opponent?
  • Team games: can one strong player carry the rest of the team, or can one weak player wreck everyone else? What is the combined “rating” of the team: does the top or bottom player dominate, or is it an arithmetic or geometric average? Does team synergy matter (a team that always plays together and knows each other’s signals vs. a team of disparate experts who haven’t played together before)?
  • What is the role of matchmaking? How do matchmaking systems interact with ranking and rating systems?
  • What about the role of analytics of strategies (should your rating / probability be a function of your deck composition in a TCG)?
  • How to “prove” your system is accurate? Mathematical simulations (or are these tautological)? Analytics of players using the system (how to know the “true” skill level to be able to compare with rank)?
  • What stats should you keep internally vs. what should you display to players?

Expanded solution description:

Rating Systems best practices:

  • I think that ratings should be a predictive probability of the game outcome: given any set of players, their ratings, and the other parameters in the model, if a player would gain X points for a win and lose Y points for a loss, then their probability of winning had better be Y/(X+Y). Note this means that if X = 0 (no points for a win, because you are expected to win with probability 1), the probability of winning is Y/Y = 1, regardless of Y; and if Y = 0 (no points for a loss, because you are expected to lose), the probability of winning is 0/(X+0) = 0, regardless of X. (A numerical check of this property against Elo appears after this list.)
  • Rankings, on the other hand, can be a function of XP and can be much more granular than ratings; they are progression mechanics.
  • Incentivizing play is dangerous, because players can game any system: give “bonus points” for playing someone of the same rating, and players will play lots of short games, trading wins and losses back and forth for mutual infinite net gain. I think that automated matchmaking is generally mandatory these days; choosing your own opponents is too easy to abuse. Strike a balance between time to find a match and difference in skill (a 50% expected win rate is ideal, as it leads to the most exciting and uncertain games). Key point: in cases where you have an uneven matchup, try to stagger those so that a player plays the favorite role about as often as the underdog role.
  • The luck factor should be the asymptote for the ratings’ predictive probabilities. So if 5% of games are decided by luck, 2.5% of those fall in favor of the favorite and 2.5% in favor of the underdog, so you’d expect an asymptote of a 2.5% win rate for an extreme underdog. This might be a “soft” asymptote or a hard cap.
  • As far as how to deal with multiplayer, it varies based on the game, probably best determined by taking an initial guess and then using analytics (try adding hooks for this during beta so that you have a good guess before release).
  • “Proving” the system is hard. If you make a simulation using expected win rates that match the model in your system, then of course it’s going to work pretty well; it’s still probably worth running with an RNG to get a sense of the best case (e.g. how many games it takes players to get “sorted” properly, or what the RMS error is after N games, tracked on a graph). With analytics, what you CAN do is look at the percentage of underdog wins overall (especially among players with highly different ratings and similar certainties) and take that to be half the luck factor. There is probably some way to look at ALL underdog-vs.-favorite wins?
  • That said, if a matchmaking system is hidden, a certain amount of inaccuracy is acceptable. As long as players in a MOBA don’t spend interminable time in the queue and feel like they win often enough to justify continued play, having the MMR be a 100% accurate mathematical model isn’t necessary. Players care far more about what is displayed (i.e. ranking) in this case.
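
To make the Y/(X+Y) property concrete: under standard Elo, a win gains X = K*(1-E) and a loss costs Y = K*E, so Y/(X+Y) = E, the model’s own win probability, meaning Elo passes this consistency test by construction. The sketch below verifies that numerically, and shows one possible hard-cap version of the luck-factor asymptote (the 2.5% floor is just the example figure from above):

    def elo_expected(r_a, r_b):
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    K, r_a, r_b = 32, 1500, 1750
    e = elo_expected(r_a, r_b)
    x = K * (1 - e)  # points gained on a win
    y = K * e        # points lost on a loss
    assert abs(y / (x + y) - e) < 1e-12  # Y/(X+Y) recovers the win probability

    # Never predict below a 2.5% (or above a 97.5%) win chance,
    # however lopsided the matchup (assumed luck-factor values).
    def clamped_expected(r_a, r_b, floor=0.025):
        return min(max(elo_expected(r_a, r_b), floor), 1.0 - floor)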

Other avenues of exploration for Ratings:

  • Taking into account how much a player wins by. A win by a large margin implies a greater degree of skill.
  • Considering the amount of time a player takes to win, and player performance across time. A player who is obviously stronger, and who doesn’t want the other player to feel bad about being destroyed, may intentionally play suboptimally in order to win consistently but by a narrow margin.
  • Using self-selected player types in automated matchmaking, and not just rating: whether they play casually or competitively, whether they like trash-talk, etc. (not as “moral values” but just as a “what’s your gamer type” thing that’s non-judgmental). Then match players against others who share their preferences.
  • Occasionally, intentionally giving a player a much weaker opponent in order to maintain engagement (e.g. when a player has been on a loss streak). Could even match them against a bot that’s posing as a real player but programmed to lose (an “Ashley-Madison AI”).

Ranking Systems:

Ways to show rank:

  • High granularity (#234 of 175938) vs. Low granularity (Bronze/Silver/Gold/Platinum)
  • Global (all players) vs. nearby (a global list culled to the ~30 places on either side of you) vs. local (within your guild, within your friends list, within your city)
  • Permanent (tied to lifetime rating) vs. temporary (unrelated to rating, reset every “season”)

Ways to go up or down in rank:

  • Ladder system (challenge above you, go up on a win). Lots of ways to do this.
    • Pro: immediately gives the idea of “climbing the ladder” i.e. progression; no risk (you can go up or stay where you are when you challenge, falling is invisible to you)
    • Con: only one player can be #1, demoralizing to be on a treadmill where you must “advance” just to maintain current position.
  • Stars (Hearthstone). Go up or down from wins/losses. Matchmaking done by MMR (rating) independent from rank.
    • Ironically, this leads to a situation where, to advance, players would like to have a low MMR relative to their actual skill (the matchmaker then feeds them weaker opponents, so they win more often and climb faster).
  • Low-granularity display of rating: go up by going up enough in rating (hard caps or percentiles; a percentile-to-tier sketch appears after this list).
  • Subjective (used in some pro sports) vs. objective (more common).
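
A minimal sketch of a percentile-based low-granularity display (the tier names and cutoffs are illustrative assumptions):

    from bisect import bisect_right

    # Assumed cutoffs: bottom 50% Bronze, next 30% Silver, next 15% Gold, top 5% Platinum.
    TIERS = [(0.50, "Bronze"), (0.80, "Silver"), (0.95, "Gold"), (1.01, "Platinum")]

    def tier_for(rating, all_ratings):
        """Convert a raw rating into a display tier via its percentile."""
        sorted_ratings = sorted(all_ratings)
        percentile = bisect_right(sorted_ratings, rating) / len(sorted_ratings)
        for cutoff, name in TIERS:
            if percentile <= cutoff:
                return name

    print(tier_for(1800, [1200, 1400, 1500, 1800, 2100]))  # -> Silver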

Potential design purposes of rank (important to know what it’s being used for to design it right):

  • Progression system. Rating isn’t progression because it only increases when player gets better at the game, which is a slow process. (Regular resets such as “season play” help with this. Faster progression for winning, slower or no drops for losing.)
  • Dominance system. Lets you know how many players you are better (or worse) than. Downside: only one player can be #1 (local ranks mitigate this, as can low-granularity systems: knowing you’re somewhere in the top 5%).
  • Reward system. Option to give in-game loot for reaching milestones in the progression system. (Dovetails well with regular resets for regular rewards, otherwise ends up as a sort of “tutorial” one-shot series of rewards that start off fast and slow down over time)
  • Can give you a better sense of where you stand if you’re good (converting an abstract “rating” number into a more meaningful percentile) and/or mask how bad you are if you’re bad (you just gained a level, you’re awesome, let’s not worry about the fact that you’re in the bottom 5%). So, ironically, rank can either clarify or mystify ratings.

Problems and Future Work:

  • The mean rating among all players may increase or decrease over time, depending on the system, leading to long-term ratings inflation or deflation. (Example: if there is a “rating floor” below which a player can’t fall, but a player winning against a player at the floor still gains points, that’s inflationary; a toy simulation appears after this list.) For seasonal play where ratings are reset, or for hidden ratings, this may not be much of an issue. For permanent ratings that are displayed to players (e.g. Chess), this is something you will want to keep to a minimum in either direction (inflation punishes new players; deflation punishes old players).
  • Dealing with players disconnecting (intentionally or otherwise) is a hard problem. Don’t punish disconnects, and players will intentionally disconnect to avoid a loss; treat a disconnect as a loss, and players who aren’t lucky enough to have perfectly stable internet will hate your game. Telling intentional from unintentional disconnects is hard, and if players connect directly to each other during the game instead of playing through your server, it can be impossible (even if honest players who lose their game connection are expected to reconnect to the server, a cheater can drop their internet at just the moment the game ends and re-establish the server connection in order to look honest). Players connecting directly to each other also opens the door to one player using a DoS attack on their opponent to get a cheap win. At the very least, make it easy to go through games after the fact looking for patterns (e.g. players with a suspiciously high personal or opponent disconnect rate).
  • Cheating is also a big deal. If you have any kind of rating or ranking that is exposed to players, they will cheat. If it is used as any kind of meaningful metric (invitational tournaments to top ranks, top-ten list displayed to all players, etc.) they will cheat a lot. Expect to spend a disproportionate amount of time dealing with cheaters.
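
A toy simulation of the rating-floor inflation example above (population size, floor, K, and the initial rating distribution are all assumed values): losses below the floor are forgiven while wins against floored players still pay out, so total rating in the population drifts upward.

    import random

    def elo_expected(r_a, r_b):
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    FLOOR, K = 1000, 32
    ratings = [random.gauss(1100, 100) for _ in range(1000)]

    for _ in range(100_000):
        a, b = random.sample(range(len(ratings)), 2)
        e = elo_expected(ratings[a], ratings[b])
        a_wins = random.random() < e
        ratings[a] += K * ((1.0 if a_wins else 0.0) - e)
        ratings[b] += K * ((0.0 if a_wins else 1.0) - (1.0 - e))
        # The floor eats rating losses but never rating gains: inflation.
        ratings[a] = max(ratings[a], FLOOR)
        ratings[b] = max(ratings[b], FLOOR)

    print(sum(ratings) / len(ratings))  # drifts above the initial mean of ~1100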
