Approximation Schemes for Stochastic Mean Payoff Games with Perfect Information and Few Random Positions

We consider two-player zero-sum stochastic mean payoff games with perfect information. We show that any such game, with a constant number of random positions and polynomially bounded positive transition probabilities, admits a polynomial time approximation scheme, both in the relative and absolute sense.


Introduction
The rise of the Internet has led to an explosion in research in game theory, the mathematical modeling of competing agents in strategic situations. The central concept in such models is that of a Nash equilibrium, which defines a state where no agent gains an advantage by changing to another strategy. Nash equilibria serve as predictions for the outcome of strategic situations in which selfish agents compete.
A fundamental result in game theory states that if the agents can choose mixed strategies (i.e., probability distributions over deterministic strategies), a Nash equilibrium is guaranteed to exist in finite games [24,25]. Often, however, already pure (i.e., deterministic) strategies lead to a Nash equilibrium. Still, the existence of Nash equilibria might be irrelevant in practice if their computation takes too long (finding mixed Nash equilibria in two-player games is PPAD-complete in general [11]). Thus, algorithmic aspects of game theory have gained a lot of interest. Following the dogma that only polynomial-time algorithms are feasible algorithms, it is desirable to establish polynomial-time complexity for the computation of Nash equilibria.
We consider two-player zero-sum stochastic mean payoff games with perfect information. In this case the concept of Nash equilibria coincides with saddle points or mini-max/maxi-min strategies. The decision problem associated with computing such strategies and the values of these games is in the intersection of NP and co-NP, but it is unknown whether it can be solved in polynomial time. In cases where efficient algorithms are not known to exist, an approximate notion of a saddle point has been suggested. In an approximate saddle point, no agent can gain a substantial advantage by changing to another strategy. In this paper, we design approximation schemes for saddle points for such games when the number of random positions is fixed (see Sect. 1.2 for a definition).
In the remainder of this section, we introduce the concepts used in this paper. Our results are summarized in Sect. 1.4. After that, we present our approximation schemes (Sect. 2). We conclude with a list of open problems (Sect. 3), where we address in particular the question of polynomial smoothed complexity of mean payoff games. In the conference version of this paper [2], we wrongly claimed that stochastic mean payoff games can be solved in smoothed polynomial time.

Definition and Notation
The model that we consider is a stochastic mean payoff game with perfect information or, equivalently, a BWR-game G = (G, P, r): the vertex set V of the digraph G = (V, E) is partitioned into black positions V_B (controlled by player Black), white positions V_W (controlled by player White), and random positions V_R. Starting from some vertex v0 ∈ V, a token is moved along one edge in every round of the game. If the token is on a black vertex, Black selects an outgoing edge e and moves the token along e; if the token is on a white vertex, White selects an outgoing edge. In a random position v ∈ V_R, a move e = (v, u) is chosen according to the probabilities p_vu of the outgoing edges of v. In all cases, Black pays White the reward r_e on the selected edge e.
Starting from a given initial position v0 ∈ V, the game yields an infinite walk (v0, v1, v2, ...), called a play. Let b_i denote the reward r_{(v_{i−1}, v_i)} received by White in step i. The undiscounted limit average effective payoff is defined as the Cesàro average c = liminf_{n→∞} (1/n) Σ_{i=1}^{n} b_i. White's objective is to maximize c, while Black's objective is to minimize it.
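For intuition, the Cesàro average can be estimated by averaging the rewards along a long prefix of a play. The following sketch is a toy example of ours (not taken from the paper) for the special case where every position is random, i.e., the game degenerates to a weighted Markov chain:

```python
import random

def simulate_play(succ, rewards, v0, steps, seed=0):
    """Average the rewards along one simulated play prefix of length `steps`.
    succ[v] lists (next_vertex, probability) pairs; rewards[(v, u)] is the
    reward Black pays White when the token moves along the arc (v, u)."""
    rng = random.Random(seed)
    v, total = v0, 0.0
    for _ in range(steps):
        nxts, probs = zip(*succ[v])
        u = rng.choices(nxts, weights=probs)[0]
        total += rewards[(v, u)]
        v = u
    return total / steps

# A two-vertex chain: from 0 stay with prob. 2/3 (reward 0) or move to 1
# (reward 3); from 1 always return to 0 (reward 3).
succ = {0: [(0, 2 / 3), (1, 1 / 3)], 1: [(0, 1.0)]}
rewards = {(0, 0): 0.0, (0, 1): 3.0, (1, 0): 3.0}
avg = simulate_play(succ, rewards, 0, 200_000)
# the stationary distribution is (3/4, 1/4), so the limiting average is 1.5
assert abs(avg - 1.5) < 0.1
```

Here the limit is deterministic because the chain is ergodic; in general, the liminf is taken per play.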
In this paper, we will restrict our attention to the sets of pure (that is, non-randomized) and stationary (that is, history-independent) strategies of players White and Black, denoted by S_W and S_B, respectively; such strategies are called positional strategies. Formally, a positional strategy s_W ∈ S_W for White is a mapping that assigns a move (v, u) ∈ E to each position in V_W. We sometimes abbreviate s_W(v) = (v, u) by s_W(v) = u. Strategies s_B ∈ S_B for Black are defined analogously. A pair of strategies s = (s_W, s_B) is called a situation. Given a BWR-game G = (G, P, r) and a situation s = (s_W, s_B), we obtain a weighted Markov chain G(s) = (G(s) = (V, E(s)), P(s), r) with transition matrix P(s) defined in the obvious way: each controlled vertex moves with probability 1 along the edge chosen by its owner, and each random vertex keeps its transition probabilities. Here, E(s) = {e = (v, u) ∈ E | p_vu(s) > 0} is the set of arcs with positive probability. Given an initial position v0 ∈ V from which the play starts, we define the limiting (mean) effective payoff c_v0(s) in G(s) as the expected Cesàro average of the rewards along the play. We denote by ρ(s, v0) the limiting arc distribution of G(s) started at v0, and simply write ρ(s) if v0 is clear from the context. In what follows, we will use (G, v0) to denote the game starting from v0. For rewards r: E → R, let r− = min_e r_e and r+ = max_e r_e. Let [r] = [r−, r+] be the range of r. Let R = R(G) = r+ − r− be the size of the range.

Strategies and Saddle Points
If we consider c_v0(s) for all possible situations s, we obtain a matrix game with rows indexed by White's strategies and columns indexed by Black's. It is known that every such game has a saddle point in pure strategies [19,29]. Such a saddle point defines an equilibrium state in which no player has an incentive to switch to another strategy. The value at that state coincides with the limiting payoff in the corresponding BWR-game [19,29]. We call a pair of strategies optimal if they correspond to a saddle point. It is well-known that there exist optimal strategies (s*_W, s*_B) that do not depend on the starting position v0. Such strategies are called uniformly optimal. Of course there might be several optimal strategies, but they all lead to the same value. We define this to be the value of the game and write μ_v0(G) = c_v0(s*), where s* = (s*_W, s*_B) is any pair of optimal strategies. Note that μ_v0(G) may depend on the starting node v0. Note also that, for an arbitrary situation s, μ_v0(G(s)) denotes the effective payoff c_v0(s) in the Markov chain G(s).
An algorithm is said to solve the game if it computes an optimal pair of strategies.
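In a BW-game (no random positions) a situation induces a unique eventually periodic play, so c_v0(s) is simply the mean reward on the cycle the play enters, and a saddle point can be found by exhaustive search over the finitely many positional strategies. The following illustrative sketch (the game instance and all identifiers are ours) verifies maximin = minimax on a toy example, as guaranteed by [19,29]:

```python
from itertools import product

# Toy BW-game: vertex 0 belongs to White, vertex 1 to Black.
# moves[v] lists the available moves (successor, reward) at v.
owner = {0: "W", 1: "B"}
moves = {0: [(0, 1), (1, 0)],   # White: self-loop (reward 1) or go to 1 (reward 0)
         1: [(1, 3), (0, 2)]}   # Black: self-loop (reward 3) or go back (reward 2)

def mean_payoff(strategy, v0):
    """Mean reward on the cycle entered by the unique play under `strategy`."""
    seen, rewards, v = {}, [], v0
    while v not in seen:
        seen[v] = len(rewards)
        u, r = strategy[v]
        rewards.append(r)
        v = u
    cyc = rewards[seen[v]:]          # rewards on the cycle the play enters
    return sum(cyc) / len(cyc)       # Cesàro limit

white_vs = [v for v in owner if owner[v] == "W"]
black_vs = [v for v in owner if owner[v] == "B"]

# Payoff matrix over all pure positional strategies (rows: White, cols: Black).
M = [[mean_payoff({**dict(zip(white_vs, sw)), **dict(zip(black_vs, sb))}, 0)
      for sb in product(*(moves[v] for v in black_vs))]
     for sw in product(*(moves[v] for v in white_vs))]

maximin = max(min(row) for row in M)
minimax = min(max(M[i][j] for i in range(len(M))) for j in range(len(M[0])))
assert maximin == minimax == 1.0     # a pure saddle point exists
```

This exhaustive search is of course exponential in general; it only illustrates the definitions.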

Approximation and Approximate Equilibria
Given a BWR-game G = (G = (V, E), P, r), a constant ε > 0, and a starting position v ∈ V, an ε-relative approximation of the value of the game is determined by a situation (s*_W, s*_B) such that

μ_v(G) / (1 + ε) ≤ c_v(s*_W, s*_B) ≤ (1 + ε) μ_v(G).   (1)

An alternative concept of an approximate equilibrium is that of an ε-relative equilibrium, determined by a situation (s*_W, s*_B) such that

max_{s_W ∈ S_W} c_v(s_W, s*_B) ≤ (1 + ε) c_v(s*_W, s*_B)  and  min_{s_B ∈ S_B} c_v(s*_W, s_B) ≥ c_v(s*_W, s*_B) / (1 + ε).   (2)

Note that, for sufficiently small ε, an ε-relative approximation implies an O(ε)-relative equilibrium, and vice versa. Thus, in what follows, we will use these notions interchangeably. When considering relative approximations and relative equilibria, we assume that the rewards are non-negative integers. An alternative to relative approximations is to look for an approximation with an absolute error of ε; this is achieved by a situation (s*_W, s*_B) such that

|c_v(s*_W, s*_B) − μ_v(G)| ≤ ε.   (3)

Similarly, for an ε-absolute equilibrium, we have the condition

max_{s_W ∈ S_W} c_v(s_W, s*_B) − ε ≤ c_v(s*_W, s*_B) ≤ min_{s_B ∈ S_B} c_v(s*_W, s_B) + ε.

Again, an ε-absolute approximation implies a 2ε-absolute equilibrium, and vice versa. When considering absolute equilibria and absolute approximations, we assume that the rewards come from the interval [−1, 1]. A situation (s*_W, s*_B) is called relatively ε-optimal if it satisfies (1), and absolutely ε-optimal if it satisfies (3). In the following, we will drop the specification of absolute and relative if it is clear from the context. If the pair (s*_W, s*_B) is (absolutely or relatively) ε-optimal for all starting positions, it is called uniformly (absolutely or relatively) ε-optimal (also called subgame perfect).
We note that, under the above assumptions, the notion of relative approximation is stronger. Indeed, consider a BWR-game G with rewards in [−1, 1]. A relatively ε-optimal situation (s*_W, s*_B) of the game Ĝ with local rewards given by r̂ = r + 1 ≥ 0 (where 1 is the vector of all ones, and addition and comparison are meant componentwise) approximates μ_v(Ĝ) within a relative error of ε, and hence within an absolute error of 2ε. This is because μ_v(Ĝ(s)) = μ_v(G(s)) + 1 for any situation s and μ_v(G) ≤ 1, so that μ_v(Ĝ) ≤ 2. Thus, we obtain a 2ε-absolute approximation of the value of the original game. An algorithm for approximating (absolutely or relatively) the values of the game is said to be a fully polynomial-time (absolute or relative) approximation scheme (FPTAS) if its running-time depends polynomially on the input size and 1/ε. In what follows, we assume without loss of generality that 1/ε is an integer.
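The shift argument is elementary arithmetic; the following check (with illustrative numbers of ours) confirms that shifting back a relatively ε-optimal value of the shifted game gives an absolute 2ε-approximation of the original value:

```python
eps = 0.05
for mu in [-1.0, -0.3, 0.0, 0.7, 1.0]:           # possible values of G, in [-1, 1]
    shifted = mu + 1.0                            # value of G-hat, in [0, 2]
    # extreme relatively eps-optimal answers for the shifted game:
    for vhat in (shifted * (1 - eps), shifted * (1 + eps)):
        # shifting back yields an absolute error of at most eps * 2 = 2*eps
        assert abs((vhat - 1.0) - mu) <= 2 * eps + 1e-12
```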

Previous Results
BWR-games are an equivalent formulation [21] of the stochastic games with perfect information and mean payoff that were introduced in 1957 by Gillette [19]. As was noticed already in [21], the BWR model generalizes a variety of games and problems: BWR-games without random positions (V_R = ∅) are called cyclic or mean payoff games [16,17,21,33,34]; we call these BW-games. If one of the sets V_B or V_W is empty, we obtain a Markov decision process, for which polynomial-time algorithms are known [32]. If both are empty (V_B = V_W = ∅), we get a weighted Markov chain. If, in a BW-game, one of the sets V_B or V_W is empty (i.e., only one player has positions and there is no randomness), we obtain the minimum mean-weight cycle problem, which can be solved in polynomial time [27].
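For concreteness, the minimum mean-weight cycle problem can be solved with Karp's classical algorithm [27]. A self-contained sketch (our own implementation, not from the paper):

```python
def min_mean_cycle(n, edges):
    """Karp's algorithm: minimum mean weight of a directed cycle,
    or None if the graph on vertices 0..n-1 is acyclic.
    d[k][v] = minimum weight of a walk with exactly k edges ending at v,
    starting anywhere (an implicit 0-weight super-source)."""
    INF = float("inf")
    d = [[INF] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = INF
    for v in range(n):
        if d[n][v] < INF:
            best = min(best, max((d[n][v] - d[k][v]) / (n - k)
                                 for k in range(n) if d[k][v] < INF))
    return None if best == INF else best

# Two-vertex cycle of weights 1 and 3: mean 2; adding a self-loop of
# weight 1 lowers the minimum cycle mean to 1.
assert min_mean_cycle(2, [(0, 1, 1.0), (1, 0, 3.0)]) == 2.0
assert min_mean_cycle(2, [(0, 1, 1.0), (1, 0, 3.0), (0, 0, 1.0)]) == 1.0
```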
If all rewards are 0 except for those on m terminal loops, we obtain the so-called Backgammon-like or stochastic terminal payoff games [7]. The special case m = 1, in which every random node has only two outgoing arcs with probability 1/2 each, defines the so-called simple stochastic games (SSGs), introduced by Condon [13,14]. In these games, the objective of White is to maximize the probability of reaching the terminal, while Black wants to minimize this probability. Recently, it has been shown that Gillette games (and hence BWR-games [3]) are equivalent to SSGs under polynomial-time reductions [1]. Thus, by recent results of Halman [22], all these games can be solved in randomized strongly subexponential time. Besides their many applications [26,30], all these games are of interest to complexity theory: the decision problem whether the value of a BW-game is positive is in the intersection of NP and co-NP [28,40]; yet, no polynomial algorithm is known even in this special case. We refer to Vorobyov [39] for a survey. A similar complexity claim holds for SSGs and BWR-games [1,3]. On the other hand, there exist algorithms that solve BW-games very fast in practice [21]. The situation for these games is thus comparable to that of linear programming before the discovery of the ellipsoid method: linear programming was known to lie in the intersection of NP and co-NP, and the simplex method proved to be fast in practice. In fact, a polynomial algorithm for linear programming in the unit cost model would already imply a polynomial algorithm for BW-games [37]; see also [4] for an extension to BWR-games.
While there are numerous pseudo-polynomial algorithms known for BW-games [21,35,40], pseudo-polynomiality for BWR-games (with no restriction on the number of random positions) is in fact equivalent to polynomiality [1]. Gimbert and Horn [20] have shown that a generalization of simple stochastic games with k random positions having arbitrary transition probabilities (not necessarily (1/2, 1/2)) can be solved in time O(k!(|V||E| + L)), where L is the maximum bit length of a transition probability. There are various improvements with smaller dependence on k [9,15,20,23] (note that even though BWR-games are polynomially reducible to simple stochastic games, under this reduction the number of random positions does not stay constant, but is only polynomially bounded in n, even if the original BWR-game had a constant number of random positions). Recently, a pseudo-polynomial algorithm was given for BWR-games with a constant number of random positions and a polynomial common denominator of the transition probabilities, but under the assumption that the game is ergodic (that is, the value does not depend on the initial position) [5]. This result was then extended to the non-ergodic case [6]; see also [4].
As for approximation schemes, the only result we are aware of [36] is the observation that the values of BW-games can be approximated within an absolute error of ε in polynomial time if all rewards are in the range [−1, 1]. This follows immediately from truncating the rewards and using any of the known pseudo-polynomial algorithms [21,35,40].
On the negative side, it was observed recently [18] that obtaining an ε-absolute FPTAS without the assumption that all rewards are in [−1, 1], or an ε-relative FPTAS without the assumption that all rewards are non-negative, for BW-games, would imply their polynomial time solvability. In that sense, our results below are the best possible unless there is a polynomial algorithm for solving BW-games.

Our Results
In this paper, we extend the absolute FPTAS for BW-games [36] in two directions: first, we allow a constant number of random positions, and, second, we derive an FPTAS with a relative approximation error. Throughout the paper, we assume the availability of a pseudo-polynomial algorithm A that solves any BWR-game G with integral rewards and rational transition probabilities in time polynomial in n, D, and R, where R = R(G) = r+(G) − r−(G) is the size of the range of the rewards, r+(G) = max_e r_e and r−(G) = min_e r_e, and D = D(G) is the common denominator of the transition probabilities. Note that the dependence on D is inherent in all known pseudo-polynomial algorithms for BWR-games. Note also that an affine scaling of the rewards does not change the game.
Let p min = p min (G) be the minimum positive transition probability in the game G. Throughout this paper, we will assume that the number k of random positions is bounded by a constant.
The following theorem says that a pseudo-polynomial algorithm can be turned into an absolute approximation scheme.

Theorem 1 Given a pseudo-polynomial algorithm for solving any BWR-game with k = O(1) (in uniformly optimal strategies), there is an algorithm that returns, for any given BWR-game with rewards in [−1, 1] satisfying (1), and for any ε > 0, a pair of strategies that (uniformly) approximates the value within an absolute error of ε. The running-time of the algorithm is bounded by a polynomial in n, 1/ε, and 1/p_min.
We also obtain an approximation scheme with a relative error. We remark that Theorem 1 (apart from the dependence of the running time on log R) can be obtained from Theorem 2 (see Sect. 2). However, our reduction in Theorem 1, unlike that of Theorem 2, has the property that if the pseudo-polynomial algorithm returns uniformly optimal strategies, then the approximation scheme also returns uniformly ε-optimal strategies. For BW-games, i.e., the special case without random positions, we can also strengthen the result of Theorem 2 to return a pair of strategies that is uniformly ε-optimal.

Theorem 3 Assume that there is a pseudo-polynomial algorithm for solving any BW-game in uniformly optimal strategies. Then for any ε > 0, there is an algorithm that returns, for any given BW-game with non-negative integral rewards, a pair of uniformly relatively ε-optimal strategies. The running-time of the algorithm is bounded by poly(n, log R, 1/ε).
In deriving these approximation schemes from a pseudo-polynomial algorithm, we face two main technical challenges that distinguish the computation of ε-equilibria of BWR-games from similar standard techniques used in combinatorial optimization. First, the running-time of the pseudo-polynomial algorithm depends polynomially both on the maximum reward and the common denominator D of the transition probabilities. Thus, in order to obtain a fully polynomial-time approximation scheme (FPTAS) with an absolute guarantee whose running-time is independent of D, we have to truncate the probabilities and bound the change in the game value, which is a non-linear function of D. Second, in order to obtain an FPTAS with a relative guarantee, one needs (as often in optimization) a (trivial) lower/upper bound on the optimum value. In the case of BWR-games, it is not clear what bound we can use, since the game value can be arbitrarily small. The situation becomes even more complicated if we look for uniformly ε-optimal strategies. This is because we have to output just a single pair of strategies that guarantees ε-optimality from any starting position.
In order to resolve the first issue, we analyze the change in the game values and optimal strategies when the rewards or transition probabilities are changed. Roughly speaking, we use results from Markov chain perturbation theory to show that if the probabilities are perturbed by a small error δ, then the change in the game value is O(δ n² / p_min^{2k}) (see Sect. 2.1). It is worth mentioning that a somewhat related result was obtained recently for the class of so-called almost-sure ergodic games (not necessarily with perfect information) [10]: it was shown that for this class of games there is an ε-optimal strategy with a rational representation of bounded denominator. The second issue is resolved through repeated applications of the pseudo-polynomial algorithm on a truncated game. After each such application, we are in one of the following situations: either the value of the game has already been approximated within the required accuracy, or it is guaranteed that the range of the rewards can be shrunk by a constant factor without changing the value of the game (see Sects. 2.3, 2.4).
Since BWR-games with a constant number of random positions admit a pseudo-polynomial algorithm, as was recently shown [5,6], we obtain the following results.

Corollary 1
(i) There is an FPTAS that solves, within an absolute error guarantee, in uniformly ε-optimal strategies, any BWR-game with a constant number of random positions, 1/p_min = poly(n), and rewards in [−1, 1].
(ii) There is an FPTAS that solves, within a relative error guarantee, in ε-optimal strategies, any BWR-game with a constant number of random positions, 1/p_min = poly(n), and non-negative rational rewards.
(iii) There is an FPTAS that solves, within a relative error guarantee, in uniformly ε-optimal strategies, any BW-game with non-negative (rational) rewards.

The Effect of Perturbation
Our approximation schemes are based on the following three lemmas. The first one (which is known) says that a linear change in the rewards corresponds to a linear change in the game value. In our approximation schemes, we truncate and scale the rewards to be able to run the pseudo-polynomial algorithm in polynomial time. We need the lemma to bound the error in the game value resulting from the truncation.
Lemma 1 Let G = (G, P, r) and Ĝ = (G, P, r̂) be two BWR-games that differ only in their rewards. Then, for any starting position v, |μ_v(G) − μ_v(Ĝ)| ≤ ‖r − r̂‖_∞.   (5)

Proof This uses only standard techniques, and we give the proof only for completeness. Let (s*_W, s*_B) and (ŝ_W, ŝ_B) be pairs of optimal strategies for (G, v) and (Ĝ, v), respectively. Denote by ρ*, ρ̂, ρ′, and ρ″ the (arc) limiting distributions for the Markov chains starting from v0 and corresponding to the pairs (s*_W, s*_B), (ŝ_W, ŝ_B), (s*_W, ŝ_B), and (ŝ_W, s*_B), respectively. To see the first bound in (5), note that by the saddle point property, μ_v(G) = (ρ*)^T r ≤ (ρ′)^T r and μ_v(Ĝ) = ρ̂^T r̂ ≥ (ρ′)^T r̂; since ‖ρ′‖_1 = 1 (it is a probability distribution), this gives μ_v(G) − μ_v(Ĝ) ≤ (ρ′)^T (r − r̂) ≤ ‖r − r̂‖_∞, and (5) follows. The second bound can be shown similarly, using ρ″.
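To make the linearity concrete: for a fixed situation, the value is a stationary-distribution-weighted average of the rewards, so an affine change r → a·r + b changes the value to a·μ + b. A quick numerical check on a small weighted Markov chain (the chain and all numbers are illustrative, not from the paper):

```python
def stationary(P, iters=500):
    """Stationary distribution of an ergodic chain via power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

P = [[0.9, 0.1], [0.5, 0.5]]                 # ergodic two-state chain
r = [6.0, 12.0]                              # per-state rewards
pi = stationary(P)                           # = (5/6, 1/6)
mu = sum(p * x for p, x in zip(pi, r))       # value = stationary average = 7
mu_affine = sum(p * (2 * x + 3) for p, x in zip(pi, r))
assert abs(mu - 7.0) < 1e-9
assert abs(mu_affine - (2 * mu + 3)) < 1e-9  # value transforms affinely
```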
The second lemma, which is new as far as we are aware, states that if we truncate the transition probabilities within a small error ε, then the change in the game value is bounded by a quantity δ(G, ε) = O(ε² n³ / p_min^{2k}).

Proof (of Lemma 2) We apply Lemma 10. Let (s*_W, s*_B) and (ŝ_W, ŝ_B) be pairs of optimal strategies for (G, v) and (Ĝ, v), respectively. Write δ = δ(G, ε). Then, exactly as in the proof of Lemma 1, optimality and Lemma 10 imply that μ_v(G) − μ_v(Ĝ) ≤ δ and, by a symmetric argument, μ_v(Ĝ) − μ_v(G) ≤ δ. Since we assume that the running-time of the pseudo-polynomial algorithm for the original game G depends on the common denominator D of the transition probabilities, we have to truncate the probabilities to remove this dependence on D. By Lemma 2, the value of the game does not change too much after such a truncation.
The third result that we need concerns relative approximation. The main idea is to use the pseudo-polynomial algorithm to test whether the value of the game is larger than a certain threshold. If it is, we already have a good relative approximation. Otherwise, the next lemma says that we can reduce all large rewards without changing the value of the game.

Proof We may assume that r_e = t ≥ (1 + ε)t′, since otherwise there is nothing to prove. Let s* = (s*_W, s*_B) be an optimal situation for (G, v). This means that μ_v(G) = μ_v(G(s*)) = ρ(s*)^T r < t′. Lemma 8 says that ρ_e(s*) > 0 implies ρ_e(s*) ≥ p_min^{2k+1}/n. Hence, r_e ρ_e(s*) ≤ ρ(s*)^T r = μ_v(G) < t′ implies that r_e < t, if ρ_e(s*) > 0. We conclude that ρ_e(s*) = 0, and hence μ_v(Ĝ(s*)) = μ_v(G).
Since r̂ ≤ r, we have μ_v(Ĝ(s)) ≤ μ_v(G(s)) for all situations s. In particular, for any s_W ∈ S_W, μ_v(Ĝ(s_W, s*_B)) ≤ μ_v(G(s_W, s*_B)) ≤ μ_v(G) = μ_v(Ĝ(s*)). Suppose now that some best response ŝ_B of Black to s*_W in Ĝ satisfied μ_v(Ĝ(s*_W, ŝ_B)) < μ_v(Ĝ(s*)). If ρ_e(s*_W, ŝ_B) = 0, then μ_v(G(s*_W, ŝ_B)) = μ_v(Ĝ(s*_W, ŝ_B)) < μ_v(G), which is in contradiction to the optimality of s* in G; and ρ_e(s*_W, ŝ_B) > 0 gives a contradiction with Lemma 8, by the same argument as above. It follows that (s*_W, s*_B) is also optimal in Ĝ.

Absolute Approximation
In this section, we assume that r− = −1 and r+ = 1, i.e., all rewards are from the interval [−1, 1]. We may also assume that ε ∈ (0, 1) and 1/ε ∈ Z₊. We apply the pseudo-polynomial algorithm A to a truncated game G̃ = (G = (V, E), P̃, r̃), defined by rounding every reward to the nearest integer multiple of ε/4 (denoted r̃ := ⌊r⌉_{ε/4}) and by truncating the vector of probabilities (p_{(v,u)})_{u∈V} for each random node v ∈ V_R, as described in the following lemma.
Proof This is straightforward, and we include the proof only for completeness. Without loss of generality, we assume α_i > 0 for all i (for indices with α_i = 0, simply set the truncated value to 0). Initialize ε_0 = 0 and iterate for i = 1, ..., n: round α_i to a nearby grid point while carrying the accumulated rounding error ε_{i−1} forward, so that the truncated values remain positive, sum up to exactly 1, and each deviates from the corresponding α_i by at most the claimed error; this implies (4).
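One standard way to carry out such a rounding, in the spirit of Lemma 4 (the lemma's exact procedure and error bookkeeping differ), is cascade rounding of the prefix sums: the result is exactly a probability vector on the grid, and each entry moves by at most one grid step; positivity is guaranteed only when every α_i is at least one grid step.

```python
from fractions import Fraction

def truncate_probs(alpha, N):
    """Round probabilities to multiples of 1/N by rounding prefix sums.
    The output sums to exactly 1, and |beta_i - alpha_i| <= 1/N for all i."""
    beta, prev, cum = [], Fraction(0), Fraction(0)
    for a in alpha:
        cum += Fraction(a)
        rounded = Fraction(round(cum * N), N)   # nearest grid point to the prefix sum
        beta.append(rounded - prev)
        prev = rounded
    return beta

beta = truncate_probs([Fraction(1, 3)] * 3, 10)
assert sum(beta) == 1
assert all(abs(b - Fraction(1, 3)) <= Fraction(1, 10) for b in beta)
```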
Note that the above technique yields an approximation algorithm with polynomial running-time only for k = O(1), even if the pseudo-polynomial algorithm A works for arbitrary k.

Relative Approximation
Let G = (G, P, r) be a BWR-game on G with non-negative integral rewards, that is, r− = 0 and min_{e: r_e > 0} r_e ≥ 1. The algorithm is given as Algorithm 1. The main idea is to truncate the rewards, scaled by a certain factor 1/K, and to use the pseudo-polynomial algorithm on the truncated game Ĝ. If the value μ_w(Ĝ) in the truncated game from the starting node w is large enough (step 4), then we get a good relative approximation of the original value and we are done. Otherwise, the information that μ_w(Ĝ) is small allows us to reduce the maximum reward by a factor of 2 in the original game (step 9); we invoke Lemma 3 for this. Thus, the algorithm terminates in polynomial time (in the bit length of R(G)). To remove the dependence on D in the running-time, we also need to truncate the transition probabilities. In the algorithm, we denote by P̂ the transition probabilities obtained from P by applying Lemma 4 with accuracy ε′.

Algorithm 1: FPTAS-BWR(G, w, ε)
Data: a BWR-game G = (G = (V, E), P, r), a starting vertex w ∈ V, and an accuracy ε
Result: an ε-optimal pair (s̄_W, s̄_B) for the game (G, w)

Lemma 6 Let A be a pseudo-polynomial algorithm that solves any BWR-game in time τ(n, D, R). Then, for any ε > 0, Algorithm 1 returns a relatively ε-optimal pair of strategies for the game (G, w).

Proof The bound on the running-time follows since, by step 9, each recursive call is made on a game G̃ with r+(G̃) reduced by a factor of at least one half. Moreover, the rewards in the truncated game Ĝ are non-negative integers with a maximum value of r+(Ĝ) ≤ θ, and the smallest common denominator of the transition probabilities is at most D̂ := 2/ε′. Thus the time taken by algorithm A in each recursive call is at most τ(n, D̂, θ).

What remains is to argue by induction (on r+(G)) that the algorithm returns a pair s̄ = (s̄_W, s̄_B) of ε-optimal strategies. For the base case, we have either r+(G) ≤ 2, or the value returned by the pseudo-polynomial algorithm A satisfies μ_w(Ĝ) ≥ 3/ε. In the former case, note that since ‖P − P̂‖_∞ ≤ ε′ and r+(G) ≤ 2, Lemma 2 implies that the pair s̄ = (s̄_W, s̄_B) returned in step 2 is absolutely ε̄-optimal, where ε̄ = 2δ(G, ε′) < ε p_min^{2k+1}/n. Lemma 8 and the integrality of the non-negative rewards imply that, for any situation s, μ_w(G(s)) ≥ p_min^{2k+1}/n whenever μ_w(G(s)) > 0. Thus, if μ_w(G) > 0, then ε̄ ≤ ε μ_w(G), and it follows that (s̄_W, s̄_B) is relatively ε-optimal.
Suppose now that A determines that μ_w(Ĝ) ≥ 3/ε in step 4, and hence the algorithm returns (s̄_W, s̄_B). Note that (1/K)·r_e − 1 ≤ r̂_e ≤ (1/K)·r_e for all e ∈ E, and ‖P − P̂‖_∞ ≤ ε′. Hence, by Lemmas 1 and 2, the pair (s̄_W, s̄_B) returned in step 5 is absolutely (K + 2δ(G, ε′))-optimal, and hence 2K-optimal, for G.
Note that the running-time in the above lemma simplifies to poly(n, 1/ε, 1/ p min ) · log R for k = O(1).
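The control flow of Algorithm 1 can be sketched as follows. The scaling factor K and the acceptance threshold below are illustrative placeholders for the precise choices made in Algorithm 1, and `solve` is a hypothetical stand-in for the assumed pseudo-polynomial oracle A; for the test, it is instantiated by a trivial one-player "best self-loop" game whose exact value is the maximum reward.

```python
from fractions import Fraction

def relative_fptas(rewards, solve, eps):
    """Skeleton of the halving loop: solve a scaled/truncated game; if its
    value clears the threshold, scale back and return; otherwise cap the
    large rewards (cf. Lemma 3) and repeat."""
    eps = Fraction(eps)
    rewards = list(rewards)
    while max(rewards) > 2:
        rmax = max(rewards)
        K = eps * rmax / (3 * len(rewards))      # illustrative scaling factor
        val = solve([int(r // K) for r in rewards])
        if val >= 3 / eps:                       # value is large: good relative answer
            return K * val
        rewards = [min(r, rmax // 2) for r in rewards]  # cap rewards, halving r+
    return Fraction(solve(rewards))

# Toy oracle: White owns a single position with one self-loop per reward,
# so the exact game value is just the maximum reward.
approx = relative_fptas([5, 100, 7], max, Fraction(1, 2))
assert 100 / (1 + Fraction(1, 2)) <= approx <= 100
```

In the real algorithm, `solve` also returns the strategies, which are passed back up the recursion.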

Uniformly Relative Approximation for BW-Games
The FPTAS in Theorem 6 does not necessarily return a uniformly ε-optimal situation, even if the given pseudo-polynomial algorithm A provides a uniformly optimal solution. For BW-games, we can modify this FPTAS to return a uniformly ε-optimal situation. The algorithm is given as Algorithm 2. The main difference is that, when we recurse on a game with reduced rewards (step 11), we also have to delete all positions that have large values μ_v(G) in the truncated game. This is similar to the approach used to decompose a BW-game into ergodic classes [21]. However, the main technical difficulty is that, with approximate equilibria, White or Black might still have some incentive to move to a lower- or higher-value class, respectively, since the values obtained are just approximations of the optimal values. We show that such a move is not much more profitable for either White or Black. Recall that we assume that the rewards are non-negative integers.

Algorithm 2: FPTAS-BW(G, ε)
Data: a BW-game G = (G = (V = V_B ∪ V_W, E), r), and an accuracy ε
Result: a uniformly ε-optimal pair (s̄_W, s̄_B) for G
1 if r+(G) ≤ 2 then
2     return A(G)
3 end

Lemma 7 Let A be a pseudo-polynomial algorithm that solves, in uniformly optimal strategies, any BW-game G in time τ(n, R). Then for any ε > 0, there is an algorithm that solves, in uniformly relatively ε-optimal strategies, any BW-game G, in time O(τ(n, 2(1+ε′)²n/ε′) + poly(n) · h), where h = log R + 1 and ε′ = ln(1+ε)/(3h) ≈ ε/(3h).

Proof The algorithm FPTAS-BW(G, ε) is given as Algorithm 2. The bound on the running-time is immediate: in step 9, each recursive call is made on a game G̃ with r+(G̃) reduced by a factor of at least one half. Moreover, the rewards in the truncated game Ĝ are integral with a maximum value of r+(Ĝ) ≤ r+(G). Thus, the time that algorithm A needs in each recursive call is bounded from above by τ(n, 2(1+ε′)²n/ε′). So it remains to argue (by induction) that the algorithm returns a pair (s̄_W, s̄_B) of (relatively) uniformly ε-optimal strategies. Let us index the recursive calls of the algorithm by i = 1, 2, ..., h′ ≤ h, and denote by G^(i) the game input to the ith recursive call (so G^(1) = G), and by s̄^(i) the pair of strategies returned either in steps 2, 4, 5, or 11. Similarly, we denote by Ĝ^(i) the corresponding truncated game, by ŝ^(i) the uniformly optimal situation that A returns for Ĝ^(i), and by U^(i) the set of positions with large value in Ĝ^(i).
Note that Claim 1 implies that the game Ĝ^(i) restricted to U^(i) is well defined.

Proof (of Claim 2) This follows from Lemma 1, by the uniform optimality of ŝ^(i) in Ĝ^(i) and the fact that μ_w(Ĝ^(i)) ≥ 1/ε′ for every w ∈ U^(i).

Claim 3 For all u ∈ U^(i) and v ∈ V^(i)\U^(i), we have μ_u(Ĝ^(i)) ≥ 1/ε′ > μ_v(Ĝ^(i)).

Proof For u ∈ U^(i) and v ∈ V^(i)\U^(i), we have μ_u(Ĝ^(i)) ≥ 1/ε′ and μ_v(Ĝ^(i)) < 1/ε′ by the definition of U^(i); the corresponding bounds on the values in G^(i) follow by Lemma 1. We observe that the strategy s̄^(i), returned by the ith call to the algorithm, is determined as follows (cf. step 11): for w ∈ U^(i), s̄^(i)(w) = ŝ^(i)(w) is chosen according to the solution of the game Ĝ^(i), and for w ∈ V^(i)\U^(i), s̄^(i)(w) is determined by the (recursive) solution of the residual game G̃^(i) = G^(i+1). The following claim states that the value of any vertex u ∈ V^(i)\U^(i) in the residual game is a good (relative) approximation of its value in the original game G^(i).

Claim 4 For all i = 1, ..., h′ and any u ∈ V^(i)\U^(i), the value μ_u(G^(i+1)) of the residual game is a good relative approximation of μ_u(G^(i)).

Proof Let s* = (s*_W, s*_B) and s̄ be optimal situations for (G^(i+1), u) and (Ĝ^(i), u), respectively. Let us extend s̄ to a situation in G^(i) by setting s̄(v) = ŝ^(i)(v) for all v ∈ U^(i), where ŝ^(i) is the situation returned by the pseudo-polynomial algorithm in step 4. Then, by Claim 2.4(i′), White has no way to escape to U^(i); for similar reasons, the same holds for Black, as long as v is reachable from u in the graph G(s*_W, s*_B). Then (by Lemma 1) μ_u(G^(i)) ≥ K^(i) μ_u(Ĝ^(i)) ≥ K^(i)/ε′, and together with the optimality of (ŝ_W, ŝ_B) in Ĝ^(i) this yields the claim. Let us fix ε_{h′} = ε′, and for i = h′ − 1, h′ − 2, ..., 1, let us choose ε_i such that 1 + ε_i ≥ (1 + ε′)(1 + 2ε′)(1 + ε_{i+1}). Next, we claim that the strategies s̄^(i) are ε_i-optimal in the game Ḡ^(i) obtained from G^(i) by reducing the rewards according to step 9. Thus, Lemma 3 yields that μ_v(Ḡ^(i)) = μ_v(G^(i)).

Proof of (11): Consider an arbitrary strategy s_W ∈ S^(i)_W for White. Suppose first that w ∈ U^(i). Note that, by Claim 1(iii), s̄^(i)_B keeps the play inside U^(i). Suppose therefore that v = s_W(u) ∉ U^(i) for some u ∈ V_W ∩ U^(i) such that u is reachable from w in the graph G(s_W, s̄^(i)_B).

Note that s̄^(i)_B coincides with ŝ^(i)_B on U^(i). Thus, we get the following series of inequalities: the equality holds since v is reachable from w in the graph G(s_W, s̄^(i)_B); the first inequality holds by (13); the second inequality holds because of (10); the third one follows from Claim 3; and the fourth inequality holds by the choice of the ε_i. If w ∈ V^(i)\U^(i), then a similar argument as in (15) and (16) shows the corresponding bound. Thus, (11) follows.
Proof of (12): Consider an arbitrary strategy s_B ∈ S^(i)_B for Black. If v = s_B(u) ∉ U^(i) for some u reachable from w in the graph G(s̄^(i)_W, s_B), then we get by (14) and (10) the analogous series of inequalities, where the last inequality follows from the fact that, for all i = 1, ..., h′, (1 + ε′)(1 + 2ε′)(1 + ε_{i+1}) ≤ 1 + ε_i. Finally, to finish the proof of Lemma 7, we set the ε_i's and ε′ such that (1 + ε′)^{3h} ≤ 1 + ε. Note that our choice of ε′ = ln(1+ε)/(3h) satisfies this, as (1 + ε′)^{3h} ≤ e^{3hε′} = 1 + ε.

Concluding Remarks
In this paper, we have shown that computing the game values of classes of stochastic mean payoff games with perfect information and a constant number of random positions admits approximation schemes, provided that the class of games at hand can be solved in pseudo-polynomial time.
To conclude this paper, let us raise a number of open questions: 1. First, in the conference version of this paper [2], we claimed that, up to some technical requirements, a pseudo-polynomial algorithm for a class of stochastic mean payoff games implies that this class has polynomial smoothed complexity (smoothed analysis is a paradigm to analyze algorithms with poor worst-case and good practical performance. Since its invention, it has been applied to a variety of algorithms and problems to explain their performance or complexity, respectively [31,38]). However, the proof of this result is flawed. In particular, the proof of a lemma that is not contained in the proceedings version, but only in the accompanying technical report (Oberwolfach Preprints, OWP 2010-22, Lemma 4.3) is flawed.
The reason for this is relatively simple: If we are just looking for an optimal solution, then we can show that the second-best solution is significantly worse than the best solution. For two-player games, where one player maximizes and the other player minimizes, we have an optimization problem for either player, given an optimal strategy of the other player. However, the optimal strategy of the other player depends on the random rewards of the edges. Thus, the two strategies are dependent. As a consequence, we cannot use the full randomness of the rewards to use an isolation lemma to compare the best and second-best response to the optimal strategy of the other player. Therefore, the question, whether stochastic mean payoff games have polynomial smoothed complexity, remains open. 2. In Sect. 2.3 we gave an approximation scheme that relatively approximates the value of a BWR-game from any starting position. If we apply this algorithm from different positions, we are likely to get two different relatively ε-optimal strategies. In Sect. 2.4 we have shown that a modification of the algorithm in Sect. 2.3 yields a uniformly relatively ε-optimal strategies when there are no random positions. It remains an interesting question whether this can be extended to BWR-games with a constant number of random positions. 3. Is it true that pseudo-polynomial solvability of a class of stochastic mean payoff games implies polynomial smoothed complexity? In particular, do mean payoff games have polynomial smoothed complexity? 4. Related to Question 3: is it possible to prove an isolation lemma for (classes of) stochastic mean payoff games? We believe that this is not possible and that different techniques are required to prove smoothed polynomial complexity of these games. 5. While stochastic mean payoff games include parity games as a special case, the probabilistic model that we used here does not make sense for parity games. 
However, parity games can be solved in quasi-polynomial time [8]. One wonders whether they also have polynomial smoothed complexity under a reasonable probabilistic model.

6. Finally, removing the assumption that k is constant in the above results remains a challenging open problem that seems to require totally new ideas. Another interesting question is whether stochastic mean payoff games with perfect information can be solved in parameterized pseudo-polynomial time with the number k of random positions as the parameter.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix: Lemmas About Markov Chains
For a situation $s$, let $d_{G(s)}(u, v)$ be the stochastic distance from $u$ to $v$ in $G(s)$, which is the shortest (directed) distance between vertices $u$ and $v$ in the graph obtained from $G(s)$ by setting the length of every deterministic arc [i.e., one with $p_e(s) = 1$] to 0 and of every stochastic arc [i.e., one with $p_e(s) \in (0, 1)$] to 1. Let $\lambda = \lambda(G) = \max\{d_{G(s)}(u, v) \mid u, v \in V,\ d_{G(s)}(u, v)$ is finite, and $s$ is a situation$\}$ be the stochastic diameter of $G$. Clearly, $\lambda(G) \le k(G)$. Some of our bounds will be given in terms of $\lambda$ instead of $k$, which implies stronger bounds on the running times of some of the approximation schemes.

A set of vertices $U \subseteq V$ is called an absorbing class of the Markov chain $M$ if there is no arc with positive probability from $U$ to $V \setminus U$, i.e., $U$ can never be left once it is entered, and $U$ is strongly connected, i.e., any vertex of $U$ is reachable from any other vertex of $U$.

Lemma Let $M = (G = (V, E), P)$ be a Markov chain on $n$ vertices with starting vertex $u$. Then the limiting probability of any vertex $v \in V$ is either 0 or at least $p_{\min}^{2\lambda}/n$, and the limiting probability of any arc $(u, v) \in E$ is either 0 or at least $p_{\min}^{2\lambda+1}/n$.

Proof Let $\pi$ and $\rho$ denote the limiting vertex- and arc-distribution, respectively. Let $C$ be any absorbing class of $M$ reachable from $u$. We deal with $\pi$ first. Clearly, for any $v$ that does not lie in any of these absorbing classes, we have $\pi_v = 0$. It remains to show that for every $v \in C$, we have $\pi_v \ge p_{\min}^{2\lambda}/n$. Denote by $\pi_C = \sum_{v \in C} \pi_v$ the total limiting probability of $C$. Note that $\pi_C$ is equal to the probability that we reach some vertex of $C$ starting from $u$. Since there is a simple path in $G$ from $u$ to $C$ with at most $\lambda$ stochastic vertices, this probability is at least $p_{\min}^{\lambda}$. Furthermore, there exists a vertex $v^* \in C$ with $\pi_{v^*} \ge \pi_C/|C| \ge p_{\min}^{\lambda}/n$. Now for any $v \in C$, there exists again a simple path in $G$ from $v^*$ to $v$ with at most $\lambda$ stochastic positions, so the probability that we reach $v$ starting from $v^*$ is at least $p_{\min}^{\lambda}$. It follows that $\pi_v \ge p_{\min}^{2\lambda}/n$.
Now for $\rho$, note that $\rho_{(u,v)} \ge \pi_u p_{\min}$ if $(u, v) \in E$. Since $\pi_u$ is either 0 or at least $p_{\min}^{2\lambda}/n$, the claim follows.
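The dichotomy established above (every limiting probability is either 0 or at least $p_{\min}^{2\lambda}/n$) is easy to check numerically. The following sketch is an illustration only, not part of the paper's formal development: the 4-vertex chain, its stochastic diameter $\lambda = 1$, and $p_{\min} = 0.3$ are hand-picked assumptions. It estimates the limiting distribution by Cesàro averaging and verifies the bound.

```python
import numpy as np

# Illustrative 4-vertex chain (not from the paper): vertices 0, 1 are
# transient, vertices 2, 3 form the unique absorbing class.
P = np.array([
    [0.0, 0.5, 0.5, 0.0],   # vertex 0: stochastic arcs
    [0.0, 0.0, 1.0, 0.0],   # vertex 1: deterministic arc into the class
    [0.0, 0.0, 0.0, 1.0],   # vertex 2: deterministic arc
    [0.0, 0.0, 0.7, 0.3],   # vertex 3: stochastic arcs
])
n = P.shape[0]
p_min = P[P > 0].min()       # smallest positive transition probability (0.3)
lam = 1                      # stochastic diameter of this chain (by inspection)

# Limiting distribution from starting vertex 0, via Cesaro averaging.
dist = np.zeros(n); dist[0] = 1.0
avg = np.zeros(n)
steps = 20000
for _ in range(steps):
    dist = dist @ P
    avg += dist
avg /= steps

bound = p_min ** (2 * lam) / n   # the lower bound p_min^(2*lambda)/n
for v in range(n):
    # every limiting probability is (numerically) 0 or at least the bound
    assert avg[v] < 1e-4 or avg[v] >= bound
```

Here the transient vertices get limiting probability 0, while the two vertices of the absorbing class get $0.7/1.7 \approx 0.41$ and $1/1.7 \approx 0.59$, both comfortably above the bound $0.3^2/4 = 0.0225$.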
A Markov chain is said to be irreducible if its state space is a single absorbing class. For an irreducible Markov chain, let $m_{uv}$ denote the mean first passage time from vertex $u$ to vertex $v$, and $m_{vv}$ denote the mean return time to vertex $v$: $m_{uv}$ is the expected number of steps to reach vertex $v$ for the first time, starting from vertex $u$, and $m_{vv}$ is the expected number of steps to return to vertex $v$ for the first time, starting from vertex $v$. These quantities satisfy
$$m_{uv} = 1 + \sum_{w \ne v} p_{uw}\, m_{wv}. \qquad (17)$$
The following lemma relates these values to the sensitivity of the limiting probabilities of a Markov chain.

Lemma 9 (Cho and Meyer [12]) Let $\varepsilon > 0$ and let $M = (G = (V, E), P)$ be an irreducible Markov chain. For any transition probabilities $\tilde{P}$ with $\|P - \tilde{P}\|_\infty \le \varepsilon$ such that the corresponding Markov chain $\tilde{M}$ is also irreducible, we have
$$\|\pi - \tilde\pi\|_\infty \le \frac{1}{2}\,\varepsilon \cdot \max_{v \in V} \max_{u \in V,\, u \ne v} m_{uv}.$$

Proof Fix the starting vertex $u_0 \in V$. Let $\pi$ and $\tilde\pi$ denote the limiting distributions corresponding to $M$ and $\tilde{M}$, respectively. We first bound $\|\pi - \tilde\pi\|_\infty$. Since $\varepsilon < p_{\min}$, we have $\tilde{p}_{uv} = 0$ if and only if $p_{uv} = 0$. It follows that $M$ and $\tilde{M}$ have the same absorbing classes. Let $C_1, \ldots, C_r$ denote these classes. Denote by $\pi_{C_i} = \sum_{v \in C_i} \pi_v$ and $\tilde\pi_{C_i} = \sum_{v \in C_i} \tilde\pi_v$ the total limiting probability of $C_i$ with respect to $\pi$ and $\tilde\pi$, respectively. Furthermore, let $\pi_{|i}$ and $\tilde\pi_{|i}$ be the limiting distributions, corresponding to $M$ and $\tilde{M}$, respectively, conditioned on the event that the Markov process is started in $C_i$ (i.e., $u_0 \in C_i$). Note that these conditional limiting distributions describe the limiting distributions of the irreducible Markov chains restricted to $C_i$. By Lemma 9, we have
$$\|\pi_{|i} - \tilde\pi_{|i}\|_\infty \le \frac{1}{2}\,\varepsilon \cdot \max_{v \in C_i} \max_{u \in C_i,\, u \ne v} m_{uv}.$$
Fix $v \in C_i$ and let $h = \max\{d_G(u, v) \mid u \in C_i\}$. For $j = 0, 1, \ldots, h$, let $X_j = \max\{m_{uv} \mid u \in C_i,\ d_G(u, v) = j\}$, and let $\ell \in \arg\max\{X_j \mid j \in \{1, \ldots, h\}\}$. Then $X_0 \le |C_i|$ and, for $j = 1, \ldots, h$, (17) implies that
$$X_j \le |C_i| + p_{\min} X_{j-1} + (1 - p_{\min}) X_\ell. \qquad (18)$$
Indeed, for a vertex $u \in C_i$ with $d_G(u, v) = j$, there is a path $Q$ from $u$ to $v$ with $j$ stochastic arcs. Let $u'$ be the vertex closest to $u$ on $Q$ such that $d_G(u', v) = j - 1$, and let $u''$ be the vertex on $Q$ preceding $u'$. Then $u''$ is stochastic, and hence, by (17),
$$m_{u''v} \le 1 + p_{u''u'}\, m_{u'v} + (1 - p_{u''u'})\, X_\ell \le 1 + p_{\min} X_{j-1} + (1 - p_{\min}) X_\ell,$$
using the fact that $X_j \le X_\ell$ for all $j$ and $p_{u''u'} \ge p_{\min}$. Finally, $m_{uv} \le |C_i| - 1 + m_{u''v}$ implies (18). Applying (18) for $j = 1, \ldots, \ell$ yields
$$X_\ell \le |C_i|\, \frac{1 - p_{\min}^{\ell}}{1 - p_{\min}} + p_{\min}^{\ell} X_0 + (1 - p_{\min}^{\ell}) X_\ell.$$
This implies that
$$X_\ell \le |C_i|\, \frac{1 - p_{\min}^{\ell+1}}{1 - p_{\min}}\, p_{\min}^{-\ell} \le |C_i|\,(\lambda + 1)\, p_{\min}^{-\lambda}.$$
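For intuition, the mean first passage times in (17) can be computed by solving one linear system per target vertex, which also allows a numerical sanity check of the perturbation bound of Lemma 9 on a concrete chain. The sketch below is illustrative only: the 3-state chain, the row perturbation, and $\varepsilon = 0.02$ are arbitrary choices, not objects from the paper.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible chain: solve pi P = pi, sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def mean_first_passage(P):
    """m[u, v] for u != v, from recurrence (17): m_uv = 1 + sum_{w != v} p_uw m_wv."""
    n = P.shape[0]
    m = np.zeros((n, n))
    for v in range(n):
        idx = [u for u in range(n) if u != v]
        Q = P[np.ix_(idx, idx)]                  # transitions among states != v
        sol = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        for u, val in zip(idx, sol):
            m[u, v] = val
    return m

# An arbitrary irreducible 3-state chain and a small perturbation of row 0.
P = np.array([[0.1, 0.9, 0.0],
              [0.5, 0.0, 0.5],
              [0.4, 0.4, 0.2]])
Pt = P.copy()
Pt[0, 0] += 0.01; Pt[0, 1] -= 0.01               # ||P - Pt||_inf = 0.02
eps = 0.02

pi, pit = stationary(P), stationary(Pt)
m = mean_first_passage(P)
max_m = m[m > 0].max()                           # max over u != v of m_uv
# Lemma 9: ||pi - pit||_inf <= (1/2) * eps * max_{v} max_{u != v} m_uv
assert np.max(np.abs(pi - pit)) <= 0.5 * eps * max_m
```

On this chain the stationary distribution shifts by roughly $2.5 \cdot 10^{-3}$, well within the bound of about $4.2 \cdot 10^{-2}$, so the inequality holds with considerable slack.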
Proof Without loss of generality we assume that $u_0 \notin C_i$. For a transient vertex $v$ (i.e., one for which $\pi_v = 0$), let $y_v$ and $\tilde{y}_v$ be the absorption probabilities into the class $C_i$ in $M$ and $\tilde{M}$, respectively. In particular, $y_{u_0} = \pi_{C_i}$. Let $p_{vC_i} = \sum_{u \in C_i} p_{vu}$ and $\tilde{p}_{vC_i} = \sum_{u \in C_i} \tilde{p}_{vu}$. Then we have
$$y_v = p_{vC_i} + \sum_{w \text{ transient}} p_{vw}\, y_w. \qquad (19)$$
Similarly,
$$\tilde{y}_v = \tilde{p}_{vC_i} + \sum_{w \text{ transient}} \tilde{p}_{vw}\, \tilde{y}_w. \qquad (20)$$
Write $\Delta_v := |\tilde{y}_v - y_v|$. Subtracting (19) from (20)
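System (19) determines the absorption probabilities $y_v$ as the solution of a linear system over the transient vertices: $(I - Q)y = r$, where $Q$ is the transient-to-transient block of $P$ and $r_v = p_{vC_i}$. As a small illustration (the chain below is a made-up example, not one from the paper):

```python
import numpy as np

# Made-up chain: vertices 0, 1 are transient; {2} and {3} are absorbing classes.
P = np.array([
    [0.0, 0.5, 0.3, 0.2],
    [0.4, 0.0, 0.1, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
transient = [0, 1]
C = [2]                                      # target absorbing class C_i

# System (19): y_v = p_{v,C_i} + sum over transient w of p_{vw} y_w.
Q = P[np.ix_(transient, transient)]          # transient -> transient block
r = P[np.ix_(transient, C)].sum(axis=1)      # r_v = p_{v,C_i}
y = np.linalg.solve(np.eye(len(transient)) - Q, r)
# With starting vertex u_0 = 0, y[0] equals pi_{C_i} in the notation of the proof.
```

Here $y_0 = 0.4375$ and $y_1 = 0.275$; together with the absorption probabilities into the other class $\{3\}$ ($0.5625$ and $0.725$), they sum to 1 for each transient vertex, as they must.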