# USTC Vision and Multimedia (VIM) - 视觉与多媒体研究组.pdf

Convex Optimization Convex Optimization Stephen Boyd Department of Electrical Engineering Stanford University Lieven Vandenberghe Electrical Engineering Department University of California, Los Angeles cambridge university press Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paolo, Delhi Cambridge University Press The Edinburgh Building, Cambridge, CB2 8RU, UK Published in the United States of America by Cambridge University Press, New York http://www.cambridge.org Information on this title: www.cambridge.org/9780521833783 c Cambridge University Press 2004 This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 2004 Seventh printing with corrections 2009 Printed in the United Kingdom at the University Press, Cambridge A catalogue record for this publication is available from the British Library Library of Congress Cataloguing-in-Publication data Boyd, Stephen P. Convex Optimization / Stephen Boyd & Lieven Vandenberghe p. cm. Includes bibliographical references and index. ISBN 0 521 83378 7 1. Mathematical optimization. 2. Convex functions. I. Vandenberghe, Lieven. II. Title. QA402.5.B69 2004 519.6–dc22 2003063284 ISBN 978-0-521-83378-3 hardback Cambridge University Press has no responsiblity for the persistency or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate. For Anna, Nicholas, and Nora Daniël and Margriet Contents Preface xi 1 Introduction 1 1.1 Mathematical optimization . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Least-squares and linear programming . . . . . . . . . . . . . . . . . . 4 1.3 Convex optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Nonlinear optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.6 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 I Theory 2 Convex sets 2.1 Affine and convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Some important examples . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Operations that preserve convexity . . . . . . . . . . . . . . . . . . . . 2.4 Generalized inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Separating and supporting hyperplanes . . . . . . . . . . . . . . . . . . 2.6 Dual cones and generalized inequalities . . . . . . . . . . . . . . . . . . Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 21 21 27 35 43 46 51 59 60 3 Convex functions 67 3.1 Basic properties and examples . . . . . . . . . . . . . . . . . . . . . . 67 3.2 Operations that preserve convexity . . . . . . . . . . . . . . . . . . . . 79 3.3 The conjugate function . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.4 Quasiconvex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.5 Log-concave and log-convex functions . . . . . . . . . . . . . . . . . . 104 3.6 Convexity with respect to generalized inequalities . . . . . . . . . . . . 108 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 viii Contents 4 Convex optimization problems 127 4.1 Optimization problems . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.2 Convex optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 4.3 Linear optimization problems . . . . . . . . . . . . . . . . . . . . . . . 146 4.4 Quadratic optimization problems . . . . . . . . . . . . . . . . . . . . . 152 4.5 Geometric programming . . . . . . . . . . . . . . . . . . . . . . . . . . 160 4.6 Generalized inequality constraints . . . . . . . . . . . . . . . . . . . . . 167 4.7 Vector optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 5 Duality 215 5.1 The Lagrange dual function . . . . . . . . . . . . . . . . . . . . . . . . 215 5.2 The Lagrange dual problem . . . . . . . . . . . . . . . . . . . . . . . . 223 5.3 Geometric interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 232 5.4 Saddle-point interpretation . . . . . . . . . . . . . . . . . . . . . . . . 237 5.5 Optimality conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 5.6 Perturbation and sensitivity analysis . . . . . . . . . . . . . . . . . . . 249 5.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 5.8 Theorems of alternatives . . . . . . . . . . . . . . . . . . . . . . . . . 258 5.9 Generalized inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 264 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 II Applications 289 6 Approximation and fitting 291 6.1 Norm approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 6.2 Least-norm problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 6.3 Regularized approximation . . . . . . . . . . . . . . . . . . . . . . . . 305 6.4 Robust approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . 318 6.5 Function fitting and interpolation . . . . . . . . . . . . . . . . . . . . . 324 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 7 Statistical estimation 351 7.1 Parametric distribution estimation . . . . . . . . . . . . . . . . . . . . 351 7.2 Nonparametric distribution estimation . . . . . . . . . . . . . . . . . . 359 7.3 Optimal detector design and hypothesis testing . . . . . . . . . . . . . 364 7.4 Chebyshev and Chernoff bounds . . . . . . . . . . . . . . . . . . . . . 374 7.5 Experiment design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393 Contents ix 8 Geometric problems 397 8.1 Projection on a set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397 8.2 Distance between sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 402 8.3 Euclidean distance and angle problems . . . . . . . . . . . . . . . . . . 405 8.4 Extremal volume ellipsoids . . . . . . . . . . . . . . . . . . . . . . . . 410 8.5 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416 8.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 8.7 Placement and location . . . . . . . . . . . . . . . . . . . . . . . . . . 432 8.8 Floor planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447 III Algorithms 455 9 Unconstrained minimization 457 9.1 Unconstrained minimization problems . . . . . . . . . . . . . . . . . . 457 9.2 Descent methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463 9.3 Gradient descent method . . . . . . . . . . . . . . . . . . . . . . . . . 466 9.4 Steepest descent method . . . . . . . . . . . . . . . . . . . . . . . . . 475 9.5 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484 9.6 Self-concordance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496 9.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514 10 Equality constrained minimization 521 10.1 Equality constrained minimization problems . . . . . . . . . . . . . . . 521 10.2 Newton’s method with equality constraints . . . . . . . . . . . . . . . . 525 10.3 Infeasible start Newton method . . . . . . . . . . . . . . . . . . . . . . 531 10.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 11 Interior-point methods 561 11.1 Inequality constrained minimization problems . . . . . . . . . . . . . . 561 11.2 Logarithmic barrier function and central path . . . . . . . . . . . . . . 562 11.3 The barrier method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568 11.4 Feasibility and phase I methods . . . . . . . . . . . . . . . . . . . . . . 579 11.5 Complexity analysis via self-concordance . . . . . . . . . . . . . . . . . 585 11.6 Problems with generalized inequalities . . . . . . . . . . . . . . . . . . 596 11.7 Primal-dual interior-point methods . . . . . . . . . . . . . . . . . . . . 609 11.8 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623 x Contents Appendices 631 A Mathematical background 633 A.1 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633 A.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637 A.3 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639 A.4 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 A.5 Linear algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652 B Problems involving two quadratic functions 653 B.1 Single constraint quadratic optimization . . . . . . . . . . . . . . . . . 653 B.2 The S-procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655 B.3 The field of values of two symmetric matrices . . . . . . . . . . . . . . 656 B.4 Proofs of the strong duality results . . . . . . . . . . . . . . . . . . . . 657 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659 C Numerical linear algebra background 661 C.1 Matrix structure and algorithm complexity . . . . . . . . . . . . . . . . 661 C.2 Solving linear equations with factored matrices . . . . . . . . . . . . . . 664 C.3 LU, Cholesky, and LDLT factorization . . . . . . . . . . . . . . . . . . 668 C.4 Block elimination and Schur complements . . . . . . . . . . . . . . . . 672 C.5 Solving underdetermined linear equations . . . . . . . . . . . . . . . . . 681 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684 References 685 Notation 697 Index 701 Preface This book is about convex optimization, a special class of mathematical optimization problems, which includes least-squares and linear programming problems. It is well known that least-squares and linear programming problems have a fairly complete theory, arise in a variety of applications, and can be solved numerically very efficiently. The basic point of this book is that the same can be said for the larger class of convex optimization problems. While the mathematics of convex optimization has been studied for about a century, several related recent developments have stimulated new interest in the topic. The first is the recognition that interior-point methods, developed in the 1980s to solve linear programming problems, can be used to solve convex optimization problems as well. These new methods allow us to solve certain new classes of convex optimization problems, such as semidefinite programs and second-order cone programs, almost as easily as linear programs. The second development is the discovery that convex optimization problems (beyond least-squares and linear programs) are more prevalent in practice than was previously thought. Since 1990 many applications have been discovered in areas such as automatic control systems, estimation and signal processing, communications and networks, electronic circuit design, data analysis and modeling, statistics, and finance. Convex optimization has also found wide application in combinatorial optimization and global optimization, where it is used to find bounds on the optimal value, as well as approximate solutions. We believe that many other applications of convex optimization are still waiting to be discovered. There are great advantages to recognizing or formulating a problem as a convex optimization problem. The most basic advantage is that the problem can then be solved, very reliably and efficiently, using interior-point methods or other special methods for convex optimization. These solution methods are reliable enough to be embedded in a computer-aided design or analysis tool, or even a real-time reactive or automatic control system. There are also theoretical or conceptual advantages of formulating a problem as a convex optimization problem. The associated dual problem, for example, often has an interesting interpretation in terms of the original problem, and sometimes leads to an efficient or distributed method for solving it. We think that convex optimization is an important enough topic that everyone who uses computational mathematics should know at least a little bit about it. In our opinion, convex optimization is a natural next topic after advanced linear algebra (topics like least-squares, singular values), and linear programming. xii Preface Goal of this book For many general purpose optimization methods, the typical approach is to just try out the method on the problem to be solved. The full benefits of convex optimization, in contrast, only come when the problem is known ahead of time to be convex. Of course, many optimization problems are not convex, and it can be difficult to recognize the ones that are, or to reformulate a problem so that it is convex. Our main goal is to help the reader develop a working knowledge of convex optimization, i.e., to develop the skills and background needed to recognize, formulate, and solve convex optimization problems. Developing a working knowledge of convex optimization can be mathematically demanding, especially for the reader interested primarily in applications. In our experience (mostly with graduate students in electrical engineering and computer science), the investment often pays off well, and sometimes very well. There are several books on linear programming, and general nonlinear programming, that focus on problem formulation, modeling, and applications. Several other books cover the theory of convex optimization, or interior-point methods and their complexity analysis. This book is meant to be something in between, a book on general convex optimization that focuses on problem formulation and modeling. We should also mention what this book is not. It is not a text primarily about convex analysis, or the mathematics of convex optimization; several existing texts cover these topics well. Nor is the book a survey of algorithms for convex optimization. Instead we have chosen just a few good algorithms, and describe only simple, stylized versions of them (which, however, do work well in practice). We make no attempt to cover the most recent state of the art in interior-point (or other) methods for solving convex problems. Our coverage of numerical implementation issues is also highly simplified, but we feel that it is adequate for the potential user to develop working implementations, and we do cover, in some detail, techniques for exploiting structure to improve the efficiency of the methods. We also do not cover, in more than a simplified way, the complexity theory of the algorithms we describe. We do, however, give an introduction to the important ideas of self-concordance and complexity analysis for interior-point methods. Audience This book is meant for the researcher, scientist, or engineer who uses mathematical optimization, or more generally, computational mathematics. This includes, naturally, those working directly in optimization and operations research, and also many others who use optimization, in fields like computer science, economics, finance, statistics, data mining, and many fields of science and engineering. Our primary focus is on the latter group, the potential users of convex optimization, and not the (less numerous) experts in the field of convex optimization. The only background required of the reader is a good knowledge of advanced calculus and linear algebra. If the reader has seen basic mathematical analysis (e.g., norms, convergence, elementary topology), and basic probability theory, he or she should be able to follow every argument and discussion in the book. We hope that Preface xiii readers who have not seen analysis and probability, however, can still get all of the essential ideas and important points. Prior exposure to numerical computing or optimization is not needed, since we develop all of the needed material from these areas in the text or appendices. Using this book in courses We hope that this book will be useful as the primary or alternate textbook for several types of courses. Since 1995 we have been using drafts of this book for graduate courses on linear, nonlinear, and convex optimization (with engineering applications) at Stanford and UCLA. We are able to cover most of the material, though not in detail, in a one quarter graduate course. A one semester course allows for a more leisurely pace, more applications, more detailed treatment of theory, and perhaps a short student project. A two quarter sequence allows an expanded treatment of the more basic topics such as linear and quadratic programming (which are very useful for the applications oriented student), or a more substantial student project. This book can also be used as a reference or alternate text for a more traditional course on linear and nonlinear optimization, or a course on control systems (or other applications area), that includes some coverage of convex optimization. As the secondary text in a more theoretically oriented course on convex optimization, it can be used as a source of simple practical examples. Acknowledgments We have been developing the material for this book for almost a decade. Over the years we have benefited from feedback and suggestions from many people, including our own graduate students, students in our courses, and our colleagues at Stanford, UCLA, and elsewhere. Unfortunately, space limitations and shoddy record keeping do not allow us to name everyone who has contributed. However, we wish to particularly thank A. Aggarwal, V. Balakrishnan, A. Bernard, B. Bray, R. Cottle, A. d’Aspremont, J. Dahl, J. Dattorro, D. Donoho, J. Doyle, L. El Ghaoui, P. Glynn, M. Grant, A. Hansson, T. Hastie, A. Lewis, M. Lobo, Z.-Q. Luo, M. Mesbahi, W. Naylor, P. Parrilo, I. Pressman, R. Tibshirani, B. Van Roy, L. Xiao, and Y. Ye. J. Jalden and A. d’Aspremont contributed the time-frequency analysis example in §6.5.4, and the consumer preference bounding example in §6.5.5, respectively. P. Parrilo suggested exercises 4.4 and 4.56. Newer printings benefited greatly from Igal Sason’s meticulous reading of the book. We want to single out two others for special acknowledgment. Arkadi Nemirovski incited our original interest in convex optimization, and encouraged us to write this book. We also want to thank Kishan Baheti for playing a critical role in the development of this book. In 1994 he encouraged us to apply for a National Science Foundation combined research and curriculum development grant, on convex optimization with engineering applications, and this book is a direct (if delayed) consequence. Stephen Boyd Lieven Vandenberghe Stanford, California Los Angeles, California Chapter 1 Introduction In this introduction we give an overview of mathematical optimization, focusing on the special role of convex optimization. The concepts introduced informally here will be covered in later chapters, with more care and technical detail. 1.1 Mathematical optimization A mathematical optimization problem, or just optimization problem, has the form minimize subject to f0 (x) fi (x) ≤ bi , i = 1, . . . , m. (1.1) Here the vector x = (x1 , . . . , xn ) is the optimization variable of the problem, the function f0 : Rn → R is the objective function, the functions fi : Rn → R, i = 1, . . . , m, are the (inequality) constraint functions, and the constants b1 , . . . , bm are the limits, or bounds, for the constraints. A vector x⋆ is called optimal, or a solution of the problem (1.1), if it has the smallest objective value among all vectors that satisfy the constraints: for any z with f1 (z) ≤ b1 , . . . , fm (z) ≤ bm , we have f0 (z) ≥ f0 (x⋆ ). We generally consider families or classes of optimization problems, characterized by particular forms of the objective and constraint functions. As an important example, the optimization problem (1.1) is called a linear program if the objective and constraint functions f0 , . . . , fm are linear, i.e., satisfy fi (αx + βy) = αfi (x) + βfi (y) (1.2) for all x, y ∈ Rn and all α, β ∈ R. If the optimization problem is not linear, it is called a nonlinear program. This book is about a class of optimization problems called convex optimization problems. A convex optimization problem is one in which the objective and constraint functions are convex, which means they satisfy the inequality fi (αx + βy) ≤ αfi (x) + βfi (y) (1.3) 2 1 Introduction for all x, y ∈ Rn and all α, β ∈ R with α + β = 1, α ≥ 0, β ≥ 0. Comparing (1.3) and (1.2), we see that convexity is more general than linearity: inequality replaces the more restrictive equality, and the inequality must hold only for certain values of α and β. Since any linear program is therefore a convex optimization problem, we can consider convex optimization to be a generalization of linear programming. 1.1.1 Applications The optimization problem (1.1) is an abstraction of the problem of making the best possible choice of a vector in Rn from a set of candidate choices. The variable x represents the choice made; the constraints fi (x) ≤ bi represent firm requirements or specifications that limit the possible choices, and the objective value f0 (x) represents the cost of choosing x. (We can also think of −f0 (x) as representing the value, or utility, of choosing x.) A solution of the optimization problem (1.1) corresponds to a choice that has minimum cost (or maximum utility), among all choices that meet the firm requirements. In portfolio optimization, for example, we seek the best way to invest some capital in a set of n assets. The variable xi represents the investment in the ith asset, so the vector x ∈ Rn describes the overall portfolio allocation across the set of assets. The constraints might represent a limit on the budget (i.e., a limit on the total amount to be invested), the requirement that investments are nonnegative (assuming short positions are not allowed), and a minimum acceptable value of expected return for the whole portfolio. The objective or cost function might be a measure of the overall risk or variance of the portfolio return. In this case, the optimization problem (1.1) corresponds to choosing a portfolio allocation that minimizes risk, among all possible allocations that meet the firm requirements. Another example is device sizing in electronic design, which is the task of choosing the width and length of each device in an electronic circuit. Here the variables represent the widths and lengths of the devices. The constraints represent a variety of engineering requirements, such as limits on the device sizes imposed by the manufacturing process, timing requirements that ensure that the circuit can operate reliably at a specified speed, and a limit on the total area of the circuit. A common objective in a device sizing problem is the total power consumed by the circuit. The optimization problem (1.1) is to find the device sizes that satisfy the design requirements (on manufacturability, timing, and area) and are most power efficient. In data fitting, the task is to find a model, from a family of potential models, that best fits some observed data and prior information. Here the variables are the parameters in the model, and the constraints can represent prior information or required limits on the parameters (such as nonnegativity). The objective function might be a measure of misfit or prediction error between the observed data and the values predicted by the model, or a statistical measure of the unlikeliness or implausibility of the parameter values. The optimization problem (1.1) is to find the model parameter values that are consistent with the prior information, and give the smallest misfit or prediction error with the observed data (or, in a statistical 1.1 Mathematical optimization framework, are most likely). An amazing variety of practical problems involving decision making (or system design, analysis, and operation) can be cast in the form of a mathematical optimization problem, or some variation such as a multicriterion optimization problem. Indeed, mathematical optimization has become an important tool in many areas. It is widely used in engineering, in electronic design automation, automatic control systems, and optimal design problems arising in civil, chemical, mechanical, and aerospace engineering. Optimization is used for problems arising in network design and operation, finance, supply chain management, scheduling, and many other areas. The list of applications is still steadily expanding. For most of these applications, mathematical optimization is used as an aid to a human decision maker, system designer, or system operator, who supervises the process, checks the results, and modifies the problem (or the solution approach) when necessary. This human decision maker also carries out any actions suggested by the optimization problem, e.g., buying or selling assets to achieve the optimal portfolio. A relatively recent phenomenon opens the possibility of many other applications for mathematical optimization. With the proliferation of computers embedded in products, we have seen a rapid growth in embedded optimization. In these embedded applications, optimization is used to automatically make real-time choices, and even carry out the associated actions, with no (or little) human intervention or oversight. In some application areas, this blending of traditional automatic control systems and embedded optimization is well under way; in others, it is just starting. Embedded real-time optimization raises some new challenges: in particular, it requires solution methods that are extremely reliable, and solve problems in a predictable amount of time (and memory). 1.1.2 Solving optimization problems A solution method for a class of optimization problems is an algorithm that computes a solution of the problem (to some given accuracy), given a particular problem from the class, i.e., an instance of the problem. Since the late 1940s, a large effort has gone into developing algorithms for solving various classes of optimization problems, analyzing their properties, and developing good software implementations. The effectiveness of these algorithms, i.e., our ability to solve the optimization problem (1.1), varies considerably, and depends on factors such as the particular forms of the objective and constraint functions, how many variables and constraints there are, and special structure, such as sparsity. (A problem is sparse if each constraint function depends on only a small number of the variables). Even when the objective and constraint functions are smooth (for example, polynomials) the general optimization problem (1.1) is surprisingly difficult to solve. Approaches to the general problem therefore involve some kind of compromise, such as very long computation time, or the possibility of not finding the solution. Some of these methods are discussed in §1.4. There are, however, some important exceptions to the general rule that most optimization problems are difficult to solve. For a few problem classes we have 3 4 1 Introduction effective algorithms that can reliably solve even large problems, with hundreds or thousands of variables and constraints. Two important and well known examples, described in §1.2 below (and in detail in chapter 4), are least-squares problems and linear programs. It is less well known that convex optimization is another exception to the rule: Like least-squares or linear programming, there are very effective algorithms that can reliably and efficiently solve even large convex problems. 1.2 Least-squares and linear programming In this section we describe two very widely known and used special subclasses of convex optimization: least-squares and linear programming. (A complete technical treatment of these problems will be given in chapter 4.) 1.2.1 Least-squares problems A least-squares problem is an optimization problem with no constraints (i.e., m = 0) and an objective which is a sum of squares of terms of the form aTi x − bi : minimize f0 (x) = kAx − bk22 = Pk T 2 i=1 (ai x − bi ) . (1.4) Here A ∈ Rk×n (with k ≥ n), aTi are the rows of A, and the vector x ∈ Rn is the optimization variable. Solving least-squares problems The solution of a least-squares problem (1.4) can be reduced to solving a set of linear equations, (AT A)x = AT b, so we have the analytical solution x = (AT A)−1 AT b. For least-squares problems we have good algorithms (and software implementations) for solving the problem to high accuracy, with very high reliability. The least-squares problem can be solved in a time approximately proportional to n2 k, with a known constant. A current desktop computer can solve a least-squares problem with hundreds of variables, and thousands of terms, in a few seconds; more powerful computers, of course, can solve larger problems, or the same size problems, faster. (Moreover, these solution times will decrease exponentially in the future, according to Moore’s law.) Algorithms and software for solving least-squares problems are reliable enough for embedded optimization. In many cases we can solve even larger least-squares problems, by exploiting some special structure in the coefficient matrix A. Suppose, for example, that the matrix A is sparse, which means that it has far fewer than kn nonzero entries. By exploiting sparsity, we can usually solve the least-squares problem much faster than order n2 k. A current desktop computer can solve a sparse least-squares problem 1.2 Least-squares and linear programming 5 with tens of thousands of variables, and hundreds of thousands of terms, in around a minute (although this depends on the particular sparsity pattern). For extremely large problems (say, with millions of variables), or for problems with exacting real-time computing requirements, solving a least-squares problem can be a challenge. But in the vast majority of cases, we can say that existing methods are very effective, and extremely reliable. Indeed, we can say that solving least-squares problems (that are not on the boundary of what is currently achievable) is a (mature) technology, that can be reliably used by many people who do not know, and do not need to know, the details. Using least-squares The least-squares problem is the basis for regression analysis, optimal control, and many parameter estimation and data fitting methods. It has a number of statistical interpretations, e.g., as maximum likelihood estimation of a vector x, given linear measurements corrupted by Gaussian measurement errors. Recognizing an optimization problem as a least-squares problem is straightforward; we only need to verify that the objective is a quadratic function (and then test whether the associated quadratic form is positive semidefinite). While the basic least-squares problem has a simple fixed form, several standard techniques are used to increase its flexibility in applications. In weighted least-squares, the weighted least-squares cost k X i=1 wi (aTi x − bi )2 , where w1 , . . . , wk are positive, is minimized. (This problem is readily cast and solved as a standard least-squares problem.) Here the weights wi are chosen to reflect differing levels of concern about the sizes of the terms aTi x − bi , or simply to influence the solution. In a statistical setting, weighted least-squares arises in estimation of a vector x, given linear measurements corrupted by errors with unequal variances. Another technique in least-squares is regularization, in which extra terms are added to the cost function. In the simplest case, a positive multiple of the sum of squares of the variables is added to the cost function: k X i=1 (aTi x − bi )2 + ρ n X x2i , i=1 where ρ > 0. (This problem too can be formulated as a standard least-squares problem.) The extra terms penalize large values of x, and result in a sensible solution in cases when minimizing the first sum only does not. The parameter ρ is chosen byPthe user to give the right trade-off between Pn making the original objective k function i=1 (aTi x − bi )2 small, while keeping i=1 x2i not too big. Regularization comes up in statistical estimation when the vector x to be estimated is given a prior distribution. Weighted least-squares and regularization are covered in chapter 6; their statistical interpretations are given in chapter 7. 6 1.2.2 1 Introduction Linear programming Another important class of optimization problems is linear programming, in which the objective and all constraint functions are linear: minimize subject to cT x aTi x ≤ bi , i = 1, . . . , m. (1.5) Here the vectors c, a1 , . . . , am ∈ Rn and scalars b1 , . . . , bm ∈ R are problem parameters that specify the objective and constraint functions. Solving linear programs There is no simple analytical formula for the solution of a linear program (as there is for a least-squares problem), but there are a variety of very effective methods for solving them, including Dantzig’s simplex method, and the more recent interiorpoint methods described later in this book. While we cannot give the exact number of arithmetic operations required to solve a linear program (as we can for leastsquares), we can establish rigorous bounds on the number of operations required to solve a linear program, to a given accuracy, using an interior-point method. The complexity in practice is order n2 m (assuming m ≥ n) but with a constant that is less well characterized than for least-squares. These algorithms are quite reliable, although perhaps not quite as reliable as methods for least-squares. We can easily solve problems with hundreds of variables and thousands of constraints on a small desktop computer, in a matter of seconds. If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints. As with least-squares problems, it is still a challenge to solve extremely large linear programs, or to solve linear programs with exacting real-time computing requirements. But, like least-squares, we can say that solving (most) linear programs is a mature technology. Linear programming solvers can be (and are) embedded in many tools and applications. Using linear programming Some applications lead directly to linear programs in the form (1.5), or one of several other standard forms. In many other cases the original optimization problem does not have a standard linear program form, but can be transformed to an equivalent linear program (and then, of course, solved) using techniques covered in detail in chapter 4. As a simple example, consider the Chebyshev approximation problem: minimize maxi=1,...,k |aTi x − bi |. (1.6) Here x ∈ Rn is the variable, and a1 , . . . , ak ∈ Rn , b1 , . . . , bk ∈ R are parameters that specify the problem instance. Note the resemblance to the least-squares problem (1.4). For both problems, the objective is a measure of the size of the terms aTi x − bi . In least-squares, we use the sum of squares of the terms as objective, whereas in Chebyshev approximation, we use the maximum of the absolute values. 1.3 Convex optimization 7 One other important distinction is that the objective function in the Chebyshev approximation problem (1.6) is not differentiable; the objective in the least-squares problem (1.4) is quadratic, and therefore differentiable. The Chebyshev approximation problem (1.6) can be solved by solving the linear program minimize t subject to aTi x − t ≤ bi , i = 1, . . . , k (1.7) −aTi x − t ≤ −bi , i = 1, . . . , k, with variables x ∈ Rn and t ∈ R. (The details will be given in chapter 6.) Since linear programs are readily solved, the Chebyshev approximation problem is therefore readily solved. Anyone with a working knowledge of linear programming would recognize the Chebyshev approximation problem (1.6) as one that can be reduced to a linear program. For those without this background, though, it might not be obvious that the Chebyshev approximation problem (1.6), with its nondifferentiable objective, can be formulated and solved as a linear program. While recognizing problems that can be reduced to linear programs is more involved than recognizing a least-squares problem, it is a skill that is readily acquired, since only a few standard tricks are used. The task can even be partially automated; some software systems for specifying and solving optimization problems can automatically recognize (some) problems that can be reformulated as linear programs. 1.3 Convex optimization A convex optimization problem is one of the form minimize subject to f0 (x) fi (x) ≤ bi , i = 1, . . . , m, (1.8) where the functions f0 , . . . , fm : Rn → R are convex, i.e., satisfy fi (αx + βy) ≤ αfi (x) + βfi (y) for all x, y ∈ Rn and all α, β ∈ R with α + β = 1, α ≥ 0, β ≥ 0. The least-squares problem (1.4) and linear programming problem (1.5) are both special cases of the general convex optimization problem (1.8). 1.3.1 Solving convex optimization problems There is in general no analytical formula for the solution of convex optimization problems, but (as with linear programming problems) there are very effective methods for solving them. Interior-point methods work very well in practice, and in some cases can be proved to solve the problem to a specified accuracy with a number of 8 1 Introduction operations that does not exceed a polynomial of the problem dimensions. (This is covered in chapter 11.) We will see that interior-point methods can solve the problem (1.8) in a number of steps or iterations that is almost always in the range between 10 and 100. Ignoring any structure in the problem (such as sparsity), each step requires on the order of max{n3 , n2 m, F } operations, where F is the cost of evaluating the first and second derivatives of the objective and constraint functions f0 , . . . , fm . Like methods for solving linear programs, these interior-point methods are quite reliable. We can easily solve problems with hundreds of variables and thousands of constraints on a current desktop computer, in at most a few tens of seconds. By exploiting problem structure (such as sparsity), we can solve far larger problems, with many thousands of variables and constraints. We cannot yet claim that solving general convex optimization problems is a mature technology, like solving least-squares or linear programming problems. Research on interior-point methods for general nonlinear convex optimization is still a very active research area, and no consensus has emerged yet as to what the best method or methods are. But it is reasonable to expect that solving general convex optimization problems will become a technology within a few years. And for some subclasses of convex optimization problems, for example second-order cone programming or geometric programming (studied in detail in chapter 4), it is fair to say that interior-point methods are approaching a technology. 1.3.2 Using convex optimization Using convex optimization is, at least conceptually, very much like using leastsquares or linear programming. If we can formulate a problem as a convex optimization problem, then we can solve it efficiently, just as we can solve a least-squares problem efficiently. With only a bit of exaggeration, we can say that, if you formulate a practical problem as a convex optimization problem, then you have solved the original problem. There are also some important differences. Recognizing a least-squares problem is straightforward, but recognizing a convex function can be difficult. In addition, there are many more tricks for transforming convex problems than for transforming linear programs. Recognizing convex optimization problems, or those that can be transformed to convex optimization problems, can therefore be challenging. The main goal of this book is to give the reader the background needed to do this. Once the skill of recognizing or formulating convex optimization problems is developed, you will find that surprisingly many problems can be solved via convex optimization. The challenge, and art, in using convex optimization is in recognizing and formulating the problem. Once this formulation is done, solving the problem is, like least-squares or linear programming, (almost) technology. 1.4 1.4 Nonlinear optimization Nonlinear optimization Nonlinear optimization (or nonlinear programming) is the term used to describe an optimization problem when the objective or constraint functions are not linear, but not known to be convex. Sadly, there are no effective methods for solving the general nonlinear programming problem (1.1). Even simple looking problems with as few as ten variables can be extremely challenging, while problems with a few hundreds of variables can be intractable. Methods for the general nonlinear programming problem therefore take several different approaches, each of which involves some compromise. 1.4.1 Local optimization In local optimization, the compromise is to give up seeking the optimal x, which minimizes the objective over all feasible points. Instead we seek a point that is only locally optimal, which means that it minimizes the objective function among feasible points that are near it, but is not guaranteed to have a lower objective value than all other feasible points. A large fraction of the research on general nonlinear programming has focused on methods for local optimization, which as a consequence are well developed. Local optimization methods can be fast, can handle large-scale problems, and are widely applicable, since they only require differentiability of the objective and constraint functions. As a result, local optimization methods are widely used in applications where there is value in finding a good point, if not the very best. In an engineering design application, for example, local optimization can be used to improve the performance of a design originally obtained by manual, or other, design methods. There are several disadvantages of local optimization methods, beyond (possibly) not finding the true, globally optimal solution. The methods require an initial guess for the optimization variable. This initial guess or starting point is critical, and can greatly affect the objective value of the local solution obtained. Little information is provided about how far from (globally) optimal the local solution is. Local optimization methods are often sensitive to algorithm parameter values, which may need to be adjusted for a particular problem, or family of problems. Using a local optimization method is trickier than solving a least-squares problem, linear program, or convex optimization problem. It involves experimenting with the choice of algorithm, adjusting algorithm parameters, and finding a good enough initial guess (when one instance is to be solved) or a method for producing a good enough initial guess (when a family of problems is to be solved). Roughly speaking, local optimization methods are more art than technology. Local optimization is a well developed art, and often very effective, but it is nevertheless an art. In contrast, there is little art involved in solving a least-squares problem or a linear program (except, of course, those on the boundary of what is currently possible). An interesting comparison can be made between local optimization methods for nonlinear programming, and convex optimization. Since differentiability of the ob- 9 10 1 Introduction jective and constraint functions is the only requirement for most local optimization methods, formulating a practical problem as a nonlinear optimization problem is relatively straightforward. The art in local optimization is in solving the problem (in the weakened sense of finding a locally optimal point), once it is formulated. In convex optimization these are reversed: The art and challenge is in problem formulation; once a problem is formulated as a convex optimization problem, it is relatively straightforward to solve it. 1.4.2 Global optimization In global optimization, the true global solution of the optimization problem (1.1) is found; the compromise is efficiency. The worst-case complexity of global optimization methods grows exponentially with the problem sizes n and m; the hope is that in practice, for the particular problem instances encountered, the method is far faster. While this favorable situation does occur, it is not typical. Even small problems, with a few tens of variables, can take a very long time (e.g., hours or days) to solve. Global optimization is used for problems with a small number of variables, where computing time is not critical, and the value of finding the true global solution is very high. One example from engineering design is worst-case analysis or verification of a high value or safety-critical system. Here the variables represent uncertain parameters, that can vary during manufacturing, or with the environment or operating condition. The objective function is a utility function, i.e., one for which smaller values are worse than larger values, and the constraints represent prior knowledge about the possible parameter values. The optimization problem (1.1) is the problem of finding the worst-case values of the parameters. If the worst-case value is acceptable, we can certify the system as safe or reliable (with respect to the parameter variations). A local optimization method can rapidly find a set of parameter values that is bad, but not guaranteed to be the absolute worst possible. If a local optimization method finds parameter values that yield unacceptable performance, it has succeeded in determining that the system is not reliable. But a local optimization method cannot certify the system as reliable; it can only fail to find bad parameter values. A global optimization method, in contrast, will find the absolute worst values of the parameters, and if the associated performance is acceptable, can certify the system as safe. The cost is computation time, which can be very large, even for a relatively small number of parameters. But it may be worth it in cases where the value of certifying the performance is high, or the cost of being wrong about the reliability or safety is high. 1.4.3 Role of convex optimization in nonconvex problems In this book we focus primarily on convex optimization problems, and applications that can be reduced to convex optimization problems. But convex optimization also plays an important role in problems that are not convex. 1.5 Outline Initialization for local optimization One obvious use is to combine convex optimization with a local optimization method. Starting with a nonconvex problem, we first find an approximate, but convex, formulation of the problem. By solving this approximate problem, which can be done easily and without an initial guess, we obtain the exact solution to the approximate convex problem. This point is then used as the starting point for a local optimization method, applied to the original nonconvex problem. Convex heuristics for nonconvex optimization Convex optimization is the basis for several heuristics for solving nonconvex problems. One interesting example we will see is the problem of finding a sparse vector x (i.e., one with few nonzero entries) that satisfies some constraints. While this is a difficult combinatorial problem, there are some simple heuristics, based on convex optimization, that often find fairly sparse solutions. (These are described in chapter 6.) Another broad example is given by randomized algorithms, in which an approximate solution to a nonconvex problem is found by drawing some number of candidates from a probability distribution, and taking the best one found as the approximate solution. Now suppose the family of distributions from which we will draw the candidates is parametrized, e.g., by its mean and covariance. We can then pose the question, which of these distributions gives us the smallest expected value of the objective? It turns out that this problem is sometimes a convex problem, and therefore efficiently solved. (See, e.g., exercise 11.23.) Bounds for global optimization Many methods for global optimization require a cheaply computable lower bound on the optimal value of the nonconvex problem. Two standard methods for doing this are based on convex optimization. In relaxation, each nonconvex constraint is replaced with a looser, but convex, constraint. In Lagrangian relaxation, the Lagrangian dual problem (described in chapter 5) is solved. This problem is convex, and provides a lower bound on the optimal value of the nonconvex problem. 1.5 Outline The book is divided into three main parts, titled Theory, Applications, and Algorithms. 1.5.1 Part I: Theory In part I, Theory, we cover basic definitions, concepts, and results from convex analysis and convex optimization. We make no attempt to be encyclopedic, and skew our selection of topics toward those that we think are useful in recognizing 11 12 1 Introduction and formulating convex optimization problems. This is classical material, almost all of which can be found in other texts on convex analysis and optimization. We make no attempt to give the most general form of the results; for that the reader can refer to any of the standard texts on convex analysis. Chapters 2 and 3 cover convex sets and convex functions, respectively. We give some common examples of convex sets and functions, as well as a number of convex calculus rules, i.e., operations on sets and functions that preserve convexity. Combining the basic examples with the convex calculus rules allows us to form (or perhaps more importantly, recognize) some fairly complicated convex sets and functions. In chapter 4, Convex optimization problems, we give a careful treatment of optimization problems, and describe a number of transformations that can be used to reformulate problems. We also introduce some common subclasses of convex optimization, such as linear programming and geometric programming, and the more recently developed second-order cone programming and semidefinite programming. Chapter 5 covers Lagrangian duality, which plays a central role in convex optimization. Here we give the classical Karush-Kuhn-Tucker conditions for optimality, and a local and global sensitivity analysis for convex optimization problems. 1.5.2 Part II: Applications In part II, Applications, we describe a variety of applications of convex optimization, in areas like probability and statistics, computational geometry, and data fitting. We have described these applications in a way that is accessible, we hope, to a broad audience. To keep each application short, we consider only simple cases, sometimes adding comments about possible extensions. We are sure that our treatment of some of the applications will cause experts to cringe, and we apologize to them in advance. But our goal is to convey the flavor of the application, quickly and to a broad audience, and not to give an elegant, theoretically sound, or complete treatment. Our own backgrounds are in electrical engineering, in areas like control systems, signal processing, and circuit analysis and design. Although we include these topics in the courses we teach (using this book as the main text), only a few of these applications are broadly enough accessible to be included here. The aim of part II is to show the reader, by example, how convex optimization can be applied in practice. 1.5.3 Part III: Algorithms In part III, Algorithms, we describe numerical methods for solving convex optimization problems, focusing on Newton’s algorithm and interior-point methods. Part III is organized as three chapters, which cover unconstrained optimization, equality constrained optimization, and inequality constrained optimization, respectively. These chapters follow a natural hierarchy, in which solving a problem is reduced to solving a sequence of simpler problems. Quadratic optimization problems (including, e.g., least-squares) form the base of the hierarchy; they can be 1.5 Outline solved exactly by solving a set of linear equations. Newton’s method, developed in chapters 9 and 10, is the next level in the hierarchy. In Newton’s method, solving an unconstrained or equality constrained problem is reduced to solving a sequence of quadratic problems. In chapter 11, we describe interior-point methods, which form the top level of the hierarchy. These methods solve an inequality constrained problem by solving a sequence of unconstrained, or equality constrained, problems. Overall we cover just a handful of algorithms, and omit entire classes of good methods, such as quasi-Newton, conjugate-gradient, bundle, and cutting-plane algorithms. For the methods we do describe, we give simplified variants, and not the latest, most sophisticated versions. Our choice of algorithms was guided by several criteria. We chose algorithms that are simple (to describe and implement), but also reliable and robust, and effective and fast enough for most problems. Many users of convex optimization end up using (but not developing) standard software, such as a linear or semidefinite programming solver. For these users, the material in part III is meant to convey the basic flavor of the methods, and give some ideas of their basic attributes. For those few who will end up developing new algorithms, we think that part III serves as a good introduction. 1.5.4 Appendices There are three appendices. The first lists some basic facts from mathematics that we use, and serves the secondary purpose of setting out our notation. The second appendix covers a fairly particular topic, optimization problems with quadratic objective and one quadratic constraint. These are nonconvex problems that nevertheless can be effectively solved, and we use the results in several of the applications described in part II. The final appendix gives a brief introduction to numerical linear algebra, concentrating on methods that can exploit problem structure, such as sparsity, to gain efficiency. We do not cover a number of important topics, including roundoff analysis, or give any details of the methods used to carry out the required factorizations. These topics are covered by a number of excellent texts. 1.5.5 Comments on examples In many places in the text (but particularly in parts II and III, which cover applications and algorithms, respectively) we illustrate ideas using specific examples. In some cases, the examples are chosen (or designed) specifically to illustrate our point; in other cases, the examples are chosen to be ‘typical’. This means that the examples were chosen as samples from some obvious or simple probability distribution. The dangers of drawing conclusions about algorithm performance from a few tens or hundreds of randomly generated examples are well known, so we will not repeat them here. These examples are meant only to give a rough idea of algorithm performance, or a rough idea of how the computational effort varies with problem dimensions, and not as accurate predictors of algorithm performance. In particular, your results may vary from ours. 13 14 1.5.6 1 Introduction Comments on exercises Each chapter concludes with a set of exercises. Some involve working out the details of an argument or claim made in the text. Others focus on determining, or establishing, convexity of some given sets, functions, or problems; or more generally, convex optimization problem formulation. Some chapters include numerical exercises, which require some (but not much) programming in an appropriate high level language. The difficulty level of the exercises is mixed, and varies without warning from quite straightforward to rather tricky. 1.6 Notation Our notation is more or less standard, with a few exceptions. In this section we describe our basic notation; a more complete list appears on page 697. We use R to denote the set of real numbers, R+ to denote the set of nonnegative real numbers, and R++ to denote the set of positive real numbers. The set of real n-vectors is denoted Rn , and the set of real m × n matrices is denoted Rm×n . We delimit vectors and matrices with square brackets, with the components separated by space. We use parentheses to construct column vectors from comma separated lists. For example, if a, b, c ∈ R, we have a (a, b, c) = b = [ a b c ]T , c which is an element of R3 . The symbol 1 denotes a vector all of whose components are one (with dimension determined from context). The notation xi can refer to the ith component of the vector x, or to the ith element of a set or sequence of vectors x1 , x2 , . . .. The context, or the text, makes it clear which is meant. We use Sk to denote the set of symmetric k × k matrices, Sk+ to denote the set of symmetric positive semidefinite k × k matrices, and Sk++ to denote the set of symmetric positive definite k × k matrices. The curled inequality symbol (and its strict form ≻) is used to denote generalized inequality: between vectors, it represents componentwise inequality; between symmetric matrices, it represents matrix inequality. With a subscript, the symbol K (or ≺K ) denotes generalized inequality with respect to the cone K (explained in §2.4.1). Our notation for describing functions deviates a bit from standard notation, but we hope it will cause no confusion. We use the notation f : Rp → Rq to mean that f is an Rq -valued function on some subset of Rp , specifically, its domain, which we denote dom f . We can think of our use of the notation f : Rp → Rq as a declaration of the function type, as in a computer language: f : Rp → Rq means that the function f takes as argument a real p-vector, and returns a real q-vector. The set dom f , the domain of the function f , specifies the subset of Rp of points x for which f (x) is defined. As an example, we describe the logarithm function as log : R → R, with dom log = R++ . The notation log : R → R means that 1.6 Notation the logarithm function accepts and returns a real number; dom log = R++ means that the logarithm is defined only for positive numbers. We use Rn as a generic finite-dimensional vector space. We will encounter several other finite-dimensional vector spaces, e.g., the space of polynomials of a variable with a given maximum degree, or the space Sk of symmetric k×k matrices. By identifying a basis for a vector space, we can always identify it with Rn (where n is its dimension), and therefore the generic results, stated for the vector space Rn , can be applied. We usually leave it to the reader to translate general results or statements to other vector spaces. For example, any linear function f : Rn → R can be represented in the form f (x) = cT x, where c ∈ Rn . The corresponding statement for the vector space Sk can be found by choosing a basis and translating. This results in the statement: any linear function f : Sk → R can be represented in the form f (X) = tr(CX), where C ∈ Sk . 15 16 1 Introduction Bibliography Least-squares is a very old subject; see, for example, the treatise written (in Latin) by Gauss in the 1820s, and recently translated by Stewart [Gau95]. More recent work includes the books by Lawson and Hanson [LH95] and Björck [Bjö96]. References on linear programming can be found in chapter 4. There are many good texts on local methods for nonlinear programming, including Gill, Murray, and Wright [GMW81], Nocedal and Wright [NW99], Luenberger [Lue84], and Bertsekas [Ber99]. Global optimization is covered in the books by Horst and Pardalos [HP94], Pinter [Pin95], and Tuy [Tuy98]. Using convex optimization to find bounds for nonconvex problems is an active research topic, and addressed in the books above on global optimization, the book by Ben-Tal and Nemirovski [BTN01, §4.3], and the survey by Nesterov, Wolkowicz, and Ye [NWY00]. Some notable papers on this subject are Goemans and Williamson [GW95], Nesterov [Nes00, Nes98], Ye [Ye99], and Parrilo [Par03]. Randomized methods are discussed in Motwani and Raghavan [MR95]. Convex analysis, the mathematics of convex sets, functions, and optimization problems, is a well developed subfield of mathematics. Basic references include the books by Rockafellar [Roc70], Hiriart-Urruty and Lemaréchal [HUL93, HUL01], Borwein and Lewis [BL00], and Bertsekas, Nedić, and Ozdaglar [Ber03]. More references on convex analysis can be found in chapters 2–5. Nesterov and Nemirovski [NN94] were the first to point out that interior-point methods can solve many convex optimization problems; see also the references in chapter 11. The book by Ben-Tal and Nemirovski [BTN01] covers modern convex optimization, interiorpoint methods, and applications. Solution methods for convex optimization that we do not cover in this book include subgradient methods [Sho85], bundle methods [HUL93], cutting-plane methods [Kel60, EM75, GLY96], and the ellipsoid method [Sho91, BGT81]. The idea that convex optimization problems are tractable is not new. It has long been recognized that the theory of convex optimization is far more straightforward (and complete) than the theory of general nonlinear optimization. In this context Rockafellar stated, in his 1993 SIAM Review survey paper [Roc93], In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity. The first formal argument that convex optimization problems are easier to solve than general nonlinear optimization problems was made by Nemirovski and Yudin, in their 1983 book Problem Complexity and Method Efficiency in Optimization [NY83]. They showed that the information-based complexity of convex optimization problems is far lower than that of general nonlinear optimization problems. A more recent book on this topic is Vavasis [Vav91]. The low (theoretical) complexity of interior-point methods is integral to modern research in this area. Much of the research focuses on proving that an interior-point (or other) method can solve some class of convex optimization problems with a number of operations that grows no faster than a polynomial of the problem dimensions and log(1/ǫ), where ǫ > 0 is the required accuracy. (We will see some simple results like these in chapter 11.) The first comprehensive work on this topic is the book by Nesterov and Nemirovski [NN94]. Other books include Ben-Tal and Nemirovski [BTN01, lecture 5] and Renegar [Ren01]. The polynomial-time complexity of interior-point methods for various convex optimization problems is in marked contrast to the situation for a number of nonconvex optimization problems, for which all known algorithms require, in the worst case, a number of operations that is exponential in the problem dimensions. Bibliography Convex optimization has been used in many applications areas, too numerous to cite here. Convex analysis is central in economics and finance, where it is the basis of many results. For example the separating hyperplane theorem, together with a no-arbitrage assumption, is used to deduce the existence of prices and risk-neutral probabilities (see, e.g., Luenberger [Lue95, Lue98] and Ross [Ros99]). Convex optimization, especially our ability to solve semidefinite programs, has recently received particular attention in automatic control theory. Applications of convex optimization in control theory can be found in the books by Boyd and Barratt [BB91], Boyd, El Ghaoui, Feron, and Balakrishnan [BEFB94], Dahleh and Diaz-Bobillo [DDB95], El Ghaoui and Niculescu [EN00], and Dullerud and Paganini [DP00]. A good example of embedded (convex) optimization is model predictive control, an automatic control technique that requires the solution of a (convex) quadratic program at each step. Model predictive control is now widely used in the chemical process control industry; see Morari and Zafirou [MZ89]. Another applications area where convex optimization (and especially, geometric programming) has a long history is electronic circuit design. Research papers on this topic include Fishburn and Dunlop [FD85], Sapatnekar, Rao, Vaidya, and Kang [SRVK93], and Hershenson, Boyd, and Lee [HBL01]. Luo [Luo03] gives a survey of applications in signal processing and communications. More references on applications of convex optimization can be found in chapters 4 and 6–8. High quality implementations of recent interior-point methods for convex optimization problems are available in the LOQO [Van97] and MOSEK [MOS02] software packages, and the codes listed in chapter 11. Software systems for specifying optimization problems include AMPL [FGK99] and GAMS [BKMR98]. Both provide some support for recognizing problems that can be transformed to linear programs. 17 Part I Theory Chapter 2 Convex sets 2.1 Affine and convex sets 2.1.1 Lines and line segments Suppose x1 6= x2 are two points in Rn . Points of the form y = θx1 + (1 − θ)x2 , where θ ∈ R, form the line passing through x1 and x2 . The parameter value θ = 0 corresponds to y = x2 , and the parameter value θ = 1 corresponds to y = x1 . Values of the parameter θ between 0 and 1 correspond to the (closed) line segment between x1 and x2 . Expressing y in the form y = x2 + θ(x1 − x2 ) gives another interpretation: y is the sum of the base point x2 (corresponding to θ = 0) and the direction x1 − x2 (which points from x2 to x1 ) scaled by the parameter θ. Thus, θ gives the fraction of the way from x2 to x1 where y lies. As θ increases from 0 to 1, the point y moves from x2 to x1 ; for θ > 1, the point y lies on the line beyond x1 . This is illustrated in figure 2.1. 2.1.2 Affine sets A set C ⊆ Rn is affine if the line through any two distinct points in C lies in C, i.e., if for any x1 , x2 ∈ C and θ ∈ R, we have θx1 + (1 − θ)x2 ∈ C. In other words, C contains the linear combination of any two points in C, provided the coefficients in the linear combination sum to one. This idea can be generalized to more than two points. We refer to a point of the form θ1 x1 + · · · + θk xk , where θ1 + · · · + θk = 1, as an affine combination of the points x1 , . . . , xk . Using induction from the definition of affine set (i.e., that it contains every affine combination of two points in it), it can be shown that 22 2 θ = 1.2 Convex sets x1 θ=1 θ = 0.6 x2 θ=0 θ = −0.2 Figure 2.1 The line passing through x1 and x2 is described parametrically by θx1 + (1 − θ)x2 , where θ varies over R. The line segment between x1 and x2 , which corresponds to θ between 0 and 1, is shown darker. an affine set contains every affine combination of its points: If C is an affine set, x1 , . . . , xk ∈ C, and θ1 + · · · + θk = 1, then the point θ1 x1 + · · · + θk xk also belongs to C. If C is an affine set and x0 ∈ C, then the set V = C − x0 = {x − x0 | x ∈ C} is a subspace, i.e., closed under sums and scalar multiplication. To see this, suppose v1 , v2 ∈ V and α, β ∈ R. Then we have v1 + x0 ∈ C and v2 + x0 ∈ C, and so αv1 + βv2 + x0 = α(v1 + x0 ) + β(v2 + x0 ) + (1 − α − β)x0 ∈ C, since C is affine, and α + β + (1 − α − β) = 1. We conclude that αv1 + βv2 ∈ V , since αv1 + βv2 + x0 ∈ C. Thus, the affine set C can be expressed as C = V + x0 = {v + x0 | v ∈ V }, i.e., as a subspace plus an offset. The subspace V associated with the affine set C does not depend on the choice of x0 , so x0 can be chosen as any point in C. We define the dimension of an affine set C as the dimension of the subspace V = C −x0 , where x0 is any element of C. Example 2.1 Solution set of linear equations. The solution set of a system of linear equations, C = {x | Ax = b}, where A ∈ Rm×n and b ∈ Rm , is an affine set. To show this, suppose x1 , x2 ∈ C, i.e., Ax1 = b, Ax2 = b. Then for any θ, we have A(θx1 + (1 − θ)x2 ) = = = θAx1 + (1 − θ)Ax2 θb + (1 − θ)b b, which shows that the affine combination θx1 + (1 − θ)x2 is also in C. The subspace associated with the affine set C is the nullspace of A. We also have a converse: every affine set can be expressed as the solution set of a system of linear equations. 2.1 Affine and convex sets 23 The set of all affine combinations of points in some set C ⊆ Rn is called the affine hull of C, and denoted aff C: aff C = {θ1 x1 + · · · + θk xk | x1 , . . . , xk ∈ C, θ1 + · · · + θk = 1}. The affine hull is the smallest affine set that contains C, in the following sense: if S is any affine set with C ⊆ S, then aff C ⊆ S. 2.1.3 Affine dimension and relative interior We define the affine dimension of a set C as the dimension of its affine hull. Affine dimension is useful in the context of convex analysis and optimization, but is not always consistent with other definitions of dimension. As an example consider the unit circle in R2 , i.e., {x ∈ R2 | x21 + x22 = 1}. Its affine hull is all of R2 , so its affine dimension is two. By most definitions of dimension, however, the unit circle in R2 has dimension one. If the affine dimension of a set C ⊆ Rn is less than n, then the set lies in the affine set aff C 6= Rn . We define the relative interior of the set C, denoted relint C, as its interior relative to aff C: relint C = {x ∈ C | B(x, r) ∩ aff C ⊆ C for some r > 0}, where B(x, r) = {y | ky − xk ≤ r}, the ball of radius r and center x in the norm k · k. (Here k · k is any norm; all norms define the same relative interior.) We can then define the relative boundary of a set C as cl C \ relint C, where cl C is the closure of C. Example 2.2 Consider a square in the (x1 , x2 )-plane in R3 , defined as C = {x ∈ R3 | − 1 ≤ x1 ≤ 1, −1 ≤ x2 ≤ 1, x3 = 0}. Its affine hull is the (x1 , x2 )-plane, i.e., aff C = {x ∈ R3 | x3 = 0}. The interior of C is empty, but the relative interior is relint C = {x ∈ R3 | − 1 < x1 < 1, −1 < x2 < 1, x3 = 0}. Its boundary (in R3 ) is itself; its relative boundary is the wire-frame outline, {x ∈ R3 | max{|x1 |, |x2 |} = 1, x3 = 0}. 2.1.4 Convex sets A set C is convex if the line segment between any two points in C lies in C, i.e., if for any x1 , x2 ∈ C and any θ with 0 ≤ θ ≤ 1, we have θx1 + (1 − θ)x2 ∈ C. 24 2 Convex sets Figure 2.2 Some simple convex and nonconvex sets. Left. The hexagon, which includes its boundary (shown darker), is convex. Middle. The kidney shaped set is not convex, since the line segment between the two points in the set shown as dots is not contained in the set. Right. The square contains some boundary points but not others, and is not convex. Figure 2.3 The convex hulls of two sets in R2 . Left. The convex hull of a set of fifteen points (shown as dots) is the pentagon (shown shaded). Right. The convex hull of the kidney shaped set in figure 2.2 is the shaded set. Roughly speaking, a set is convex if every point in the set can be seen by every other point, along an unobstructed straight path between them, where unobstructed means lying in the set. Every affine set is also convex, since it contains the entire line between any two distinct points in it, and therefore also the line segment between the points. Figure 2.2 illustrates some simple convex and nonconvex sets in R2 . We call a point of the form θ1 x1 + · · · + θk xk , where θ1 + · · · + θk = 1 and θi ≥ 0, i = 1, . . . , k, a convex combination of the points x1 , . . . , xk . As with affine sets, it can be shown that a set is convex if and only if it contains every convex combination of its points. A convex combination of points can be thought of as a mixture or weighted average of the points, with θi the fraction of xi in the mixture. The convex hull of a set C, denoted conv C, is the set of all convex combinations of points in C: conv C = {θ1 x1 + · · · + θk xk | xi ∈ C, θi ≥ 0, i = 1, . . . , k, θ1 + · · · + θk = 1}. As the name suggests, the convex hull conv C is always convex. It is the smallest convex set that contains C: If B is any convex set that contains C, then conv C ⊆ B. Figure 2.3 illustrates the definition of convex hull. The idea of a convex combination can be generalized to include infinite sums, integrals, and, in the most general form, probability distributions. Suppose θ1 , θ2 , . . . 2.1 Affine and convex sets 25 satisfy θi ≥ 0, i = 1, 2, . . . , ∞ X θi = 1, i=1 and x1 , x2 , . . . ∈ C, where C ⊆ Rn is convex. Then ∞ X i=1 θi xi ∈ C, if the series converges. More generally, suppose p : Rn → R satisfies p(x) ≥ 0 for R all x ∈ C and C p(x) dx = 1, where C ⊆ Rn is convex. Then Z C p(x)x dx ∈ C, if the integral exists. In the most general form, suppose C ⊆ Rn is convex and x is a random vector with x ∈ C with probability one. Then E x ∈ C. Indeed, this form includes all the others as special cases. For example, suppose the random variable x only takes on the two values x1 and x2 , with prob(x = x1 ) = θ and prob(x = x2 ) = 1 − θ, where 0 ≤ θ ≤ 1. Then E x = θx1 + (1 − θ)x2 , and we are back to a simple convex combination of two points. 2.1.5 Cones A set C is called a cone, or nonnegative homogeneous, if for every x ∈ C and θ ≥ 0 we have θx ∈ C. A set C is a convex cone if it is convex and a cone, which means that for any x1 , x2 ∈ C and θ1 , θ2 ≥ 0, we have θ1 x1 + θ2 x2 ∈ C. Points of this form can be described geometrically as forming the two-dimensional pie slice with apex 0 and edges passing through x1 and x2 . (See figure 2.4.) A point of the form θ1 x1 + · · · + θk xk with θ1 , . . . , θk ≥ 0 is called a conic combination (or a nonnegative linear combination) of x1 , . . . , xk . If xi are in a convex cone C, then every conic combination of xi is in C. Conversely, a set C is a convex cone if and only if it contains all conic combinations of its elements. Like convex (or affine) combinations, the idea of conic combination can be generalized to infinite sums and integrals. The conic hull of a set C is the set of all conic combinations of points in C, i.e., {θ1 x1 + · · · + θk xk | xi ∈ C, θi ≥ 0, i = 1, . . . , k}, which is also the smallest convex cone that contains C (see figure 2.5). 26 2 Convex sets x1 x2 0 Figure 2.4 The pie slice shows all points of the form θ1 x1 + θ2 x2 , where θ1 , θ2 ≥ 0. The apex of the slice (which corresponds to θ1 = θ2 = 0) is at 0; its edges (which correspond to θ1 = 0 or θ2 = 0) pass through the points x1 and x2 . 0 0 Figure 2.5 The conic hulls (shown shaded) of the two sets of figure 2.3. 2.2 2.2 Some important examples 27 Some important examples In this section we describe some important examples of convex sets which we will encounter throughout the rest of the book. We start with some simple examples. • The empty set ∅, any single point (i.e., singleton) {x0 }, and the whole space Rn are affine (hence, convex) subsets of Rn . • Any line is affine. If it passes through zero, it is a subspace, hence also a convex cone. • A line segment is convex, but not affine (unless it reduces to a point). • A ray, which has the form {x0 + θv | θ ≥ 0}, where v 6= 0, is convex, but not affine. It is a convex cone if its base x0 is 0. • Any subspace is affine, and a convex cone (hence convex). 2.2.1 Hyperplanes and halfspaces A hyperplane is a set of the form {x | aT x = b}, where a ∈ Rn , a 6= 0, and b ∈ R. Analytically it is the solution set of a nontrivial linear equation among the components of x (and hence an affine set). Geometrically, the hyperplane {x | aT x = b} can be interpreted as the set of points with a constant inner product to a given vector a, or as a hyperplane with normal vector a; the constant b ∈ R determines the offset of the hyperplane from the origin. This geometric interpretation can be understood by expressing the hyperplane in the form {x | aT (x − x0 ) = 0}, where x0 is any point in the hyperplane (i.e., any point that satisfies aT x0 = b). This representation can in turn be expressed as {x | aT (x − x0 ) = 0} = x0 + a⊥ , where a⊥ denotes the orthogonal complement of a, i.e., the set of all vectors orthogonal to it: a⊥ = {v | aT v = 0}. This shows that the hyperplane consists of an offset x0 , plus all vectors orthogonal to the (normal) vector a. These geometric interpretations are illustrated in figure 2.6. A hyperplane divides Rn into two halfspaces. A (closed) halfspace is a set of the form {x | aT x ≤ b}, (2.1) where a 6= 0, i.e., the solution set of one (nontrivial) linear inequality. Halfspaces are convex, but not affine. This is illustrated in figure 2.7. 28 2 Convex sets a x0 x aT x = b Figure 2.6 Hyperplane in R2 , with normal vector a and a point x0 in the hyperplane. For any point x in the hyperplane, x − x0 (shown as the darker arrow) is orthogonal to a. a aT x ≥ b x0 aT x ≤ b Figure 2.7 A hyperplane defined by aT x = b in R2 determines two halfspaces. The halfspace determined by aT x ≥ b (not shaded) is the halfspace extending in the direction a. The halfspace determined by aT x ≤ b (which is shown shaded) extends in the direction −a. The vector a is the outward normal of this halfspace. 2.2 Some important examples 29 x1 a x0 x2 Figure 2.8 The shaded set is the halfspace determined by aT (x − x0 ) ≤ 0. The vector x1 −x0 makes an acute angle with a, so x1 is not in the halfspace. The vector x2 − x0 makes an obtuse angle with a, and so is in the halfspace. The halfspace (2.1) can also be expressed as {x | aT (x − x0 ) ≤ 0}, (2.2) where x0 is any point on the associated hyperplane, i.e., satisfies aT x0 = b. The representation (2.2) suggests a simple geometric interpretation: the halfspace consists of x0 plus any vector that makes an obtuse (or right) angle with the (outward normal) vector a. This is illustrated in figure 2.8. The boundary of the halfspace (2.1) is the hyperplane {x | aT x = b}. The set {x | aT x < b}, which is the interior of the halfspace {x | aT x ≤ b}, is called an open halfspace. 2.2.2 Euclidean balls and ellipsoids A (Euclidean) ball (or just ball) in Rn has the form B(xc , r) = {x | kx − xc k2 ≤ r} = {x | (x − xc )T (x − xc ) ≤ r2 }, where r > 0, and k · k2 denotes the Euclidean norm, i.e., kuk2 = (uT u)1/2 . The vector xc is the center of the ball and the scalar r is its radius; B(xc , r) consists of all points within a distance r of the center xc . Another common representation for the Euclidean ball is B(xc , r) = {xc + ru | kuk2 ≤ 1}. 30 2 Convex sets xc Figure 2.9 An ellipsoid in R2 , shown shaded. The center xc is shown as a dot, and the two semi-axes are shown as line segments. A Euclidean ball is a convex set: if kx1 − xc k2 ≤ r, kx2 − xc k2 ≤ r, and 0 ≤ θ ≤ 1, then kθx1 + (1 − θ)x2 − xc k2 = kθ(x1 − xc ) + (1 − θ)(x2 − xc )k2 ≤ θkx1 − xc k2 + (1 − θ)kx2 − xc k2 ≤ r. (Here we use the homogeneity property and triangle inequality for k·k2 ; see §A.1.2.) A related family of convex sets is the ellipsoids, which have the form E = {x | (x − xc )T P −1 (x − xc ) ≤ 1}, (2.3) E = {xc + Au | kuk2 ≤ 1}, (2.4) where P = P T ≻ 0, i.e., P is symmetric and positive definite. The vector xc ∈ Rn is the center of the ellipsoid. The matrix P determines how far the ellipsoid √ extends in every direction from xc ; the lengths of the semi-axes of E are given by λi , where λi are the eigenvalues of P . A ball is an ellipsoid with P = r2 I. Figure 2.9 shows an ellipsoid in R2 . Another common representation of an ellipsoid is where A is square and nonsingular. In this representation we can assume without loss of generality that A is symmetric and positive definite. By taking A = P 1/2 , this representation gives the ellipsoid defined in (2.3). When the matrix A in (2.4) is symmetric positive semidefinite but singular, the set in (2.4) is called a degenerate ellipsoid ; its affine dimension is equal to the rank of A. Degenerate ellipsoids are also convex. 2.2.3 Norm balls and norm cones Suppose k·k is any norm on Rn (see §A.1.2). From the general properties of norms it can be shown that a norm ball of radius r and center xc , given by {x | kx−xc k ≤ r}, is convex. The norm cone associated with the norm k · k is the set C = {(x, t) | kxk ≤ t} ⊆ Rn+1 . 2.2 Some important examples 31 t 1 0.5 0 1 1 0 0 −1 −1 x2 x1 Figure 2.10 Boundary of second-order cone in R3 , {(x1 , x2 , t) | (x21 +x22 )1/2 ≤ t}. It is (as the name suggests) a convex cone. Example 2.3 The second-order cone is the norm cone for the Euclidean norm, i.e., C = = {(x, t) ∈ Rn+1 | kxk2 ≤ t} ( x t x t T I 0 0 −1 x t ) ≤ 0, t ≥ 0 . The second-order cone is also known by several other names. It is called the quadratic cone, since it is defined by a quadratic inequality. It is also called the Lorentz cone or ice-cream cone. Figure 2.10 shows the second-order cone in R3 . 2.2.4 Polyhedra A polyhedron is defined as the solution set of a finite number of linear equalities and inequalities: P = {x | aTj x ≤ bj , j = 1, . . . , m, cTj x = dj , j = 1, . . . , p}. (2.5) A polyhedron is thus the intersection of a finite number of halfspaces and hyperplanes. Affine sets (e.g., subspaces, hyperplanes, lines), rays, line segments, and halfspaces are all polyhedra. It is easily shown that polyhedra are convex sets. A bounded polyhedron is sometimes called a polytope, but some authors use the opposite convention (i.e., polytope for any set of the form (2.5), and polyhedron 32 2 a1 Convex sets a2 P a5 a3 a4 Figure 2.11 The polyhedron P (shown shaded) is the intersection of five halfspaces, with outward normal vectors a1 , . . . . , a5 . when it is bounded). Figure 2.11 shows an example of a polyhedron defined as the intersection of five halfspaces. It will be convenient to use the compact notation P = {x | Ax b, Cx = d} for (2.5), where aT1 A = ... , aTm (2.6) cT1 C = ... , cTp and the symbol denotes vector inequality or componentwise inequality in Rm : u v means ui ≤ vi for i = 1, . . . , m. Example 2.4 The nonnegative orthant is the set of points with nonnegative components, i.e., n n Rn + = {x ∈ R | xi ≥ 0, i = 1, . . . , n} = {x ∈ R | x 0}. (Here R+ denotes the set of nonnegative numbers: R+ = {x ∈ R | x ≥ 0}.) The nonnegative orthant is a polyhedron and a cone (and therefore called a polyhedral cone). Simplexes Simplexes are another important family of polyhedra. Suppose the k + 1 points v0 , . . . , vk ∈ Rn are affinely independent, which means v1 − v0 , . . . , vk − v0 are linearly independent. The simplex determined by them is given by C = conv{v0 , . . . , vk } = {θ0 v0 + · · · + θk vk | θ 0, 1T θ = 1}, (2.7) 2.2 Some important examples 33 where 1 denotes the vector with all entries one. The affine dimension of this simplex is k, so it is sometimes referred to as a k-dimensional simplex in Rn . Example 2.5 Some common simplexes. A 1-dimensional simplex is a line segment; a 2-dimensional simplex is a triangle (including its interior); and a 3-dimensional simplex is a tetrahedron. The unit simplex is the n-dimensional simplex determined by the zero vector and the unit vectors, i.e., 0, e1 , . . . , en ∈ Rn . It can be expressed as the set of vectors that satisfy x 0, 1T x ≤ 1. The probability simplex is the (n − 1)-dimensional simplex determined by the unit vectors e1 , . . . , en ∈ Rn . It is the set of vectors that satisfy 1T x = 1. x 0, Vectors in the probability simplex correspond to probability distributions on a set with n elements, with xi interpreted as the probability of the ith element. To describe the simplex (2.7) as a polyhedron, i.e., in the form (2.6), we proceed as follows. By definition, x ∈ C if and only if x = θ0 v0 + θ1 v1 + · · · + θk vk for some θ 0 with 1T θ = 1. Equivalently, if we define y = (θ1 , . . . , θk ) and B= v1 − v0 ··· vk − v 0 we can say that x ∈ C if and only if ∈ Rn×k , x = v0 + By (2.8) for some y 0 with 1T y ≤ 1. Now we note that affine independence of the points v0 , . . . , vk implies that the matrix B has rank k. Therefore there exists a nonsingular matrix A = (A1 , A2 ) ∈ Rn×n such that AB = A1 A2 B= I 0 . Multiplying (2.8) on the left with A, we obtain A1 x = A1 v0 + y, A2 x = A2 v0 . From this we see that x ∈ C if and only if A2 x = A2 v0 , and the vector y = A1 x − A1 v0 satisfies y 0 and 1T y ≤ 1. In other words we have x ∈ C if and only if A2 x = A2 v0 , A1 x A1 v0 , 1T A1 x ≤ 1 + 1T A1 v0 , which is a set of linear equalities and inequalities in x, and so describes a polyhedron. 34 2 Convex sets Convex hull description of polyhedra The convex hull of the finite set {v1 , . . . , vk } is conv{v1 , . . . , vk } = {θ1 v1 + · · · + θk vk | θ 0, 1T θ = 1}. This set is a polyhedron, and bounded, but (except in special cases, e.g., a simplex) it is not simple to express it in the form (2.5), i.e., by a set of linear equalities and inequalities. A generalization of this convex hull description is {θ1 v1 + · · · + θk vk | θ1 + · · · + θm = 1, θi ≥ 0, i = 1, . . . , k}, (2.9) where m ≤ k. Here we consider nonnegative linear combinations of vi , but only the first m coefficients are required to sum to one. Alternatively, we can interpret (2.9) as the convex hull of the points v1 , . . . , vm , plus the conic hull of the points vm+1 , . . . , vk . The set (2.9) defines a polyhedron, and conversely, every polyhedron can be represented in this form (although we will not show this). The question of how a polyhedron is represented is subtle, and has very important practical consequences. As a simple example consider the unit ball in the ℓ∞ -norm in Rn , C = {x | |xi | ≤ 1, i = 1, . . . , n}. The set C can be described in the form (2.5) with 2n linear inequalities ±eTi x ≤ 1, where ei is the ith unit vector. To describe it in the convex hull form (2.9) requires at least 2n points: C = conv{v1 , . . . , v2n }, where v1 , . . . , v2n are the 2n vectors all of whose components are 1 or −1. Thus the size of the two descriptions differs greatly, for large n. 2.2.5 The positive semidefinite cone We use the notation Sn to denote the set of symmetric n × n matrices, Sn = {X ∈ Rn×n | X = X T }, which is a vector space with dimension n(n + 1)/2. We use the notation Sn+ to denote the set of symmetric positive semidefinite matrices: Sn+ = {X ∈ Sn | X 0}, and the notation Sn++ to denote the set of symmetric positive definite matrices: Sn++ = {X ∈ Sn | X ≻ 0}. (This notation is meant to be analogous to R+ , which denotes the nonnegative reals, and R++ , which denotes the positive reals.) 2.3 Operations that preserve convexity 35 z 1 0.5 0 1 1 0 0.5 y −1 0 x Figure 2.12 Boundary of positive semidefinite cone in S2 . The set Sn+ is a convex cone: if θ1 , θ2 ≥ 0 and A, B ∈ Sn+ , then θ1 A+θ2 B ∈ Sn+ . This can be seen directly from the definition of positive semidefiniteness: for any x ∈ Rn , we have xT (θ1 A + θ2 B)x = θ1 xT Ax + θ2 xT Bx ≥ 0, if A 0, B 0 and θ1 , θ2 ≥ 0. Example 2.6 Positive semidefinite cone in S2 . We have X= x y y z ∈ S2+ ⇐⇒ x ≥ 0, z ≥ 0, xz ≥ y 2 . The boundary of this cone is shown in figure 2.12, plotted in R3 as (x, y, z). 2.3 Operations that preserve convexity In this section we describe some operations that preserve convexity of sets, or allow us to construct convex sets from others. These operations, together with the simple examples described in §2.2, form a calculus of convex sets that is useful for determining or establishing convexity of sets. 36 2.3.1 2 Convex sets Intersection Convexity is preserved under intersection: if S1 and S2 are convex, then S1 ∩ S2 is convex. This property extends to the T intersection of an infinite number of sets: if Sα is convex for every α ∈ A, then α∈A Sα is convex. (Subspaces, affine sets, and convex cones are also closed under arbitrary intersections.) As a simple example, a polyhedron is the intersection of halfspaces and hyperplanes (which are convex), and therefore is convex. Example 2.7 The positive semidefinite cone Sn + can be expressed as \ z6=0 {X ∈ Sn | z T Xz ≥ 0}. For each z 6= 0, z T Xz is a (not identically zero) linear function of X, so the sets {X ∈ Sn | z T Xz ≥ 0} are, in fact, halfspaces in Sn . Thus the positive semidefinite cone is the intersection of an infinite number of halfspaces, and so is convex. Example 2.8 We consider the set Pm S = {x ∈ Rm | |p(t)| ≤ 1 for |t| ≤ π/3}, (2.10) where p(t) = x cos kt. TThe set S can be expressed as the intersection of an k=1 k infinite number of slabs: S = |t|≤π/3 St , where St = {x | − 1 ≤ (cos t, . . . , cos mt)T x ≤ 1}, and so is convex. The definition and the set are illustrated in figures 2.13 and 2.14, for m = 2. In the examples above we establish convexity of a set by expressing it as a (possibly infinite) intersection of halfspaces. We will see in §2.5.1 that a converse holds: every closed convex set S is a (usually infinite) intersection of halfspaces. In fact, a closed convex set S is the intersection of all halfspaces that contain it: \ S= {H | H halfspace, S ⊆ H}. 2.3.2 Affine functions Recall that a function f : Rn → Rm is affine if it is a sum of a linear function and a constant, i.e., if it has the form f (x) = Ax + b, where A ∈ Rm×n and b ∈ Rm . Suppose S ⊆ Rn is convex and f : Rn → Rm is an affine function. Then the image of S under f , f (S) = {f (x) | x ∈ S}, Operations that preserve convexity 37 p(t) 1 0 −1 π/3 0 t 2π/3 π Figure 2.13 Three trigonometric polynomials associated with points in the set S defined in (2.10), for m = 2. The trigonometric polynomial plotted with dashed line type is the average of the other two. 2 1 x2 2.3 S 0 −1 −2 −2 −1 0 x1 1 2 Figure 2.14 The set S defined in (2.10), for m = 2, is shown as the white area in the middle of the plot. The set is the intersection of an infinite number of slabs (20 of which are shown), hence convex. 38 2 Convex sets is convex. Similarly, if f : Rk → Rn is an affine function, the inverse image of S under f , f −1 (S) = {x | f (x) ∈ S}, is convex. Two simple examples are scaling and translation. If S ⊆ Rn is convex, α ∈ R, and a ∈ Rn , then the sets αS and S + a are convex, where αS = {αx | x ∈ S}, S + a = {x + a | x ∈ S}. The projection of a convex set onto some of its coordinates is convex: if S ⊆ Rm × Rn is convex, then T = {x1 ∈ Rm | (x1 , x2 ) ∈ S for some x2 ∈ Rn } is convex. The sum of two sets is defined as S1 + S2 = {x + y | x ∈ S1 , y ∈ S2 }. If S1 and S2 are convex, then S1 + S2 is convex. To see this, if S1 and S2 are convex, then so is the direct or Cartesian product S1 × S2 = {(x1 , x2 ) | x1 ∈ S1 , x2 ∈ S2 }. The image of this set under the linear function f (x1 , x2 ) = x1 + x2 is the sum S1 + S2 . We can also consider the partial sum of S1 , S2 ∈ Rn × Rm , defined as n S = {(x, y1 + y2 ) | (x, y1 ) ∈ S1 , (x, y2 ) ∈ S2 }, where x ∈ R and yi ∈ Rm . For m = 0, the partial sum gives the intersection of S1 and S2 ; for n = 0, it is set addition. Partial sums of convex sets are convex (see exercise 2.16). Example 2.9 Polyhedron. The polyhedron {x | Ax b, Cx = d} can be expressed as the inverse image of the Cartesian product of the nonnegative orthant and the origin under the affine function f (x) = (b − Ax, d − Cx): {x | Ax b, Cx = d} = {x | f (x) ∈ Rm + × {0}}. Example 2.10 Solution set of linear matrix inequality. The condition A(x) = x1 A1 + · · · + xn An B, (2.11) where B, Ai ∈ Sm , is called a linear matrix inequality (LMI) in x (Note the similarity to an ordinary linear inequality, aT x = x1 a1 + · · · + xn an ≤ b, with b, ai ∈ R.) The solution set of a linear matrix inequality, {x | A(x) B}, is convex. Indeed, it is the inverse image of the positive semidefinite cone under the affine function f : Rn → Sm given by f (x) = B − A(x). 2.3 Operations that preserve convexity Example 2.11 Hyperbolic cone. The set {x | xT P x ≤ (cT x)2 , cT x ≥ 0} n where P ∈ Sn + and c ∈ R , is convex, since it is the inverse image of the second-order cone, {(z, t) | z T z ≤ t2 , t ≥ 0}, under the affine function f (x) = (P 1/2 x, cT x). Example 2.12 Ellipsoid. The ellipsoid E = {x | (x − xc )T P −1 (x − xc ) ≤ 1}, where P ∈ Sn ++ , is the image of the unit Euclidean ball {u | kuk2 ≤ 1} under the affine mapping f (u) = P 1/2 u + xc . (It is also the inverse image of the unit ball under the affine mapping g(x) = P −1/2 (x − xc ).) 2.3.3 Linear-fractional and perspective functions In this section we explore a class of functions, called linear-fractional, that is more general than affine but still preserves convexity. The perspective function We define the perspective function P : Rn+1 → Rn , with domain dom P = Rn × R++ , as P (z, t) = z/t. (Here R++ denotes the set of positive numbers: R++ = {x ∈ R | x > 0}.) The perspective function scales or normalizes vectors so the last component is one, and then drops the last component. Remark 2.1 We can interpret the perspective function as the action of a pin-hole camera. A pin-hole camera (in R3 ) consists of an opaque horizontal plane x3 = 0, with a single pin-hole at the origin, through which light can pass, and a horizontal image plane x3 = −1. An object at x, above the camera (i.e., with x3 > 0), forms an image at the point −(x1 /x3 , x2 /x3 , 1) on the image plane. Dropping the last component of the image point (since it is always −1), the image of a point at x appears at y = −(x1 /x3 , x2 /x3 ) = −P (x) on the image plane. This is illustrated in figure 2.15. If C ⊆ dom P is convex, then its image P (C) = {P (x) | x ∈ C} is convex. This result is certainly intuitive: a convex object, viewed through a pin-hole camera, yields a convex image. To establish this fact we show that line segments are mapped to line segments under the perspective function. (This too 39 40 2 Convex sets x3 = 0 x3 = −1 Figure 2.15 Pin-hole camera interpretation of perspective function. The dark horizontal line represents the plane x3 = 0 in R3 , which is opaque, except for a pin-hole at the origin. Objects or light sources above the plane appear on the image plane x3 = −1, which is shown as the lighter horizontal line. The mapping of the position of a source to the position of its image is related to the perspective function. makes sense: a line segment, viewed through a pin-hole camera, yields a line segment image.) Suppose that x = (x̃, xn+1 ), y = (ỹ, yn+1 ) ∈ Rn+1 with xn+1 > 0, yn+1 > 0. Then for 0 ≤ θ ≤ 1, P (θx + (1 − θ)y) = where µ= θx̃ + (1 − θ)ỹ = µP (x) + (1 − µ)P (y), θxn+1 + (1 − θ)yn+1 θxn+1 ∈ [0, 1]. θxn+1 + (1 − θ)yn+1 This correspondence between θ and µ is monotonic: as θ varies between 0 and 1 (which sweeps out the line segment [x, y]), µ varies between 0 and 1 (which sweeps out the line segment [P (x), P (y)]). This shows that P ([x, y]) = [P (x), P (y)]. Now suppose C is convex with C ⊆ dom P (i.e., xn+1 > 0 for all x ∈ C), and x, y ∈ C. To establish convexity of P (C) we need to show that the line segment [P (x), P (y)] is in P (C). But this line segment is the image of the line segment [x, y] under P , and so lies in P (C). The inverse image of a convex set under the perspective function is also convex: if C ⊆ Rn is convex, then P −1 (C) = {(x, t) ∈ Rn+1 | x/t ∈ C, t > 0} is convex. To show this, suppose (x, t) ∈ P −1 (C), (y, s) ∈ P −1 (C), and 0 ≤ θ ≤ 1. We need to show that θ(x, t) + (1 − θ)(y, s) ∈ P −1 (C), i.e., that θx + (1 − θ)y ∈C θt + (1 − θ)s 2.3 Operations that preserve convexity 41 (θt + (1 − θ)s > 0 is obvious). This follows from θx + (1 − θ)y = µ(x/t) + (1 − µ)(y/s), θt + (1 − θ)s where µ= θt ∈ [0, 1]. θt + (1 − θ)s Linear-fractional functions A linear-fractional function is formed by composing the perspective function with an affine function. Suppose g : Rn → Rm+1 is affine, i.e., b A , (2.12) x+ g(x) = d cT where A ∈ Rm×n , b ∈ Rm , c ∈ Rn , and d ∈ R. The function f : Rn → Rm given by f = P ◦ g, i.e., f (x) = (Ax + b)/(cT x + d), dom f = {x | cT x + d > 0}, (2.13) is called a linear-fractional (or projective) function. If c = 0 and d > 0, the domain of f is Rn , and f is an affine function. So we can think of affine and linear functions as special cases of linear-fractional functions. Remark 2.2 Projective interpretation. It is often convenient to represent a linearfractional function as a matrix Q= A cT b d ∈ R(m+1)×(n+1) (2.14) that acts on (multiplies) points of form (x, 1), which yields (Ax + b, cT x + d). This result is then scaled or normalized so that its last component is one, which yields (f (x), 1). This representation can be interpreted geometrically by associating Rn with a set of rays in Rn+1 as follows. With each point z in Rn we associate the (open) ray P(z) = {t(z, 1) | t > 0} in Rn+1 . The last component of this ray takes on positive values. Conversely any ray in Rn+1 , with base at the origin and last component which takes on positive values, can be written as P(v) = {t(v, 1) | t ≥ 0} for some v ∈ Rn . This (projective) correspondence P between Rn and the halfspace of rays with positive last component is one-to-one and onto. The linear-fractional function (2.13) can be expressed as f (x) = P −1 (QP(x)). Thus, we start with x ∈ dom f , i.e., cT x + d > 0. We then form the ray P(x) in Rn+1 . The linear transformation with matrix Q acts on this ray to produce another ray QP(x). Since x ∈ dom f , the last component of this ray assumes positive values. Finally we take the inverse projective transformation to recover f (x). 42 2 1 0 C −1 −1 0 x1 x2 1 x2 Convex sets 1 0 −1 −1 f (C) 0 x1 1 Figure 2.16 Left. A set C ⊆ R2 . The dashed line shows the boundary of the domain of the linear-fractional function f (x) = x/(x1 + x2 + 1) with dom f = {(x1 , x2 ) | x1 + x2 + 1 > 0}. Right. Image of C under f . The dashed line shows the boundary of the domain of f −1 . Like the perspective function, linear-fractional functions preserve convexity. If C is convex and lies in the domain of f (i.e., cT x + d > 0 for x ∈ C), then its image f (C) is convex. This follows immediately from results above: the image of C under the affine mapping (2.12) is convex, and the image of the resulting set under the perspective function P , which yields f (C), is convex. Similarly, if C ⊆ Rm is convex, then the inverse image f −1 (C) is convex. Example 2.13 Conditional probabilities. Suppose u and v are random variables that take on values in {1, . . . , n} and {1, . . . , m}, respectively, and let pij denote prob(u = i, v = j). Then the conditional probability fij = prob(u = i|v = j) is given by pij . fij = Pn p k=1 kj Thus f is obtained by a linear-fractional mapping from p. It follows that if C is a convex set of joint probabilities for (u, v), then the associated set of conditional probabilities of u given v is also convex. Figure 2.16 shows a set C ⊆ R2 , and its image under the linear-fractional function f (x) = 1 x, x1 + x2 + 1 dom f = {(x1 , x2 ) | x1 + x2 + 1 > 0}. 2.4 Generalized inequalities 2.4 Generalized inequalities 2.4.1 Proper cones and generalized inequalities A cone K ⊆ Rn is called a proper cone if it satisfies the following: • K is convex. • K is closed. • K is solid, which means it has nonempty interior. • K is pointed, which means that it contains no line (or equivalently, x ∈ K, − x ∈ K =⇒ x = 0). A proper cone K can be used to define a generalized inequality, which is a partial ordering on Rn that has many of the properties of the standard ordering on R. We associate with the proper cone K the partial ordering on Rn defined by x K y ⇐⇒ y − x ∈ K. We also write x K y for y K x. Similarly, we define an associated strict partial ordering by x ≺K y ⇐⇒ y − x ∈ int K, and write x ≻K y for y ≺K x. (To distinguish the generalized inequality K from the strict generalized inequality, we sometimes refer to K as the nonstrict generalized inequality.) When K = R+ , the partial ordering K is the usual ordering ≤ on R, and the strict partial ordering ≺K is the same as the usual strict ordering < on R. So generalized inequalities include as a special case ordinary (nonstrict and strict) inequality in R. Example 2.14 Nonnegative orthant and componentwise inequality. The nonnegative orthant K = Rn + is a proper cone. The associated generalized inequality K corresponds to componentwise inequality between vectors: x K y means that xi ≤ yi , i = 1, . . . , n. The associated strict inequality corresponds to componentwise strict inequality: x ≺K y means that xi < yi , i = 1, . . . , n. The nonstrict and strict partial orderings associated with the nonnegative orthant arise so frequently that we drop the subscript Rn + ; it is understood when the symbol or ≺ appears between vectors. Example 2.15 Positive semidefinite cone and matrix inequality. The positive semidefn inite cone Sn + is a proper cone in S . The associated generalized inequality K is the usual matrix inequality: X K Y means Y − X is positive semidefinite. The inten rior of Sn + (in S ) consists of the positive definite matrices, so the strict generalized inequality also agrees with the usual strict inequality between symmetric matrices: X ≺K Y means Y − X is positive definite. Here, too, the partial ordering arises so frequently that we drop the subscript: for symmetric matrices we write simply X Y or X ≺ Y . It is understood that the generalized inequalities are with respect to the positive semidefinite cone. 43 44 2 Convex sets Example 2.16 Cone of polynomials nonnegative on [0, 1]. Let K be defined as K = {c ∈ Rn | c1 + c2 t + · · · + cn tn−1 ≥ 0 for t ∈ [0, 1]}, (2.15) i.e., K is the cone of (coefficients of) polynomials of degree n−1 that are nonnegative on the interval [0, 1]. It can be shown that K is a proper cone; its interior is the set of coefficients of polynomials that are positive on the interval [0, 1]. Two vectors c, d ∈ Rn satisfy c K d if and only if c1 + c2 t + · · · + cn tn−1 ≤ d1 + d2 t + · · · + dn tn−1 for all t ∈ [0, 1]. Properties of generalized inequalities A generalized inequality K satisfies many properties, such as • K is preserved under addition: if x K y and u K v, then x + u K y + v. • K is transitive: if x K y and y K z then x K z. • K is preserved under nonnegative scaling: if x K y and α ≥ 0 then αx K αy. • K is reflexive: x K x. • K is antisymmetric: if x K y and y K x, then x = y. • K is preserved under limits: if xi K yi for i = 1, 2, . . ., xi → x and yi → y as i → ∞, then x K y. The corresponding strict generalized inequality ≺K satisfies, for example, • if x ≺K y then x K y. • if x ≺K y and u K v then x + u ≺K y + v. • if x ≺K y and α > 0 then αx ≺K αy. • x 6≺K x. • if x ≺K y, then for u and v small enough, x + u ≺K y + v. These properties are inherited from the definitions of K and ≺K , and the properties of proper cones; see exercise 2.30. 2.4 2.4.2 Generalized inequalities 45 Minimum and minimal elements The notation of generalized inequality (i.e., K , ≺K ) is meant to suggest the analogy to ordinary inequality on R (i.e., ≤, <). While many properties of ordinary inequality do hold for generalized inequalities, some important ones do not. The most obvious difference is that ≤ on R is a linear ordering: any two points are comparable, meaning either x ≤ y or y ≤ x. This property does not hold for other generalized inequalities. One implication is that concepts like minimum and maximum are more complicated in the context of generalized inequalities. We briefly discuss this in this section. We say that x ∈ S is the minimum element of S (with respect to the generalized inequality K ) if for every y ∈ S we have x K y. We define the maximum element of a set S, with respect to a generalized inequality, in a similar way. If a set has a minimum (maximum) element, then it is unique. A related concept is minimal element. We say that x ∈ S is a minimal element of S (with respect to the generalized inequality K ) if y ∈ S, y K x only if y = x. We define maximal element in a similar way. A set can have many different minimal (maximal) elements. We can describe minimum and minimal elements using simple set notation. A point x ∈ S is the minimum element of S if and only if S ⊆ x + K. Here x + K denotes all the points that are comparable to x and greater than or equal to x (according to K ). A point x ∈ S is a minimal element if and only if (x − K) ∩ S = {x}. Here x − K denotes all the points that are comparable to x and less than or equal to x (according to K ); the only point in common with S is x. For K = R+ , which induces the usual ordering on R, the concepts of minimal and minimum are the same, and agree with the usual definition of the minimum element of a set. Example 2.17 Consider the cone R2+ , which induces componentwise inequality in R2 . Here we can give some simple geometric descriptions of minimal and minimum elements. The inequality x y means y is above and to the right of x. To say that x ∈ S is the minimum element of a set S means that all other points of S lie above and to the right. To say that x is a minimal element of a set S means that no other point of S lies to the left and below x. This is illustrated in figure 2.17. Example 2.18 Minimum and minimal elements of a set of symmetric matrices. We associate with each A ∈ Sn ++ an ellipsoid centered at the origin, given by EA = {x | xT A−1 x ≤ 1}. We have A B if and only if EA ⊆ EB . Let v1 , . . . , vk ∈ Rn be given and define T −1 S = {P ∈ Sn vi ≤ 1, i = 1, . . . , k}, ++ | vi P 46 2 Convex sets S2 S1 x2 x1 Figure 2.17 Left. The set S1 has a minimum element x1 with respect to componentwise inequality in R2 . The set x1 + K is shaded lightly; x1 is the minimum element of S1 since S1 ⊆ x1 + K. Right. The point x2 is a minimal point of S2 . The set x2 − K is shown lightly shaded. The point x2 is minimal because x2 − K and S2 intersect only at x2 . which corresponds to the set of ellipsoids that contain the points v1 , . . . , vk . The set S does not have a minimum element: for any ellipsoid that contains the points v1 , . . . , vk we can find another one that contains the points, and is not comparable to it. An ellipsoid is minimal if it contains the points, but no smaller ellipsoid does. Figure 2.18 shows an example in R2 with k = 2. 2.5 Separating and supporting hyperplanes 2.5.1 Separating hyperplane theorem In this section we describe an idea that will be important later: the use of hyperplanes or affine functions to separate convex sets that do not intersect. The basic result is the separating hyperplane theorem: Suppose C and D are two convex sets that do not intersect, i.e., C ∩ D = ∅. Then there exist a 6= 0 and b such that aT x ≤ b for all x ∈ C and aT x ≥ b for all x ∈ D. In other words, the affine function aT x − b is nonpositive on C and nonnegative on D. The hyperplane {x | aT x = b} is called a separating hyperplane for the sets C and D, or is said to separate the sets C and D. This is illustrated in figure 2.19. Proof of separating hyperplane theorem Here we consider a special case, and leave the extension of the proof to the general case as an exercise (exercise 2.22). We assume that the (Euclidean) distance between C and D, defined as dist(C, D) = inf{ku − vk2 | u ∈ C, v ∈ D}, 2.5 Separating and supporting hyperplanes 47 E2 E1 E3 Figure 2.18 Three ellipsoids in R2 , centered at the origin (shown as the lower dot), that contain the points shown as the upper dots. The ellipsoid E1 is not minimal, since there exist ellipsoids that contain the points, and are smaller (e.g., E3 ). E3 is not minimal for the same reason. The ellipsoid E2 is minimal, since no other ellipsoid (centered at the origin) contains the points and is contained in E2 . aT x ≤ b aT x ≥ b D C a Figure 2.19 The hyperplane {x | aT x = b} separates the disjoint convex sets C and D. The affine function aT x − b is nonpositive on C and nonnegative on D. 48 2 Convex sets a C c d D Figure 2.20 Construction of a separating hyperplane between two convex sets. The points c ∈ C and d ∈ D are the pair of points in the two sets that are closest to each other. The separating hyperplane is orthogonal to, and bisects, the line segment between c and d. is positive, and that there exist points c ∈ C and d ∈ D that achieve the minimum distance, i.e., kc − dk2 = dist(C, D). (These conditions are satisfied, for example, when C and D are closed and one set is bounded.) Define kdk22 − kck22 . a = d − c, b= 2 We will show that the affine function f (x) = aT x − b = (d − c)T (x − (1/2)(d + c)) is nonpositive on C and nonnegative on D, i.e., that the hyperplane {x | aT x = b} separates C and D. This hyperplane is perpendicular to the line segment between c and d, and passes through its midpoint, as shown in figure 2.20. We first show that f is nonnegative on D. The proof that f is nonpositive on C is similar (or follows by swapping C and D and considering −f ). Suppose there were a point u ∈ D for which f (u) = (d − c)T (u − (1/2)(d + c)) < 0. (2.16) We can express f (u) as f (u) = (d − c)T (u − d + (1/2)(d − c)) = (d − c)T (u − d) + (1/2)kd − ck22 . We see that (2.16) implies (d − c)T (u − d) < 0. Now we observe that d kd + t(u − d) − ck22 = 2(d − c)T (u − d) < 0, dt t=0 so for some small t > 0, with t ≤ 1, we have kd + t(u − d) − ck2 < kd − ck2 , 2.5 Separating and supporting hyperplanes i.e., the point d + t(u − d) is closer to c than d is. Since D is convex and contains d and u, we have d + t(u − d) ∈ D. But this is impossible, since d is assumed to be the point in D that is closest to C. Example 2.19 Separation of an affine and a convex set. Suppose C is convex and D is affine, i.e., D = {F u + g | u ∈ Rm }, where F ∈ Rn×m . Suppose C and D are disjoint, so by the separating hyperplane theorem there are a 6= 0 and b such that aT x ≤ b for all x ∈ C and aT x ≥ b for all x ∈ D. Now aT x ≥ b for all x ∈ D means aT F u ≥ b − aT g for all u ∈ Rm . But a linear function is bounded below on Rm only when it is zero, so we conclude aT F = 0 (and hence, b ≤ aT g). Thus we conclude that there exists a 6= 0 such that F T a = 0 and aT x ≤ aT g for all x ∈ C. Strict separation The separating hyperplane we constructed above satisfies the stronger condition that aT x < b for all x ∈ C and aT x > b for all x ∈ D. This is called strict separation of the sets C and D. Simple examples show that in general, disjoint convex sets need not be strictly separable by a hyperplane (even when the sets are closed; see exercise 2.23). In many special cases, however, strict separation can be established. Example 2.20 Strict separation of a point and a closed convex set. Let C be a closed convex set and x0 6∈ C. Then there exists a hyperplane that strictly separates x0 from C. To see this, note that the two sets C and B(x0 , ǫ) do not intersect for some ǫ > 0. By the separating hyperplane theorem, there exist a 6= 0 and b such that aT x ≤ b for x ∈ C and aT x ≥ b for x ∈ B(x0 , ǫ). Using B(x0 , ǫ) = {x0 + u | kuk2 ≤ ǫ}, the second condition can be expressed as aT (x0 + u) ≥ b for all kuk2 ≤ ǫ. The u that minimizes the lefthand side is u = −ǫa/kak2 ; using this value we have aT x0 − ǫkak2 ≥ b. Therefore the affine function f (x) = aT x − b − ǫkak2 /2 is negative on C and positive at x0 . As an immediate consequence we can establish a fact that we already mentioned above: a closed convex set is the intersection of all halfspaces that contain it. Indeed, let C be closed and convex, and let S be the intersection of all halfspaces containing C. Obviously x ∈ C ⇒ x ∈ S. To show the converse, suppose there exists x ∈ S, x 6∈ C. By the strict separation result there exists a hyperplane that strictly separates x from C, i.e., there is a halfspace containing C but not x. In other words, x 6∈ S. 49 50 2 Convex sets Converse separating hyperplane theorems The converse of the separating hyperplane theorem (i.e., existence of a separating hyperplane implies that C and D do not intersect) is not true, unless one imposes additional constraints on C or D, even beyond convexity. As a simple counterexample, consider C = D = {0} ⊆ R. Here the hyperplane x = 0 separates C and D. By adding conditions on C and D various converse separation theorems can be derived. As a very simple example, suppose C and D are convex sets, with C open, and there exists an affine function f that is nonpositive on C and nonnegative on D. Then C and D are disjoint. (To see this we first note that f must be negative on C; for if f were zero at a point of C then f would take on positive values near the point, which is a contradiction. But then C and D must be disjoint since f is negative on C and nonnegative on D.) Putting this converse together with the separating hyperplane theorem, we have the following result: any two convex sets C and D, at least one of which is open, are disjoint if and only if there exists a separating hyperplane. Example 2.21 Theorem of alternatives for strict linear inequalities. We derive the necessary and sufficient conditions for solvability of a system of strict linear inequalities Ax ≺ b. (2.17) These inequalities are infeasible if and only if the (convex) sets C = {b − Ax | x ∈ Rn }, m D = Rm | y ≻ 0} ++ = {y ∈ R do not intersect. The set D is open; C is an affine set. Hence by the result above, C and D are disjoint if and only if there exists a separating hyperplane, i.e., a nonzero λ ∈ Rm and µ ∈ R such that λT y ≤ µ on C and λT y ≥ µ on D. Each of these conditions can be simplified. The first means λT (b − Ax) ≤ µ for all x. This implies (as in example 2.19) that AT λ = 0 and λT b ≤ µ. The second inequality means λT y ≥ µ for all y ≻ 0. This implies µ ≤ 0 and λ 0, λ 6= 0. Putting it all together, we find that the set of strict inequalities (2.17) is infeasible if and only if there exists λ ∈ Rm such that λ 6= 0, λ 0, AT λ = 0, λT b ≤ 0. (2.18) This is also a system of linear inequalities and linear equations in the variable λ ∈ Rm . We say that (2.17) and (2.18) form a pair of alternatives: for any data A and b, exactly one of them is solvable. 2.5.2 Supporting hyperplanes Suppose C ⊆ Rn , and x0 is a point in its boundary bd C, i.e., x0 ∈ bd C = cl C \ int C. If a 6= 0 satisfies aT x ≤ aT x0 for all x ∈ C, then the hyperplane {x | aT x = aT x0 } is called a supporting hyperplane to C at the point x0 . This is equivalent to saying 2.6 Dual cones and generalized inequalities 51 a x0 C Figure 2.21 The hyperplane {x | aT x = aT x0 } supports C at x0 . that the point x0 and the set C are separated by the hyperplane {x | aT x = aT x0 }. The geometric interpretation is that the hyperplane {x | aT x = aT x0 } is tangent to C at x0 , and the halfspace {x | aT x ≤ aT x0 } contains C. This is illustrated in figure 2.21. A basic result, called the supporting hyperplane theorem, states that for any nonempty convex set C, and any x0 ∈ bd C, there exists a supporting hyperplane to C at x0 . The supporting hyperplane theorem is readily proved from the separating hyperplane theorem. We distinguish two cases. If the interior of C is nonempty, the result follows immediately by applying the separating hyperplane theorem to the sets {x0 } and int C. If the interior of C is empty, then C must lie in an affine set of dimension less than n, and any hyperplane containing that affine set contains C and x0 , and is a (trivial) supporting hyperplane. There is also a partial converse of the supporting hyperplane theorem: If a set is closed, has nonempty interior, and has a supporting hyperplane at every point in its boundary, then it is convex. (See exercise 2.27.) 2.6 Dual cones and generalized inequalities 2.6.1 Dual cones Let K be a cone. The set K ∗ = {y | xT y ≥ 0 for all x ∈ K} (2.19) is called the dual cone of K. As the name suggests, K ∗ is a cone, and is always convex, even when the original cone K is not (see exercise 2.31). Geometrically, y ∈ K ∗ if and only if −y is the normal of a hyperplane that supports K at the origin. This is illustrated in figure 2.22. Example 2.22 Subspace. The dual cone of a subspace V ⊆ Rn (which is a cone) is its orthogonal complement V ⊥ = {y | y T v = 0 for all v ∈ V }. 52 2 y K Convex sets K z Figure 2.22 Left. The halfspace with inward normal y contains the cone K, so y ∈ K ∗ . Right. The halfspace with inward normal z does not contain K, so z 6∈ K ∗ . Example 2.23 Nonnegative orthant. The cone Rn + is its own dual: y T x ≥ 0 for all x 0 ⇐⇒ y 0. We call such a cone self-dual. Example 2.24 Positive semidefinite cone. On the P set of symmetric n × n matrices n Sn , we use the standard inner product tr(XY ) = Xij Yij (see §A.1.1). The i,j=1 n positive semidefinite cone S+ is self-dual, i.e., for X, Y ∈ Sn , tr(XY ) ≥ 0 for all X 0 ⇐⇒ Y 0. We will establish this fact. n Suppose Y 6∈ Sn + . Then there exists q ∈ R with q T Y q = tr(qq T Y ) < 0. Hence the positive semidefinite matrix X = qq T satisfies tr(XY ) < 0; it follows that ∗ Y 6∈ (Sn +) . n Now suppose Pn X, Y T∈ S+ . We can express X in terms of its eigenvalue decomposition as X = i=1 λi qi qi , where (the eigenvalues) λi ≥ 0, i = 1, . . . , n. Then we have tr(Y X) = tr Y n X i=1 This shows that Y λi qi qiT ! = n X i=1 λi qiT Y qi ≥ 0. ∗ ∈ (Sn +) . Example 2.25 Dual of a norm cone. Let k · k be a norm on Rn . The dual of the associated cone K = {(x, t) ∈ Rn+1 | kxk ≤ t} is the cone defined by the dual norm, i.e., K ∗ = {(u, v) ∈ Rn+1 | kuk∗ ≤ v}, 2.6 Dual cones and generalized inequalities 53 where the dual norm is given by kuk∗ = sup{uT x | kxk ≤ 1} (see (A.1.6)). To prove the result we have to show that xT u + tv ≥ 0 whenever kxk ≤ t ⇐⇒ kuk∗ ≤ v. (2.20) Let us start by showing that the righthand condition on (u, v) implies the lefthand condition. Suppose kuk∗ ≤ v, and kxk ≤ t for some t > 0. (If t = 0, x must be zero, so obviously uT x + vt ≥ 0.) Applying the definition of the dual norm, and the fact that k−x/tk ≤ 1, we have uT (−x/t) ≤ kuk∗ ≤ v, and therefore uT x + vt ≥ 0. Next we show that the lefthand condition in (2.20) implies the righthand condition in (2.20). Suppose kuk∗ > v, i.e., that the righthand condition does not hold. Then by the definition of the dual norm, there exists an x with kxk ≤ 1 and xT u > v. Taking t = 1, we have uT (−x) + v < 0, which contradicts the lefthand condition in (2.20). Dual cones satisfy several properties, such as: • K ∗ is closed and convex. • K1 ⊆ K2 implies K2∗ ⊆ K1∗ . • If K has nonempty interior, then K ∗ is pointed. • If the closure of K is pointed then K ∗ has nonempty interior. • K ∗∗ is the closure of the convex hull of K. (Hence if K is convex and closed, K ∗∗ = K.) (See exercise 2.31.) These properties show that if K is a proper cone, then so is its dual K ∗ , and moreover, that K ∗∗ = K. 2.6.2 Dual generalized inequalities Now suppose that the convex cone K is proper, so it induces a generalized inequality K . Then its dual cone K ∗ is also proper, and therefore induces a generalized inequality. We refer to the generalized inequality K ∗ as the dual of the generalized inequality K . Some important properties relating a generalized inequality and its dual are: • x K y if and only if λT x ≤ λT y for all λ K ∗ 0. • x ≺K y if and only if λT x < λT y for all λ K ∗ 0, λ 6= 0. Since K = K ∗∗ , the dual generalized inequality associated with K ∗ is K , so these properties hold if the generalized inequality and its dual are swapped. As a specific example, we have λ K ∗ µ if and only if λT x ≤ µT x for all x K 0. 54 2 Convex sets Example 2.26 Theorem of alternatives for linear strict generalized inequalities. Suppose K ⊆ Rm is a proper cone. Consider the strict generalized inequality Ax ≺K b, (2.21) where x ∈ Rn . We will derive a theorem of alternatives for this inequality. Suppose it is infeasible, i.e., the affine set {b − Ax | x ∈ Rn } does not intersect the open convex set int K. Then there is a separating hyperplane, i.e., a nonzero λ ∈ Rm and µ ∈ R such that λT (b − Ax) ≤ µ for all x, and λT y ≥ µ for all y ∈ int K. The first condition implies AT λ = 0 and λT b ≤ µ. The second condition implies λT y ≥ µ for all y ∈ K, which can only happen if λ ∈ K ∗ and µ ≤ 0. Putting it all together we find that if (2.21) is infeasible, then there exists λ such that λ 6= 0, λ K ∗ 0, AT λ = 0, λT b ≤ 0. (2.22) Now we show the converse: if (2.22) holds, then the inequality system (2.21) cannot be feasible. Suppose that both inequality systems hold. Then we have λT (b − Ax) > 0, since λ 6= 0, λ K ∗ 0, and b − Ax ≻K 0. But using AT λ = 0 we find that λT (b − Ax) = λT b ≤ 0, which is a contradiction. Thus, the inequality systems (2.21) and (2.22) are alternatives: for any data A, b, exactly one of them is feasible. (This generalizes the alternatives (2.17), (2.18) for the special case K = Rm + .) 2.6.3 Minimum and minimal elements via dual inequalities We can use dual generalized inequalities to characterize minimum and minimal elements of a (possibly nonconvex) set S ⊆ Rm with respect to the generalized inequality induced by a proper cone K. Dual characterization of minimum element We first consider a characterization of the minimum element: x is the minimum element of S, with respect to the generalized inequality K , if and only if for all λ ≻K ∗ 0, x is the unique minimizer of λT z over z ∈ S. Geometrically, this means that for any λ ≻K ∗ 0, the hyperplane {z | λT (z − x) = 0} is a strict supporting hyperplane to S at x. (By strict supporting hyperplane, we mean that the hyperplane intersects S only at the point x.) Note that convexity of the set S is not required. This is illustrated in figure 2.23. To show this result, suppose x is the minimum element of S, i.e., x K z for all z ∈ S, and let λ ≻K ∗ 0. Let z ∈ S, z 6= x. Since x is the minimum element of S, we have z − x K 0. From λ ≻K ∗ 0 and z − x K 0, z − x 6= 0, we conclude λT (z − x) > 0. Since z is an arbitrary element of S, not equal to x, this shows that x is the unique minimizer of λT z over z ∈ S. Conversely, suppose that for all λ ≻K ∗ 0, x is the unique minimizer of λT z over z ∈ S, but x is not the minimum 2.6 Dual cones and generalized inequalities S x Figure 2.23 Dual characterization of minimum element. The point x is the minimum element of the set S with respect to R2+ . This is equivalent to: for every λ ≻ 0, the hyperplane {z | λT (z − x) = 0} strictly supports S at x, i.e., contains S on one side, and touches it only at x. element of S. Then there exists z ∈ S with z 6K x. Since z − x 6K 0, there exists λ̃ K ∗ 0 with λ̃T (z − x) < 0. Hence λT (z − x) < 0 for λ ≻K ∗ 0 in the neighborhood of λ̃. This contradicts the assumption that x is the unique minimizer of λT z over S. Dual characterization of minimal elements We now turn to a similar characterization of minimal elements. Here there is a gap between the necessary and sufficient conditions. If λ ≻K ∗ 0 and x minimizes λT z over z ∈ S, then x is minimal. This is illustrated in figure 2.24. To show this, suppose that λ ≻K ∗ 0, and x minimizes λT z over S, but x is not minimal, i.e., there exists a z ∈ S, z 6= x, and z K x. Then λT (x − z) > 0, which contradicts our assumption that x is the minimizer of λT z over S. The converse is in general false: a point x can be minimal in S, but not a minimizer of λT z over z ∈ S, for any λ, as shown in figure 2.25. This figure suggests that convexity plays an important role in the converse, which is correct. Provided the set S is convex, we can say that for any minimal element x there exists a nonzero λ K ∗ 0 such that x minimizes λT z over z ∈ S. To show this, suppose x is minimal, which means that ((x − K) \ {x}) ∩ S = ∅. Applying the separating hyperplane theorem to the convex sets (x − K) \ {x} and S, we conclude that there is a λ 6= 0 and µ such that λT (x − y) ≤ µ for all y ∈ K, and λT z ≥ µ for all z ∈ S. From the first inequality we conclude λ K ∗ 0. Since x ∈ S and x ∈ x − K, we have λT x = µ, so the second inequality implies that µ is the minimum value of λT z over S. Therefore, x is a minimizer of λT z over S, where λ 6= 0, λ K ∗ 0. This converse theorem cannot be strengthened to λ ≻K ∗ 0. Examples show that a point x can be a minimal point of a convex set S, but not a minimizer of 55 56 2 Convex sets λ1 x1 S λ2 x2 Figure 2.24 A set S ⊆ R2 . Its set of minimal points, with respect to R2+ , is shown as the darker section of its (lower, left) boundary. The minimizer of λT1 z over S is x1 , and is minimal since λ1 ≻ 0. The minimizer of λT2 z over S is x2 , which is another minimal point of S, since λ2 ≻ 0. S x Figure 2.25 The point x is a minimal element of S ⊆ R2 with respect to R2+ . However there exists no λ for which x minimizes λT z over z ∈ S. 2.6 Dual cones and generalized inequalities x1 S1 57 S2 x2 Figure 2.26 Left. The point x1 ∈ S1 is minimal, but is not a minimizer of λT z over S1 for any λ ≻ 0. (It does, however, minimize λT z over z ∈ S1 for λ = (1, 0).) Right. The point x2 ∈ S2 is not minimal, but it does minimize λT z over z ∈ S2 for λ = (0, 1) 0. λT z over z ∈ S for any λ ≻K ∗ 0. (See figure 2.26, left.) Nor is it true that any minimizer of λT z over z ∈ S, with λ K ∗ 0, is minimal (see figure 2.26, right.) Example 2.27 Pareto optimal production frontier. We consider a product which requires n resources (such as labor, electricity, natural gas, water) to manufacture. The product can be manufactured or produced in many ways. With each production method, we associate a resource vector x ∈ Rn , where xi denotes the amount of resource i consumed by the method to manufacture the product. We assume that xi ≥ 0 (i.e., resources are consumed by the production methods) and that the resources are valuable (so using less of any resource is preferred). The production set P ⊆ Rn is defined as the set of all resource vectors x that correspond to some production method. Production methods with resource vectors that are minimal elements of P , with respect to componentwise inequality, are called Pareto optimal or efficient. The set of minimal elements of P is called the efficient production frontier. We can give a simple interpretation of Pareto optimality. We say that one production method, with resource vector x, is better than another, with resource vector y, if xi ≤ yi for all i, and for some i, xi < yi . In other words, one production method is better than another if it uses no more of each resource than another method, and for at least one resource, actually uses less. This corresponds to x y, x 6= y. Then we can say: A production method is Pareto optimal or efficient if there is no better production method. We can find Pareto optimal production methods (i.e., minimal resource vectors) by minimizing λT x = λ1 x1 + · · · + λn xn over the set P of production vectors, using any λ that satisfies λ ≻ 0. Here the vector λ has a simple interpretation: λi is the price of resource i. By minimizing λT x over P we are finding the overall cheapest production method (for the resource prices λi ). As long as the prices are positive, the resulting production method is guaranteed to be efficient. These ideas are illustrated in figure 2.27. 58 2 Convex sets fuel P x1 x2 x5 x4 λ x3 labor Figure 2.27 The production set P , for a product that requires labor and fuel to produce, is shown shaded. The two dark curves show the efficient production frontier. The points x1 , x2 and x3 are efficient. The points x4 and x5 are not (since in particular, x2 corresponds to a production method that uses no more fuel, and less labor). The point x1 is also the minimum cost production method for the price vector λ (which is positive). The point x2 is efficient, but cannot be found by minimizing the total cost λT x for any price vector λ 0. Bibliography Bibliography Minkowski is generally credited with the first systematic study of convex sets, and the introduction of fundamental concepts such as supporting hyperplanes and the supporting hyperplane theorem, the Minkowski distance function (exercise 3.34), extreme points of a convex set, and many others. Some well known early surveys are Bonnesen and Fenchel [BF48], Eggleston [Egg58], Klee [Kle63], and Valentine [Val64]. More recent books devoted to the geometry of convex sets include Lay [Lay82] and Webster [Web94]. Klee [Kle71], Fenchel [Fen83], Tikhomorov [Tik90], and Berger [Ber90] give very readable overviews of the history of convexity and its applications throughout mathematics. Linear inequalities and polyhedral sets are studied extensively in connection with the linear programming problem, for which we give references at the end of chapter 4. Some landmark publications in the history of linear inequalities and linear programming are Motzkin [Mot33], von Neumann and Morgenstern [vNM53], Kantorovich [Kan60], Koopmans [Koo51], and Dantzig [Dan63]. Dantzig [Dan63, Chapter 2] includes an historical survey of linear inequalities, up to around 1963. Generalized inequalities were introduced in nonlinear optimization during the 1960s (see Luenberger [Lue69, §8.2] and Isii [Isi64]), and are used extensively in cone programming (see the references in chapter 4). Bellman and Fan [BF63] is an early paper on sets of generalized linear inequalities (with respect to the positive semidefinite cone). For extensions and a proof of the separating hyperplane theorem we refer the reader to Rockafellar [Roc70, part III], and Hiriart-Urruty and Lemaréchal [HUL93, volume 1, §III4]. Dantzig [Dan63, page 21] attributes the term theorem of the alternative to von Neumann and Morgenstern [vNM53, page 138]. For more references on theorems of alternatives, see chapter 5. The terminology of example 2.27 (including Pareto optimality, efficient production, and the price interpretation of λ) is discussed in detail by Luenberger [Lue95]. Convex geometry plays a prominent role in the classical theory of moments (Krein and Nudelman [KN77], Karlin and Studden [KS66]). A famous example is the duality between the cone of nonnegative polynomials and the cone of power moments; see exercise 2.37. 59 60 2 Convex sets Exercises Definition of convexity 2.1 Let C ⊆ Rn be a convex set, with x1 , . . . , xk ∈ C, and let θ1 , . . . , θk ∈ R satisfy θi ≥ 0, θ1 + · · · + θk = 1. Show that θ1 x1 + · · · + θk xk ∈ C. (The definition of convexity is that this holds for k = 2; you must show it for arbitrary k.) Hint. Use induction on k. 2.2 Show that a set is convex if and only if its intersection with any line is convex. Show that a set is affine if and only if its intersection with any line is affine. 2.3 Midpoint convexity. A set C is midpoint convex if whenever two points a, b are in C, the average or midpoint (a + b)/2 is in C. Obviously a convex set is midpoint convex. It can be proved that under mild conditions midpoint convexity implies convexity. As a simple case, prove that if C is closed and midpoint convex, then C is convex. 2.4 Show that the convex hull of a set S is the intersection of all convex sets that contain S. (The same method can be used to show that the conic, or affine, or linear hull of a set S is the intersection of all conic sets, or affine sets, or subspaces that contain S.) Examples 2.5 What is the distance between two parallel hyperplanes {x ∈ Rn | aT x = b1 } and {x ∈ Rn | aT x = b2 }? 2.6 When does one halfspace contain another? Give conditions under which {x | aT x ≤ b} ⊆ {x | ãT x ≤ b̃} (where a 6= 0, ã 6= 0). Also find the conditions under which the two halfspaces are equal. 2.7 Voronoi description of halfspace. Let a and b be distinct points in Rn . Show that the set of all points that are closer (in Euclidean norm) to a than b, i.e., {x | kx−ak2 ≤ kx−bk2 }, is a halfspace. Describe it explicitly as an inequality of the form cT x ≤ d. Draw a picture. 2.8 Which of the following sets S are polyhedra? If possible, express S in the form S = {x | Ax b, F x = g}. (a) S = {y1 a1 + y2 a2 | − 1 ≤ y1 ≤ 1, − 1 ≤ y2 ≤ 1}, where a1 , a2 ∈ Rn . (b) S = {x ∈ Rn | x 0, 1T x = 1, a1 , . . . , an ∈ R and b1 , b2 ∈ R. Pn i=1 xi ai = b1 , (c) S = {x ∈ Rn | x 0, xT y ≤ 1 for all y with kyk2 = 1}. (d) S = {x ∈ Rn | x 0, xT y ≤ 1 for all y with Pn i=1 Pn i=1 xi a2i = b2 }, where |yi | = 1}. 2.9 Voronoi sets and polyhedral decomposition. Let x0 , . . . , xK ∈ Rn . Consider the set of points that are closer (in Euclidean norm) to x0 than the other xi , i.e., V = {x ∈ Rn | kx − x0 k2 ≤ kx − xi k2 , i = 1, . . . , K}. V is called the Voronoi region around x0 with respect to x1 , . . . , xK . (a) Show that V is a polyhedron. Express V in the form V = {x | Ax b}. (b) Conversely, given a polyhedron P with nonempty interior, show how to find x0 , . . . , xK so that the polyhedron is the Voronoi region of x0 with respect to x1 , . . . , xK . (c) We can also consider the sets Vk = {x ∈ Rn | kx − xk k2 ≤ kx − xi k2 , i 6= k}. The set Vk consists of points in Rn for which the closest point in the set {x0 , . . . , xK } is xk . Exercises 61 The sets V0 , . . . , VKSgive a polyhedral decomposition of Rn . More precisely, the sets K Vk are polyhedra, k=0 Vk = Rn , and int Vi ∩ int Vj = ∅ for i 6= j, i.e., Vi and Vj intersect at most along a boundary. Sm Suppose that P1 , . . . , Pm are polyhedra such that i=1 Pi = Rn , and int Pi ∩ int Pj = ∅ for i 6= j. Can this polyhedral decomposition of Rn be described as the Voronoi regions generated by an appropriate set of points? 2.10 Solution set of a quadratic inequality. Let C ⊆ Rn be the solution set of a quadratic inequality, C = {x ∈ Rn | xT Ax + bT x + c ≤ 0}, with A ∈ Sn , b ∈ Rn , and c ∈ R. (a) Show that C is convex if A 0. (b) Show that the intersection of C and the hyperplane defined by g T x + h = 0 (where g 6= 0) is convex if A + λgg T 0 for some λ ∈ R. Are the converses of these statements true? 2 2.11 Hyperbolic sets. Show that the hyperbolic Qn set {x ∈ R+ | x1 x2 ≥ 1} is convex. As a n generalization, show that {x ∈ R+ | x ≥ 1} is convex. Hint. If a, b ≥ 0 and i=1 i 0 ≤ θ ≤ 1, then aθ b1−θ ≤ θa + (1 − θ)b; see §3.1.9. 2.12 Which of the following sets are convex? (a) A slab, i.e., a set of the form {x ∈ Rn | α ≤ aT x ≤ β}. (b) A rectangle, i.e., a set of the form {x ∈ Rn | αi ≤ xi ≤ βi , i = 1, . . . , n}. A rectangle is sometimes called a hyperrectangle when n > 2. (c) A wedge, i.e., {x ∈ Rn | aT1 x ≤ b1 , aT2 x ≤ b2 }. (d) The set of points closer to a given point than a given set, i.e., {x | kx − x0 k2 ≤ kx − yk2 for all y ∈ S} where S ⊆ Rn . (e) The set of points closer to one set than another, i.e., {x | dist(x, S) ≤ dist(x, T )}, n where S, T ⊆ R , and dist(x, S) = inf{kx − zk2 | z ∈ S}. (f) [HUL93, volume 1, page 93] The set {x | x + S2 ⊆ S1 }, where S1 , S2 ⊆ Rn with S1 convex. (g) The set of points whose distance to a does not exceed a fixed fraction θ of the distance to b, i.e., the set {x | kx − ak2 ≤ θkx − bk2 }. You can assume a 6= b and 0 ≤ θ ≤ 1. 2.13 Conic hull of outer products. Consider the set of rank-k outer products, defined as {XX T | X ∈ Rn×k , rank X = k}. Describe its conic hull in simple terms. 2.14 Expanded and restricted sets. Let S ⊆ Rn , and let k · k be a norm on Rn . (a) For a ≥ 0 we define Sa as {x | dist(x, S) ≤ a}, where dist(x, S) = inf y∈S kx − yk. We refer to Sa as S expanded or extended by a. Show that if S is convex, then Sa is convex. (b) For a ≥ 0 we define S−a = {x | B(x, a) ⊆ S}, where B(x, a) is the ball (in the norm k · k), centered at x, with radius a. We refer to S−a as S shrunk or restricted by a, since S−a consists of all points that are at least a distance a from Rn \S. Show that if S is convex, then S−a is convex. 62 2 Convex sets 2.15 Some sets of probability distributions. Let x be a real-valued random variable with prob(x = ai ) = pi , i = 1, . . . , n, where a1 < a2 < · · · < an . Of course p ∈ Rn lies in the standard probability simplex P = {p | 1T p = 1, p 0}. Which of the following conditions are convex in p? (That is, for which of the following conditions is the set of p ∈ P that satisfy the condition convex?) (a) P α ≤ E f (x) ≤ β, where E f (x) is the expected value of f (x), i.e., E f (x) = n p f (ai ). (The function f : R → R is given.) i=1 i (b) prob(x > α) ≤ β. (c) E |x3 | ≤ α E |x|. (d) E x2 ≤ α. (e) E x2 ≥ α. (f) var(x) ≤ α, where var(x) = E(x − E x)2 is the variance of x. (g) var(x) ≥ α. (h) quartile(x) ≥ α, where quartile(x) = inf{β | prob(x ≤ β) ≥ 0.25}. (i) quartile(x) ≤ α. Operations that preserve convexity 2.16 Show that if S1 and S2 are convex sets in Rm×n , then so is their partial sum S = {(x, y1 + y2 ) | x ∈ Rm , y1 , y2 ∈ Rn , (x, y1 ) ∈ S1 , (x, y2 ) ∈ S2 }. 2.17 Image of polyhedral sets under perspective function. In this problem we study the image of hyperplanes, halfspaces, and polyhedra under the perspective function P (x, t) = x/t, with dom P = Rn × R++ . For each of the following sets C, give a simple description of P (C) = {v/t | (v, t) ∈ C, t > 0}. (a) The polyhedron C = conv{(v1 , t1 ), . . . , (vK , tK )} where vi ∈ Rn and ti > 0. (b) The hyperplane C = {(v, t) | f T v + gt = h} (with f and g not both zero). (c) The halfspace C = {(v, t) | f T v + gt ≤ h} (with f and g not both zero). (d) The polyhedron C = {(v, t) | F v + gt h}. 2.18 Invertible linear-fractional functions. Let f : Rn → Rn be the linear-fractional function f (x) = (Ax + b)/(cT x + d), Suppose the matrix Q= dom f = {x | cT x + d > 0}. A cT b d is nonsingular. Show that f is invertible and that f −1 is a linear-fractional mapping. Give an explicit expression for f −1 and its domain in terms of A, b, c, and d. Hint. It may be easier to express f −1 in terms of Q. 2.19 Linear-fractional functions and convex sets. Let f : Rm → Rn be the linear-fractional function f (x) = (Ax + b)/(cT x + d), dom f = {x | cT x + d > 0}. In this problem we study the inverse image of a convex set C under f , i.e., f −1 (C) = {x ∈ dom f | f (x) ∈ C}. For each of the following sets C ⊆ Rn , give a simple description of f −1 (C). Exercises 63 (a) The halfspace C = {y | g T y ≤ h} (with g 6= 0). (b) The polyhedron C = {y | Gy h}. (c) The ellipsoid {y | y T P −1 y ≤ 1} (where P ∈ Sn ++ ). (d) The solution set of a linear matrix inequality, C = {y | y1 A1 + · · · + yn An B}, where A1 , . . . , An , B ∈ Sp . Separation theorems and supporting hyperplanes 2.20 Strictly positive solution of linear equations. Suppose A ∈ Rm×n , b ∈ Rm , with b ∈ R(A). Show that there exists an x satisfying x ≻ 0, Ax = b if and only if there exists no λ with AT λ 0, AT λ 6= 0, bT λ ≤ 0. Hint. First prove the following fact from linear algebra: cT x = d for all x satisfying Ax = b if and only if there is a vector λ such that c = AT λ, d = bT λ. 2.21 The set of separating hyperplanes. Suppose that C and D are disjoint subsets of Rn . Consider the set of (a, b) ∈ Rn+1 for which aT x ≤ b for all x ∈ C, and aT x ≥ b for all x ∈ D. Show that this set is a convex cone (which is the singleton {0} if there is no hyperplane that separates C and D). 2.22 Finish the proof of the separating hyperplane theorem in §2.5.1: Show that a separating hyperplane exists for two disjoint convex sets C and D. You can use the result proved in §2.5.1, i.e., that a separating hyperplane exists when there exist points in the two sets whose distance is equal to the distance between the two sets. Hint. If C and D are disjoint convex sets, then the set {x − y | x ∈ C, y ∈ D} is convex and does not contain the origin. 2.23 Give an example of two closed convex sets that are disjoint but cannot be strictly separated. 2.24 Supporting hyperplanes. (a) Express the closed convex set {x ∈ R2+ | x1 x2 ≥ 1} as an intersection of halfspaces. (b) Let C = {x ∈ Rn | kxk∞ ≤ 1}, the ℓ∞ -norm unit ball in Rn , and let x̂ be a point in the boundary of C. Identify the supporting hyperplanes of C at x̂ explicitly. 2.25 Inner and outer polyhedral approximations. Let C ⊆ Rn be a closed convex set, and suppose that x1 , . . . , xK are on the boundary of C. Suppose that for each i, aTi (x−xi ) = 0 defines a supporting hyperplane for C at xi , i.e., C ⊆ {x | aTi (x − xi ) ≤ 0}. Consider the two polyhedra Pinner = conv{x1 , . . . , xK }, Pouter = {x | aTi (x − xi ) ≤ 0, i = 1, . . . , K}. Show that Pinner ⊆ C ⊆ Pouter . Draw a picture illustrating this. 2.26 Support function. The support function of a set C ⊆ Rn is defined as SC (y) = sup{y T x | x ∈ C}. (We allow SC (y) to take on the value +∞.) Suppose that C and D are closed convex sets in Rn . Show that C = D if and only if their support functions are equal. 2.27 Converse supporting hyperplane theorem. Suppose the set C is closed, has nonempty interior, and has a supporting hyperplane at every point in its boundary. Show that C is convex. 64 2 Convex sets Convex cones and generalized inequalities 2.28 Positive semidefinite cone for n = 1, 2, 3. Give an explicit description of the positive semidefinite cone Sn + , in terms of the matrix coefficients and ordinary inequalities, for n = 1, 2, 3. To describe a general element of Sn , for n = 1, 2, 3, use the notation x1 , x1 x2 x2 x3 , " x1 x2 x3 x2 x4 x5 x3 x5 x6 # . 2.29 Cones in R2 . Suppose K ⊆ R2 is a closed convex cone. (a) Give a simple description of K in terms of the polar coordinates of its elements (x = r(cos φ, sin φ) with r ≥ 0). (b) Give a simple description of K ∗ , and draw a plot illustrating the relation between K and K ∗ . (c) When is K pointed? (d) When is K proper (hence, defines a generalized inequality)? Draw a plot illustrating what x K y means when K is proper. 2.30 Properties of generalized inequalities. Prove the properties of (nonstrict and strict) generalized inequalities listed in §2.4.1. 2.31 Properties of dual cones. Let K ∗ be the dual cone of a convex cone K, as defined in (2.19). Prove the following. (a) K ∗ is indeed a convex cone. (b) K1 ⊆ K2 implies K2∗ ⊆ K1∗ . (c) K ∗ is closed. (d) The interior of K ∗ is given by int K ∗ = {y | y T x > 0 for all x ∈ cl K}. (e) If K has nonempty interior then K ∗ is pointed. (f) K ∗∗ is the closure of K. (Hence if K is closed, K ∗∗ = K.) (g) If the closure of K is pointed then K ∗ has nonempty interior. 2.32 Find the dual cone of {Ax | x 0}, where A ∈ Rm×n . 2.33 The monotone nonnegative cone. We define the monotone nonnegative cone as Km+ = {x ∈ Rn | x1 ≥ x2 ≥ · · · ≥ xn ≥ 0}. i.e., all nonnegative vectors with components sorted in nonincreasing order. (a) Show that Km+ is a proper cone. ∗ (b) Find the dual cone Km+ . Hint. Use the identity n X i=1 xi yi = (x1 − x2 )y1 + (x2 − x3 )(y1 + y2 ) + (x3 − x4 )(y1 + y2 + y3 ) + · · · + (xn−1 − xn )(y1 + · · · + yn−1 ) + xn (y1 + · · · + yn ). 2.34 The lexicographic cone and ordering. The lexicographic cone is defined as Klex = {0} ∪ {x ∈ Rn | x1 = · · · = xk = 0, xk+1 > 0, for some k, 0 ≤ k < n}, i.e., all vectors whose first nonzero coefficient (if any) is positive. (a) Verify that Klex is a cone, but not a proper cone. Exercises 65 (b) We define the lexicographic ordering on Rn as follows: x ≤lex y if and only if y − x ∈ Klex . (Since Klex is not a proper cone, the lexicographic ordering is not a generalized inequality.) Show that the lexicographic ordering is a linear ordering: for any x, y ∈ Rn , either x ≤lex y or y ≤lex x. Therefore any set of vectors can be sorted with respect to the lexicographic cone, which yields the familiar sorting used in dictionaries. ∗ (c) Find Klex . 2.35 Copositive matrices. A matrix X ∈ Sn is called copositive if z T Xz ≥ 0 for all z 0. Verify that the set of copositive matrices is a proper cone. Find its dual cone. 2.36 Euclidean distance matrices. Let x1 , . . . , xn ∈ Rk . The matrix D ∈ Sn defined by Dij = kxi − xj k22 is called a Euclidean distance matrix. It satisfies some obvious properties such 1/2 1/2 1/2 as Dij = Dji , Dii = 0, Dij ≥ 0, and (from the triangle inequality) Dik ≤ Dij + Djk . We now pose the question: When is a matrix D ∈ Sn a Euclidean distance matrix (for some points in Rk , for some k)? A famous result answers this question: D ∈ Sn is a Euclidean distance matrix if and only if Dii = 0 and xT Dx ≤ 0 for all x with 1T x = 0. (See §8.3.3.) Show that the set of Euclidean distance matrices is a convex cone. 2.37 Nonnegative polynomials and Hankel LMIs. Let Kpol be the set of (coefficients of) nonnegative polynomials of degree 2k on R: Kpol = {x ∈ R2k+1 | x1 + x2 t + x3 t2 + · · · + x2k+1 t2k ≥ 0 for all t ∈ R}. (a) Show that Kpol is a proper cone. (b) A basic result states that a polynomial of degree 2k is nonnegative on R if and only if it can be expressed as the sum of squares of two polynomials of degree k or less. In other words, x ∈ Kpol if and only if the polynomial p(t) = x1 + x2 t + x3 t2 + · · · + x2k+1 t2k can be expressed as p(t) = r(t)2 + s(t)2 , where r and s are polynomials of degree k. Use this result to show that Kpol = ( x∈R 2k+1 xi = X Ymn for some Y m+n=i+1 ∈ Sk+1 + ) . In other words, p(t) = x1 + x2 t + x3 t2 + · · · + x2k+1 t2k is nonnegative if and only if there exists a matrix Y ∈ Sk+1 such that + x1 x2 x3 x2k+1 = = = .. . = Y11 Y12 + Y21 Y13 + Y22 + Y31 Yk+1,k+1 . ∗ = Khan where (c) Show that Kpol Khan = {z ∈ R2k+1 | H(z) 0} 66 2 and H(z) = z1 z2 z3 .. . zk zk+1 z2 z3 z4 .. . z3 z4 z5 .. . zk+1 zk+2 zk+2 zk+3 ··· ··· ··· .. . ··· ··· zk zk+1 zk+2 .. . z2k−1 z2k (This is the Hankel matrix with coefficients z1 , . . . , z2k+1 .) Convex sets zk+1 zk+2 zk+4 . .. . z2k z2k+1 (d) Let Kmom be the conic hull of the set of all vectors of the form (1, t, t2 , . . . , t2k ), where t ∈ R. Show that y ∈ Kmom if and only if y1 ≥ 0 and y = y1 (1, E u, E u2 , . . . , E u2k ) for some random variable u. In other words, the elements of Kmom are nonnegative multiples of the moment vectors of all possible distributions on R. Show that Kpol = ∗ Kmom . (e) Combining the results of (c) and (d), conclude that Khan = cl Kmom . As an example illustrating the relation between Kmom and Khan , take k = 2 and z = (1, 0, 0, 0, 1). Show that z ∈ Khan , z 6∈ Kmom . Find an explicit sequence of points in Kmom which converge to z. 2.38 [Roc70, pages 15, 61] Convex cones constructed from sets. (a) The barrier cone of a set C is defined as the set of all vectors y such that y T x is bounded above over x ∈ C. In other words, a nonzero vector y is in the barrier cone if and only if it is the normal vector of a halfspace {x | y T x ≤ α} that contains C. Verify that the barrier cone is a convex cone (with no assumptions on C). (b) The recession cone (also called asymptotic cone) of a set C is defined as the set of all vectors y such that for each x ∈ C, x − ty ∈ C for all t ≥ 0. Show that the recession cone of a convex set is a convex cone. Show that if C is nonempty, closed, and convex, then the recession cone of C is the dual of the barrier cone. (c) The normal cone of a set C at a boundary point x0 is the set of all vectors y such that y T (x − x0 ) ≤ 0 for all x ∈ C (i.e., the set of vectors that define a supporting hyperplane to C at x0 ). Show that the normal cone is a convex cone (with no assumptions on C). Give a simple description of the normal cone of a polyhedron {x | Ax b} at a point in its boundary. 2.39 Separation of cones. Let K and K̃ be two convex cones whose interiors are nonempty and disjoint. Show that there is a nonzero y such that y ∈ K ∗ , −y ∈ K̃ ∗ . Chapter 3 Convex functions 3.1 Basic properties and examples 3.1.1 Definition A function f : Rn → R is convex if dom f is a convex set and if for all x, y ∈ dom f , and θ with 0 ≤ θ ≤ 1, we have f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y). (3.1) Geometrically, this inequality means that the line segment between (x, f (x)) and (y, f (y)), which is the chord from x to y, lies above the graph of f (figure 3.1). A function f is strictly convex if strict inequality holds in (3.1) whenever x 6= y and 0 < θ < 1. We say f is concave if −f is convex, and strictly concave if −f is strictly convex. For an affine function we always have equality in (3.1), so all affine (and therefore also linear) functions are both convex and concave. Conversely, any function that is convex and concave is affine. A function is convex if and only if it is convex when restricted to any line that intersects its domain. In other words f is convex if and only if for all x ∈ dom f and (y, f (y)) (x, f (x)) Figure 3.1 Graph of a convex function. The chord (i.e., line segment) between any two points on the graph lies above the graph. 68 3 Convex functions all v, the function g(t) = f (x + tv) is convex (on its domain, {t | x + tv ∈ dom f }). This property is very useful, since it allows us to check whether a function is convex by restricting it to a line. The analysis of convex functions is a well developed field, which we will not pursue in any depth. One simple result, for example, is that a convex function is continuous on the relative interior of its domain; it can have discontinuities only on its relative boundary. 3.1.2 Extended-value extensions It is often convenient to extend a convex function to all of Rn by defining its value to be ∞ outside its domain. If f is convex we define its extended-value extension f˜ : Rn → R ∪ {∞} by f˜(x) = f (x) x ∈ dom f ∞ x∈ 6 dom f. The extension f˜ is defined on all Rn , and takes values in R ∪ {∞}. We can recover the domain of the original function f from the extension f˜ as dom f = {x | f˜(x) < ∞}. The extension can simplify notation, since we do not need to explicitly describe the domain, or add the qualifier ‘for all x ∈ dom f ’ every time we refer to f (x). Consider, for example, the basic defining inequality (3.1). In terms of the extension f˜, we can express it as: for 0 < θ < 1, f˜(θx + (1 − θ)y) ≤ θf˜(x) + (1 − θ)f˜(y) for any x and y. (For θ = 0 or θ = 1 the inequality always holds.) Of course here we must interpret the inequality using extended arithmetic and ordering. For x and y both in dom f , this inequality coincides with (3.1); if either is outside dom f , then the righthand side is ∞, and the inequality therefore holds. As another example of this notational device, suppose f1 and f2 are two convex functions on Rn . The pointwise sum f = f1 + f2 is the function with domain dom f = dom f1 ∩ dom f2 , with f (x) = f1 (x) + f2 (x) for any x ∈ dom f . Using extended-value extensions we can simply say that for any x, f˜(x) = f˜1 (x) + f˜2 (x). In this equation the domain of f has been automatically defined as dom f = dom f1 ∩ dom f2 , since f˜(x) = ∞ whenever x 6∈ dom f1 or x 6∈ dom f2 . In this example we are relying on extended arithmetic to automatically define the domain. In this book we will use the same symbol to denote a convex function and its extension, whenever there is no harm from the ambiguity. This is the same as assuming that all convex functions are implicitly extended, i.e., are defined as ∞ outside their domains. Example 3.1 Indicator function of a convex set. Let C ⊆ Rn be a convex set, and consider the (convex) function IC with domain C and IC (x) = 0 for all x ∈ C. In other words, the function is identically zero on the set C. Its extended-value extension 3.1 Basic properties and examples 69 f (y) f (x) + ∇f (x)T (y − x) (x, f (x)) Figure 3.2 If f is convex and differentiable, then f (x)+∇f (x)T (y−x) ≤ f (y) for all x, y ∈ dom f . is given by I˜C (x) = 0 ∞ x∈C x 6∈ C. The convex function I˜C is called the indicator function of the set C. We can play several notational tricks with the indicator function I˜C . For example the problem of minimizing a function f (defined on all of Rn , say) on the set C is the same as minimizing the function f + I˜C over all of Rn . Indeed, the function f + I˜C is (by our convention) f restricted to the set C. In a similar way we can extend a concave function by defining it to be −∞ outside its domain. 3.1.3 First-order conditions Suppose f is differentiable (i.e., its gradient ∇f exists at each point in dom f , which is open). Then f is convex if and only if dom f is convex and f (y) ≥ f (x) + ∇f (x)T (y − x) (3.2) holds for all x, y ∈ dom f . This inequality is illustrated in figure 3.2. The affine function of y given by f (x)+∇f (x)T (y−x) is, of course, the first-order Taylor approximation of f near x. The inequality (3.2) states that for a convex function, the first-order Taylor approximation is in fact a global underestimator of the function. Conversely, if the first-order Taylor approximation of a function is always a global underestimator of the function, then the function is convex. The inequality (3.2) shows that from local information about a convex function (i.e., its value and derivative at a point) we can derive global information (i.e., a global underestimator of it). This is perhaps the most important property of convex functions, and explains some of the remarkable properties of convex functions and convex optimization problems. As one simple example, the inequality (3.2) shows that if ∇f (x) = 0, then for all y ∈ dom f , f (y) ≥ f (x), i.e., x is a global minimizer of the function f . 70 3 Convex functions Strict convexity can also be characterized by a first-order condition: f is strictly convex if and only if dom f is convex and for x, y ∈ dom f , x 6= y, we have f (y) > f (x) + ∇f (x)T (y − x). (3.3) For concave functions we have the corresponding characterization: f is concave if and only if dom f is convex and f (y) ≤ f (x) + ∇f (x)T (y − x) for all x, y ∈ dom f . Proof of first-order convexity condition To prove (3.2), we first consider the case n = 1: We show that a differentiable function f : R → R is convex if and only if f (y) ≥ f (x) + f ′ (x)(y − x) (3.4) for all x and y in dom f . Assume first that f is convex and x, y ∈ dom f . Since dom f is convex (i.e., an interval), we conclude that for all 0 < t ≤ 1, x + t(y − x) ∈ dom f , and by convexity of f , f (x + t(y − x)) ≤ (1 − t)f (x) + tf (y). If we divide both sides by t, we obtain f (y) ≥ f (x) + f (x + t(y − x)) − f (x) , t and taking the limit as t → 0 yields (3.4). To show sufficiency, assume the function satisfies (3.4) for all x and y in dom f (which is an interval). Choose any x 6= y, and 0 ≤ θ ≤ 1, and let z = θx + (1 − θ)y. Applying (3.4) twice yields f (x) ≥ f (z) + f ′ (z)(x − z), f (y) ≥ f (z) + f ′ (z)(y − z). Multiplying the first inequality by θ, the second by 1 − θ, and adding them yields θf (x) + (1 − θ)f (y) ≥ f (z), which proves that f is convex. Now we can prove the general case, with f : Rn → R. Let x, y ∈ Rn and consider f restricted to the line passing through them, i.e., the function defined by g(t) = f (ty + (1 − t)x), so g ′ (t) = ∇f (ty + (1 − t)x)T (y − x). First assume f is convex, which implies g is convex, so by the argument above we have g(1) ≥ g(0) + g ′ (0), which means f (y) ≥ f (x) + ∇f (x)T (y − x). Now assume that this inequality holds for any x and y, so if ty + (1 − t)x ∈ dom f and t̃y + (1 − t̃)x ∈ dom f , we have f (ty + (1 − t)x) ≥ f (t̃y + (1 − t̃)x) + ∇f (t̃y + (1 − t̃)x)T (y − x)(t − t̃), i.e., g(t) ≥ g(t̃) + g ′ (t̃)(t − t̃). We have seen that this implies that g is convex. 3.1 3.1.4 Basic properties and examples Second-order conditions We now assume that f is twice differentiable, that is, its Hessian or second derivative ∇2 f exists at each point in dom f , which is open. Then f is convex if and only if dom f is convex and its Hessian is positive semidefinite: for all x ∈ dom f , ∇2 f (x) 0. For a function on R, this reduces to the simple condition f ′′ (x) ≥ 0 (and dom f convex, i.e., an interval), which means that the derivative is nondecreasing. The condition ∇2 f (x) 0 can be interpreted geometrically as the requirement that the graph of the function have positive (upward) curvature at x. We leave the proof of the second-order condition as an exercise (exercise 3.8). Similarly, f is concave if and only if dom f is convex and ∇2 f (x) 0 for all x ∈ dom f . Strict convexity can be partially characterized by second-order conditions. If ∇2 f (x) ≻ 0 for all x ∈ dom f , then f is strictly convex. The converse, however, is not true: for example, the function f : R → R given by f (x) = x4 is strictly convex but has zero second derivative at x = 0. Example 3.2 Quadratic functions. Consider the quadratic function f : Rn → R, with dom f = Rn , given by f (x) = (1/2)xT P x + q T x + r, with P ∈ Sn , q ∈ Rn , and r ∈ R. Since ∇2 f (x) = P for all x, f is convex if and only if P 0 (and concave if and only if P 0). For quadratic functions, strict convexity is easily characterized: f is strictly convex if and only if P ≻ 0 (and strictly concave if and only if P ≺ 0). Remark 3.1 The separate requirement that dom f be convex cannot be dropped from the first- or second-order characterizations of convexity and concavity. For example, the function f (x) = 1/x2 , with dom f = {x ∈ R | x 6= 0}, satisfies f ′′ (x) > 0 for all x ∈ dom f , but is not a convex function. 3.1.5 Examples We have already mentioned that all linear and affine functions are convex (and concave), and have described the convex and concave quadratic functions. In this section we give a few more examples of convex and concave functions. We start with some functions on R, with variable x. • Exponential. eax is convex on R, for any a ∈ R. • Powers. xa is convex on R++ when a ≥ 1 or a ≤ 0, and concave for 0 ≤ a ≤ 1. • Powers of absolute value. |x|p , for p ≥ 1, is convex on R. • Logarithm. log x is concave on R++ . 71 72 3 Convex functions f (x, y) 2 1 0 2 2 1 0 y 0 −2 x Figure 3.3 Graph of f (x, y) = x2 /y. • Negative entropy. x log x (either on R++ , or on R+ , defined as 0 for x = 0) is convex. Convexity or concavity of these examples can be shown by verifying the basic inequality (3.1), or by checking that the second derivative is nonnegative or nonpositive. For example, with f (x) = x log x we have f ′ (x) = log x + 1, f ′′ (x) = 1/x, so that f ′′ (x) > 0 for x > 0. This shows that the negative entropy function is (strictly) convex. We now give a few interesting examples of functions on Rn . • Norms. Every norm on Rn is convex. • Max function. f (x) = max{x1 , . . . , xn } is convex on Rn . • Quadratic-over-linear function. The function f (x, y) = x2 /y, with dom f = R × R++ = {(x, y) ∈ R2 | y > 0}, is convex (figure 3.3). • Log-sum-exp. The function f (x) = log (ex1 + · · · + exn ) is convex on Rn . This function can be interpreted as a differentiable (in fact, analytic) approximation of the max function, since max{x1 , . . . , xn } ≤ f (x) ≤ max{x1 , . . . , xn } + log n for all x. (The second inequality is tight when all components of x are equal.) Figure 3.4 shows f for n = 2. 3.1 Basic properties and examples 73 4 f (x, y) 2 0 −2 2 0 y −2 −2 2 0 x Figure 3.4 Graph of f (x, y) = log(ex + ey ). • Geometric mean. The geometric mean f (x) = ( dom f = Rn++ . Qn i=1 xi ) 1/n is concave on • Log-determinant. The function f (X) = log det X is concave on dom f = Sn++ . Convexity (or concavity) of these examples can be verified in several ways, such as directly verifying the inequality (3.1), verifying that the Hessian is positive semidefinite, or restricting the function to an arbitrary line and verifying convexity of the resulting function of one variable. Norms. If f : Rn → R is a norm, and 0 ≤ θ ≤ 1, then f (θx + (1 − θ)y) ≤ f (θx) + f ((1 − θ)y) = θf (x) + (1 − θ)f (y). The inequality follows from the triangle inequality, and the equality follows from homogeneity of a norm. Max function. The function f (x) = maxi xi satisfies, for 0 ≤ θ ≤ 1, f (θx + (1 − θ)y) = max(θxi + (1 − θ)yi ) i ≤ θ max xi + (1 − θ) max yi i i = θf (x) + (1 − θ)f (y). Quadratic-over-linear function. To show that the quadratic-over-linear function f (x, y) = x2 /y is convex, we note that (for y > 0), 2 T 2 2 y −xy y y 2 ∇ f (x, y) = 3 = 3 0. −xy x2 −x −x y y 74 3 Convex functions Log-sum-exp. The Hessian of the log-sum-exp function is ∇2 f (x) = 1 (1T z)2 (1T z) diag(z) − zz T , where z = (ex1 , . . . , exn ). To verify that ∇2 f (x) 0 we must show that for all v, v T ∇2 f (x)v ≥ 0, i.e., ! n ! !2 n n X X X 1 zi v T ∇2 f (x)v = T 2 vi2 zi − vi zi ≥ 0. (1 z) i=1 i=1 i=1 But this follows from the Cauchy-Schwarz inequality (aT a)(bT b) ≥ (aT b)2 applied √ √ to the vectors with components ai = vi zi , bi = zi . Geometric mean. In a similar way we can show that the geometric mean f (x) = Qn 1/n ( i=1 xi ) is concave on dom f = Rn++ . Its Hessian ∇2 f (x) is given by Qn Qn 1/n 1/n ( i=1 xi ) ( i=1 xi ) ∂ 2 f (x) ∂ 2 f (x) = −(n − 1) , = for k 6= l, ∂x2k n2 x2k ∂xk ∂xl n2 xk xl and can be expressed as 2 ∇ f (x) = − Qn 1/n i=1 xi n2 n diag(1/x21 , . . . , 1/x2n ) − qq T where qi = 1/xi . We must show that ∇2 f (x) 0, i.e., that !2 Qn n n 1/n X X x vi2 /x2i − vi /xi ≤ 0 v T ∇2 f (x)v = − i=12 i n n i=1 i=1 for all v. Again this follows from the Cauchy-Schwarz inequality (aT a)(bT b) ≥ (aT b)2 , applied to the vectors a = 1 and bi = vi /xi . Log-determinant. For the function f (X) = log det X, we can verify concavity by considering an arbitrary line, given by X = Z + tV , where Z, V ∈ Sn . We define g(t) = f (Z + tV ), and restrict g to the interval of values of t for which Z + tV ≻ 0. Without loss of generality, we can assume that t = 0 is inside this interval, i.e., Z ≻ 0. We have g(t) = log det(Z + tV ) = log det(Z 1/2 (I + tZ −1/2 V Z −1/2 )Z 1/2 ) n X log(1 + tλi ) + log det Z = i=1 where λ1 , . . . , λn are the eigenvalues of Z −1/2 V Z −1/2 . Therefore we have n X λi g (t) = , 1 + tλi i=1 ′ ′′ g (t) = − Since g ′′ (t) ≤ 0, we conclude that f is concave. n X λ2i . (1 + tλi )2 i=1 3.1 3.1.6 Basic properties and examples 75 Sublevel sets The α-sublevel set of a function f : Rn → R is defined as Cα = {x ∈ dom f | f (x) ≤ α}. Sublevel sets of a convex function are convex, for any value of α. The proof is immediate from the definition of convexity: if x, y ∈ Cα , then f (x) ≤ α and f (y) ≤ α, and so f (θx + (1 − θ)y) ≤ α for 0 ≤ θ ≤ 1, and hence θx + (1 − θ)y ∈ Cα . The converse is not true: a function can have all its sublevel sets convex, but not be a convex function. For example, f (x) = −ex is not convex on R (indeed, it is strictly concave) but all its sublevel sets are convex. If f is concave, then its α-superlevel set, given by {x ∈ dom f | f (x) ≥ α}, is a convex set. The sublevel set property is often a good way to establish convexity of a set, by expressing it as a sublevel set of a convex function, or as the superlevel set of a concave function. Example 3.3 The geometric and arithmetic means of x ∈ Rn + are, respectively, G(x) = n Y i=1 1/n xi !1/n n , A(x) = 1X xi , n i=1 (where we take 0 = 0 in our definition of G). The arithmetic-geometric mean inequality states that G(x) ≤ A(x). Suppose 0 ≤ α ≤ 1, and consider the set {x ∈ Rn + | G(x) ≥ αA(x)}, i.e., the set of vectors with geometric mean at least as large as a factor α times the arithmetic mean. This set is convex, since it is the 0-superlevel set of the function G(x) − αA(x), which is concave. In fact, the set is positively homogeneous, so it is a convex cone. 3.1.7 Epigraph The graph of a function f : Rn → R is defined as n+1 which is a subset of R {(x, f (x)) | x ∈ dom f }, . The epigraph of a function f : Rn → R is defined as epi f = {(x, t) | x ∈ dom f, f (x) ≤ t}, which is a subset of Rn+1 . (‘Epi’ means ‘above’ so epigraph means ‘above the graph’.) The definition is illustrated in figure 3.5. The link between convex sets and convex functions is via the epigraph: A function is convex if and only if its epigraph is a convex set. A function is concave if and only if its hypograph, defined as hypo f = {(x, t) | t ≤ f (x)}, is a convex set. 76 3 Convex functions epi f f Figure 3.5 Epigraph of a function f , shown shaded. The lower boundary, shown darker, is the graph of f . Example 3.4 Matrix fractional function. The function f : Rn × Sn → R, defined as f (x, Y ) = xT Y −1 x is convex on dom f = Rn ×Sn ++ . (This generalizes the quadratic-over-linear function f (x, y) = x2 /y, with dom f = R × R++ .) One easy way to establish convexity of f is via its epigraph: epi f = = {(x, Y, t) | Y ≻ 0, xT Y −1 x ≤ t} (x, Y, t) Y xT x t 0, Y ≻ 0 , using the Schur complement condition for positive semidefiniteness of a block matrix (see §A.5.5). The last condition is a linear matrix inequality in (x, Y, t), and therefore epi f is convex. For the special case n = 1, the matrix fractional function reduces to the quadraticover-linear function x2 /y, and the associated LMI representation is y x x t 0, y>0 (the graph of which is shown in figure 3.3). Many results for convex functions can be proved (or interpreted) geometrically using epigraphs, and applying results for convex sets. As an example, consider the first-order condition for convexity: f (y) ≥ f (x) + ∇f (x)T (y − x), where f is convex and x, y ∈ dom f . We can interpret this basic inequality geometrically in terms of epi f . If (y, t) ∈ epi f , then t ≥ f (y) ≥ f (x) + ∇f (x)T (y − x). 3.1 Basic properties and examples 77 epi f (x, f (x)) (∇f (x), −1) Figure 3.6 For a differentiable convex function f , the vector (∇f (x), −1) defines a supporting hyperplane to the epigraph of f at x. We can express this as: (y, t) ∈ epi f =⇒ ∇f (x) −1 T y t − x f (x) ≤ 0. This means that the hyperplane defined by (∇f (x), −1) supports epi f at the boundary point (x, f (x)); see figure 3.6. 3.1.8 Jensen’s inequality and extensions The basic inequality (3.1), i.e., f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y), is sometimes called Jensen’s inequality. It is easily extended to convex combinations of more than two points: If f is convex, x1 , . . . , xk ∈ dom f , and θ1 , . . . , θk ≥ 0 with θ1 + · · · + θk = 1, then f (θ1 x1 + · · · + θk xk ) ≤ θ1 f (x1 ) + · · · + θk f (xk ). As in the case of convex sets, the inequality extends to infinite R sums, integrals, and expected values. For example, if p(x) ≥ 0 on S ⊆ dom f , S p(x) dx = 1, then Z Z f (x)p(x) dx, p(x)x dx ≤ f S S provided the integrals exist. In the most general case we can take any probability measure with support in dom f . If x is a random variable such that x ∈ dom f with probability one, and f is convex, then we have f (E x) ≤ E f (x), (3.5) provided the expectations exist. We can recover the basic inequality (3.1) from this general form, by taking the random variable x to have support {x1 , x2 }, with 78 3 Convex functions prob(x = x1 ) = θ, prob(x = x2 ) = 1 − θ. Thus the inequality (3.5) characterizes convexity: If f is not convex, there is a random variable x, with x ∈ dom f with probability one, such that f (E x) > E f (x). All of these inequalities are now called Jensen’s inequality, even though the inequality studied by Jensen was the very simple one f (x) + f (y) x+y ≤ f . 2 2 Remark 3.2 We can interpret (3.5) as follows. Suppose x ∈ dom f ⊆ Rn and z is any zero mean random vector in Rn . Then we have E f (x + z) ≥ f (x). Thus, randomization or dithering (i.e., adding a zero mean random vector to the argument) cannot decrease the value of a convex function on average. 3.1.9 Inequalities Many famous inequalities can be derived by applying Jensen’s inequality to some appropriate convex function. (Indeed, convexity and Jensen’s inequality can be made the foundation of a theory of inequalities.) As a simple example, consider the arithmetic-geometric mean inequality: √ ab ≤ (a + b)/2 (3.6) for a, b ≥ 0. The function − log x is convex; Jensen’s inequality with θ = 1/2 yields a+b − log a − log b − log ≤ . 2 2 Taking the exponential of both sides yields (3.6). As a less trivial example we prove Hölder’s inequality: for p > 1, 1/p + 1/q = 1, and x, y ∈ Rn , !1/p n !1/q n n X X X xi yi ≤ |xi |p |yi |q . i=1 i=1 i=1 By convexity of − log x, and Jensen’s inequality with general θ, we obtain the more general arithmetic-geometric mean inequality aθ b1−θ ≤ θa + (1 − θ)b, valid for a, b ≥ 0 and 0 ≤ θ ≤ 1. Applying this with yields |xi |p , a = Pn p j=1 |xj | |x |p Pn i p j=1 |xj | !1/p |yi |q b = Pn , q j=1 |yj | |y |q Pn i q j=1 |yj | !1/q ≤ θ = 1/p, |x |p |yi |q Pn i P + . n p j=1 |xj |p q j=1 |yj |q Summing over i then yields Hölder’s inequality. 3.2 3.2 Operations that preserve convexity 79 Operations that preserve convexity In this section we describe some operations that preserve convexity or concavity of functions, or allow us to construct new convex and concave functions. We start with some simple operations such as addition, scaling, and pointwise supremum, and then describe some more sophisticated operations (some of which include the simple operations as special cases). 3.2.1 Nonnegative weighted sums Evidently if f is a convex function and α ≥ 0, then the function αf is convex. If f1 and f2 are both convex functions, then so is their sum f1 + f2 . Combining nonnegative scaling and addition, we see that the set of convex functions is itself a convex cone: a nonnegative weighted sum of convex functions, f = w1 f1 + · · · + wm fm , is convex. Similarly, a nonnegative weighted sum of concave functions is concave. A nonnegative, nonzero weighted sum of strictly convex (concave) functions is strictly convex (concave). These properties extend to infinite sums and integrals. For example if f (x, y) is convex in x for each y ∈ A, and w(y) ≥ 0 for each y ∈ A, then the function g defined as Z g(x) = w(y)f (x, y) dy A is convex in x (provided the integral exists). The fact that convexity is preserved under nonnegative scaling and addition is easily verified directly, or can be seen in terms of the associated epigraphs. For example, if w ≥ 0 and f is convex, we have epi(wf ) = I 0 0 w epi f, which is convex because the image of a convex set under a linear mapping is convex. 3.2.2 Composition with an affine mapping Suppose f : Rn → R, A ∈ Rn×m , and b ∈ Rn . Define g : Rm → R by g(x) = f (Ax + b), with dom g = {x | Ax + b ∈ dom f }. Then if f is convex, so is g; if f is concave, so is g. 80 3.2.3 3 Convex functions Pointwise maximum and supremum If f1 and f2 are convex functions then their pointwise maximum f , defined by f (x) = max{f1 (x), f2 (x)}, with dom f = dom f1 ∩ dom f2 , is also convex. This property is easily verified: if 0 ≤ θ ≤ 1 and x, y ∈ dom f , then f (θx + (1 − θ)y) = max{f1 (θx + (1 − θ)y), f2 (θx + (1 − θ)y)} ≤ max{θf1 (x) + (1 − θ)f1 (y), θf2 (x) + (1 − θ)f2 (y)} ≤ θ max{f1 (x), f2 (x)} + (1 − θ) max{f1 (y), f2 (y)} = θf (x) + (1 − θ)f (y), which establishes convexity of f . It is easily shown that if f1 , . . . , fm are convex, then their pointwise maximum f (x) = max{f1 (x), . . . , fm (x)} is also convex. Example 3.5 Piecewise-linear functions. The function f (x) = max{aT1 x + b1 , . . . , aTL x + bL } defines a piecewise-linear (or really, affine) function (with L or fewer regions). It is convex since it is the pointwise maximum of affine functions. The converse can also be shown: any piecewise-linear convex function with L or fewer regions can be expressed in this form. (See exercise 3.29.) Example 3.6 Sum of r largest components. For x ∈ Rn we denote by x[i] the ith largest component of x, i.e., x[1] ≥ x[2] ≥ · · · ≥ x[n] are the components of x sorted in nonincreasing order. Then the function f (x) = r X x[i] , i=1 i.e., the sum of the r largest elements of x, is a convex function. This can be seen by writing it as f (x) = r X i=1 x[i] = max{xi1 + · · · + xir | 1 ≤ i1 < i2 < · · · < ir ≤ n}, i.e., the maximum of all possible sums of r different components of x. Since it is the pointwise maximum of n!/(r!(n − r)!) linear functions, it is convex. As an extension it can be shown that the function w1 ≥ w2 ≥ · · · ≥ wr ≥ 0. (See exercise 3.19.) Pr i=1 wi x[i] is convex, provided 3.2 Operations that preserve convexity 81 The pointwise maximum property extends to the pointwise supremum over an infinite set of convex functions. If for each y ∈ A, f (x, y) is convex in x, then the function g, defined as g(x) = sup f (x, y) (3.7) y∈A is convex in x. Here the domain of g is dom g = {x | (x, y) ∈ dom f for all y ∈ A, sup f (x, y) < ∞}. y∈A Similarly, the pointwise infimum of a set of concave functions is a concave function. In terms of epigraphs, the pointwise supremum of functions corresponds to the intersection of epigraphs: with f , g, and A as defined in (3.7), we have \ epi g = epi f (·, y). y∈A Thus, the result follows from the fact that the intersection of a family of convex sets is convex. Example 3.7 Support function of a set. Let C ⊆ Rn , with C 6= ∅. The support function SC associated with the set C is defined as SC (x) = sup{xT y | y ∈ C} (and, naturally, dom SC = {x | supy∈C xT y < ∞}). For each y ∈ C, xT y is a linear function of x, so SC is the pointwise supremum of a family of linear functions, hence convex. Example 3.8 Distance to farthest point of a set. Let C ⊆ Rn . The distance (in any norm) to the farthest point of C, f (x) = sup kx − yk, y∈C is convex. To see this, note that for any y, the function kx − yk is convex in x. Since f is the pointwise supremum of a family of convex functions (indexed by y ∈ C), it is a convex function of x. Example 3.9 Least-squares cost as a function of weights. Let a1 , . . . P , an ∈ Rm . In a n weighted least-squares problem we minimize the objective function w (aT x − i=1 i i 2 m bi ) over x ∈ R . We refer to wi as weights, and allow negative wi (which opens the possibility that the objective function is unbounded below). We define the (optimal) weighted least-squares cost as g(w) = inf x with domain dom g = ( w inf x n X i=1 n X i=1 wi (aTi x − bi )2 , wi (aTi x − bi )2 > −∞ ) . 82 3 Convex functions Since g is the infimum of a family of linear functions of w (indexed by x ∈ Rm ), it is a concave function of w. We can derive an explicit expression for g, at least on part of its domain. Let W = diag(w), the diagonal matrix with elements w1 , . . . , wn , and let A ∈ Rn×m have rows aTi , so we have g(w) = inf (Ax − b)T W (Ax − b) = inf (xT AT W Ax − 2bT W Ax + bT W b). x x From this we see that if AT W A 6 0, the quadratic function is unbounded below in x, so g(w) = −∞, i.e., w 6∈ dom g. We can give a simple expression for g when AT W A ≻ 0 (which defines a strict linear matrix inequality), by analytically minimizing the quadratic function: g(w) = = bT W b − bT W A(AT W A)−1 AT W b n X i=1 wi b2i − n X wi2 b2i aTi n X wj aj aTj j=1 i=1 !−1 ai . Concavity of g from this expression is not immediately obvious (but does follow, for example, from convexity of the matrix fractional function; see example 3.4). Example 3.10 Maximum eigenvalue of a symmetric matrix. The function f (X) = λmax (X), with dom f = Sm , is convex. To see this, we express f as f (X) = sup{y T Xy | kyk2 = 1}, i.e., as the pointwise supremum of a family of linear functions of X (i.e., y T Xy) indexed by y ∈ Rm . Example 3.11 Norm of a matrix. Consider f (X) = kXk2 with dom f = Rp×q , where k · k2 denotes the spectral norm or maximum singular value. Convexity of f follows from f (X) = sup{uT Xv | kuk2 = 1, kvk2 = 1}, which shows it is the pointwise supremum of a family of linear functions of X. As a generalization suppose k · ka and k · kb are norms on Rp and Rq , respectively. The induced norm of a matrix X ∈ Rp×q is defined as kXvka . v6=0 kvkb kXka,b = sup (This reduces to the spectral norm when both norms are Euclidean.) The induced norm can be expressed as kXka,b = = sup{kXvka | kvkb = 1} sup{uT Xv | kuka∗ = 1, kvkb = 1}, where k · ka∗ is the dual norm of k · ka , and we use the fact that kzka = sup{uT z | kuka∗ = 1}. Since we have expressed kXka,b as a supremum of linear functions of X, it is a convex function. 3.2 Operations that preserve convexity 83 Representation as pointwise supremum of affine functions The examples above illustrate a good method for establishing convexity of a function: by expressing it as the pointwise supremum of a family of affine functions. Except for a technical condition, a converse holds: almost every convex function can be expressed as the pointwise supremum of a family of affine functions. For example, if f : Rn → R is convex, with dom f = Rn , then we have f (x) = sup{g(x) | g affine, g(z) ≤ f (z) for all z}. In other words, f is the pointwise supremum of the set of all affine global underestimators of it. We give the proof of this result below, and leave the case where dom f 6= Rn as an exercise (exercise 3.28). Suppose f is convex with dom f = Rn . The inequality f (x) ≥ sup{g(x) | g affine, g(z) ≤ f (z) for all z} is clear, since if g is any affine underestimator of f , we have g(x) ≤ f (x). To establish equality, we will show that for each x ∈ Rn , there is an affine function g, which is a global underestimator of f , and satisfies g(x) = f (x). The epigraph of f is, of course, a convex set. Hence we can find a supporting hyperplane to it at (x, f (x)), i.e., a ∈ Rn and b ∈ R with (a, b) 6= 0 and a b T x−z f (x) − t ≤0 for all (z, t) ∈ epi f . This means that aT (x − z) + b(f (x) − f (z) − s) ≤ 0 (3.8) for all z ∈ dom f = Rn and all s ≥ 0 (since (z, t) ∈ epi f means t = f (z) + s for some s ≥ 0). For the inequality (3.8) to hold for all s ≥ 0, we must have b ≥ 0. If b = 0, then the inequality (3.8) reduces to aT (x − z) ≤ 0 for all z ∈ Rn , which implies a = 0 and contradicts (a, b) 6= 0. We conclude that b > 0, i.e., that the supporting hyperplane is not vertical. Using the fact that b > 0 we rewrite (3.8) for s = 0 as g(z) = f (x) + (a/b)T (x − z) ≤ f (z) for all z. The function g is an affine underestimator of f , and satisfies g(x) = f (x). 3.2.4 Composition In this section we examine conditions on h : Rk → R and g : Rn → Rk that guarantee convexity or concavity of their composition f = h ◦ g : Rn → R, defined by f (x) = h(g(x)), dom f = {x ∈ dom g | g(x) ∈ dom h}. 84 3 Convex functions Scalar composition We first consider the case k = 1, so h : R → R and g : Rn → R. We can restrict ourselves to the case n = 1 (since convexity is determined by the behavior of a function on arbitrary lines that intersect its domain). To discover the composition rules, we start by assuming that h and g are twice differentiable, with dom g = dom h = R. In this case, convexity of f reduces to f ′′ ≥ 0 (meaning, f ′′ (x) ≥ 0 for all x ∈ R). The second derivative of the composition function f = h ◦ g is given by f ′′ (x) = h′′ (g(x))g ′ (x)2 + h′ (g(x))g ′′ (x). (3.9) Now suppose, for example, that g is convex (so g ′′ ≥ 0) and h is convex and nondecreasing (so h′′ ≥ 0 and h′ ≥ 0). It follows from (3.9) that f ′′ ≥ 0, i.e., f is convex. In a similar way, the expression (3.9) gives the results: f is convex if h is convex and nondecreasing, and g is convex, f is convex if h is convex and nonincreasing, and g is concave, f is concave if h is concave and nondecreasing, and g is concave, (3.10) f is concave if h is concave and nonincreasing, and g is convex. These statements are valid when the functions g and h are twice differentiable and have domains that are all of R. It turns out that very similar composition rules hold in the general case n > 1, without assuming differentiability of h and g, or that dom g = Rn and dom h = R: f is convex if h is convex, h̃ is nondecreasing, and g is convex, f is convex if h is convex, h̃ is nonincreasing, and g is concave, f is concave if h is concave, h̃ is nondecreasing, and g is concave, (3.11) f is concave if h is concave, h̃ is nonincreasing, and g is convex. Here h̃ denotes the extended-value extension of the function h, which assigns the value ∞ (−∞) to points not in dom h for h convex (concave). The only difference between these results, and the results in (3.10), is that we require that the extendedvalue extension function h̃ be nonincreasing or nondecreasing, on all of R. To understand what this means, suppose h is convex, so h̃ takes on the value ∞ outside dom h. To say that h̃ is nondecreasing means that for any x, y ∈ R, with x < y, we have h̃(x) ≤ h̃(y). In particular, this means that if y ∈ dom h, then x ∈ dom h. In other words, the domain of h extends infinitely in the negative direction; it is either R, or an interval of the form (−∞, a) or (−∞, a]. In a similar way, to say that h is convex and h̃ is nonincreasing means that h is nonincreasing and dom h extends infinitely in the positive direction. This is illustrated in figure 3.7. Example 3.12 Some simple examples will illustrate the conditions on h that appear in the composition theorems. • The function h(x) = log x, with dom h = R++ , is concave and satisfies h̃ nondecreasing. 3.2 Operations that preserve convexity 1 85 1 epi f 0 0 x epi f 1 0 0 x 1 Figure 3.7 Left. The function x2 , with domain R+ , is convex and nondecreasing on its domain, but its extended-value extension is not nondecreasing. Right. The function max{x, 0}2 , with domain R, is convex, and its extended-value extension is nondecreasing. • The function h(x) = x1/2 , with dom h = R+ , is concave and satisfies the condition h̃ nondecreasing. • The function h(x) = x3/2 , with dom h = R+ , is convex but does not satisfy the condition h̃ nondecreasing. For example, we have h̃(−1) = ∞, but h̃(1) = 1. • The function h(x) = x3/2 for x ≥ 0, and h(x) = 0 for x < 0, with dom h = R, is convex and does satisfy the condition h̃ nondecreasing. The composition results (3.11) can be proved directly, without assuming differentiability, or using the formula (3.9). As an example, we will prove the following composition theorem: if g is convex, h is convex, and h̃ is nondecreasing, then f = h ◦ g is convex. Assume that x, y ∈ dom f , and 0 ≤ θ ≤ 1. Since x, y ∈ dom f , we have that x, y ∈ dom g and g(x), g(y) ∈ dom h. Since dom g is convex, we conclude that θx + (1 − θ)y ∈ dom g, and from convexity of g, we have g(θx + (1 − θ)y) ≤ θg(x) + (1 − θ)g(y). (3.12) Since g(x), g(y) ∈ dom h, we conclude that θg(x) + (1 − θ)g(y) ∈ dom h, i.e., the righthand side of (3.12) is in dom h. Now we use the assumption that h̃ is nondecreasing, which means that its domain extends infinitely in the negative direction. Since the righthand side of (3.12) is in dom h, we conclude that the lefthand side, i.e., g(θx+(1−θ)y) ∈ dom h. This means that θx+(1−θ)y ∈ dom f . At this point, we have shown that dom f is convex. Now using the fact that h̃ is nondecreasing and the inequality (3.12), we get h(g(θx + (1 − θ)y)) ≤ h(θg(x) + (1 − θ)g(y)). (3.13) From convexity of h, we have h(θg(x) + (1 − θ)g(y)) ≤ θh(g(x)) + (1 − θ)h(g(y)). (3.14) 86 3 Convex functions Putting (3.13) and (3.14) together, we have h(g(θx + (1 − θ)y)) ≤ θh(g(x)) + (1 − θ)h(g(y)). which proves the composition theorem. Example 3.13 Simple composition results. • If g is convex then exp g(x) is convex. • If g is concave and positive, then log g(x) is concave. • If g is concave and positive, then 1/g(x) is convex. • If g is convex and nonnegative and p ≥ 1, then g(x)p is convex. • If g is convex then − log(−g(x)) is convex on {x | g(x) < 0}. Remark 3.3 The requirement that monotonicity hold for the extended-value extension h̃, and not just the function h, cannot be removed. For example, consider the function g(x) = x2 , with dom g = R, and h(x) = 0, with dom h = [1, 2]. Here g is convex, and h is convex and nondecreasing. But the function f = h ◦ g, given by √ √ f (x) = 0, dom f = [− 2, −1] ∪ [1, 2], is not convex, since its domain is not convex. Here, of course, the function h̃ is not nondecreasing. Vector composition We now turn to the more complicated case when k ≥ 1. Suppose f (x) = h(g(x)) = h(g1 (x), . . . , gk (x)), with h : Rk → R, gi : Rn → R. Again without loss of generality we can assume n = 1. As in the case k = 1, we start by assuming the functions are twice differentiable, with dom g = R and dom h = Rk , in order to discover the composition rules. We have f ′′ (x) = g ′ (x)T ∇2 h(g(x))g ′ (x) + ∇h(g(x))T g ′′ (x), (3.15) which is the vector analog of (3.9). Again the issue is to determine conditions under which f ′′ (x) ≥ 0 for all x (or f ′′ (x) ≤ 0 for all x for concavity). From (3.15) we can derive many rules, for example: f is convex if h is convex, h is nondecreasing in each argument, and gi are convex, f is convex if h is convex, h is nonincreasing in each argument, and gi are concave, f is concave if h is concave, h is nondecreasing in each argument, and gi are concave. 3.2 Operations that preserve convexity 87 As in the scalar case, similar composition results hold in general, with n > 1, no assumption of differentiability of h or g, and general domains. For the general results, the monotonicity condition on h must hold for the extended-value extension h̃. To understand the meaning of the condition that the extended-value extension h̃ be monotonic, we consider the case where h : Rk → R is convex, and h̃ nondecreasing, i.e., whenever u v, we have h̃(u) ≤ h̃(v). This implies that if v ∈ dom h, then so is u: the domain of h must extend infinitely in the −Rk+ directions. We can express this compactly as dom h − Rk+ = dom h. Example 3.14 Vector composition examples. • Let h(z) = z[1] + · · · + z[r] , the sum of the r largest components of z ∈ Rk . Then h is convex and nondecreasing in each argument. Suppose g1 , . . . , gk are convex functions on Rn . Then the composition function f = h ◦ g, i.e., the pointwise sum of the r largest gi ’s, is convex. Pk • The function h(z) = log( i=1 ezi ) is convex and nondecreasing in each arguPk ment, so log( i=1 egi ) is convex whenever gi are. Pk • For 0 < p ≤ 1, the function h(z) = ( i=1 zip )1/p on Rk+ is concave, and its extension (which has the value −∞ for z 6 0) is nondecreasing in each component. So if gi are concave and nonnegative, we conclude that f (x) = Pk ( i=1 gi (x)p )1/p is concave. • Suppose p ≥ 1, and g1 , . . . , gk are convex and nonnegative. Then the function Pk ( i=1 gi (x)p )1/p is convex. To show this, we consider the function h : Rk → R defined as h(z) = k X i=1 p max{zi , 0} !1/p , with dom h = Rk , so h = h̃. This function is convex, and nondecreasing, so we conclude h(g(x)) is a convex function of x. For z 0, we have h(z) = Pk Pk ( i=1 zip )1/p , so our conclusion is that ( i=1 gi (x)p )1/p is convex. Qk • The geometric mean h(z) = ( i=1 zi )1/k on Rk+ is concave and its extension is nondecreasing in each argument. It follows that if Q g1 , . . . , gk are nonnegative k concave functions, then so is their geometric mean, ( i=1 gi )1/k . 3.2.5 Minimization We have seen that the maximum or supremum of an arbitrary family of convex functions is convex. It turns out that some special forms of minimization also yield convex functions. If f is convex in (x, y), and C is a convex nonempty set, then the function g(x) = inf f (x, y) (3.16) y∈C 88 3 Convex functions is convex in x, provided g(x) > −∞ for some x (which implies g(x) > −∞ for all x). The domain of g is the projection of dom f on its x-coordinates, i.e., dom g = {x | (x, y) ∈ dom f for some y ∈ C}. We prove this by verifying Jensen’s inequality for x1 , x2 ∈ dom g. Let ǫ > 0. Then there are y1 , y2 ∈ C such that f (xi , yi ) ≤ g(xi ) + ǫ for i = 1, 2. Now let θ ∈ [0, 1]. We have g(θx1 + (1 − θ)x2 ) = inf f (θx1 + (1 − θ)x2 , y) y∈C ≤ f (θx1 + (1 − θ)x2 , θy1 + (1 − θ)y2 ) ≤ θf (x1 , y1 ) + (1 − θ)f (x2 , y2 ) ≤ θg(x1 ) + (1 − θ)g(x2 ) + ǫ. Since this holds for any ǫ > 0, we have g(θx1 + (1 − θ)x2 ) ≤ θg(x1 ) + (1 − θ)g(x2 ). The result can also be seen in terms of epigraphs. With f , g, and C defined as in (3.16), and assuming the infimum over y ∈ C is attained for each x, we have epi g = {(x, t) | (x, y, t) ∈ epi f for some y ∈ C}. Thus epi g is convex, since it is the projection of a convex set on some of its components. Example 3.15 Schur complement. Suppose the quadratic function f (x, y) = xT Ax + 2xT By + y T Cy, (where A and C are symmetric) is convex in (x, y), which means A BT B C 0. We can express g(x) = inf y f (x, y) as g(x) = xT (A − BC † B T )x, where C † is the pseudo-inverse of C (see §A.5.4). By the minimization rule, g is convex, so we conclude that A − BC † B T 0. If C is invertible, i.e., C ≻ 0, then the matrix A − BC −1 B T is called the Schur complement of C in the matrix A B BT C (see §A.5.5). Example 3.16 Distance to a set. The distance of a point x to a set S ⊆ Rn , in the norm k · k, is defined as dist(x, S) = inf kx − yk. y∈S The function kx−yk is convex in (x, y), so if the set S is convex, the distance function dist(x, S) is a convex function of x. 3.2 Operations that preserve convexity 89 Example 3.17 Suppose h is convex. Then the function g defined as g(x) = inf{h(y) | Ay = x} is convex. To see this, we define f by f (x, y) = h(y) ∞ if Ay = x otherwise, which is convex in (x, y). Then g is the minimum of f over y, and hence is convex. (It is not hard to show directly that g is convex.) 3.2.6 Perspective of a function If f : Rn → R, then the perspective of f is the function g : Rn+1 → R defined by g(x, t) = tf (x/t), with domain dom g = {(x, t) | x/t ∈ dom f, t > 0}. The perspective operation preserves convexity: If f is a convex function, then so is its perspective function g. Similarly, if f is concave, then so is g. This can be proved several ways, for example, direct verification of the defining inequality (see exercise 3.33). We give a short proof here using epigraphs and the perspective mapping on Rn+1 described in §2.3.3 (which will also explain the name ‘perspective’). For t > 0 we have (x, t, s) ∈ epi g ⇐⇒ tf (x/t) ≤ s ⇐⇒ f (x/t) ≤ s/t ⇐⇒ (x/t, s/t) ∈ epi f. Therefore epi g is the inverse image of epi f under the perspective mapping that takes (u, v, w) to (u, w)/v. It follows (see §2.3.3) that epi g is convex, so the function g is convex. Example 3.18 Euclidean norm squared. The perspective of the convex function f (x) = xT x on Rn is xT x g(x, t) = t(x/t)T (x/t) = , t which is convex in (x, t) for t > 0. We can deduce convexity of g using several other methods. First, we can express g as the sum of the quadratic-over-linear functions x2i /t, which were shown to be convex in §3.1.5. We can also express g as a special case of the matrix fractional function xT (tI)−1 x (see example 3.4). 90 3 Convex functions Example 3.19 Negative logarithm. Consider the convex function f (x) = − log x on R++ . Its perspective is g(x, t) = −t log(x/t) = t log(t/x) = t log t − t log x, and is convex on R2++ . The function g is called the relative entropy of t and x. For x = 1, g reduces to the negative entropy function. From convexity of g we can establish convexity or concavity of several interesting related functions. First, the relative entropy of two vectors u, v ∈ Rn ++ , defined as n X ui log(ui /vi ), i=1 is convex in (u, v), since it is a sum of relative entropies of ui , vi . A closely related function is the Kullback-Leibler divergence between u, v ∈ Rn ++ , given by Dkl (u, v) = n X i=1 (ui log(ui /vi ) − ui + vi ) , (3.17) which is convex, since it is the relative entropy plus a linear function of (u, v). The Kullback-Leibler divergence satisfies Dkl (u, v) ≥ 0, and Dkl (u, v) = 0 if and only if u = v, and so can be used as a measure of deviation between two positive vectors; see exercise 3.13. (Note that the relative entropy and the Kullback-Leibler divergence are the same when u and v are probability vectors, i.e., satisfy 1T u = 1T v = 1.) If we take vi = 1T u in the relative entropy function, we obtain the concave (and homogeneous) function of u ∈ Rn ++ given by n X ui log(1T u/ui ) = (1T u) i=1 n X zi log(1/zi ), i=1 where z = u/(1T u), which is called the normalized entropy function. The vector z = u/1T u is a normalized vector or probability distribution, since its components sum to one; the normalized entropy of u is 1T u times the entropy of this normalized distribution. Example 3.20 Suppose f : Rm → R is convex, and A ∈ Rm×n , b ∈ Rm , c ∈ Rn , and d ∈ R. We define g(x) = (cT x + d)f (Ax + b)/(cT x + d) , with dom g = {x | cT x + d > 0, (Ax + b)/(cT x + d) ∈ dom f }. Then g is convex. 3.3 The conjugate function In this section we introduce an operation that will play an important role in later chapters. 3.3 The conjugate function 91 f (x) xy x (0, −f ∗ (y)) Figure 3.8 A function f : R → R, and a value y ∈ R. The conjugate function f ∗ (y) is the maximum gap between the linear function yx and f (x), as shown by the dashed line in the figure. If f is differentiable, this occurs at a point x where f ′ (x) = y. 3.3.1 Definition and examples Let f : Rn → R. The function f ∗ : Rn → R, defined as y T x − f (x) , f ∗ (y) = sup (3.18) x∈dom f is called the conjugate of the function f . The domain of the conjugate function consists of y ∈ Rn for which the supremum is finite, i.e., for which the difference y T x − f (x) is bounded above on dom f . This definition is illustrated in figure 3.8. We see immediately that f ∗ is a convex function, since it is the pointwise supremum of a family of convex (indeed, affine) functions of y. This is true whether or not f is convex. (Note that when f is convex, the subscript x ∈ dom f is not necessary since, by convention, y T x − f (x) = −∞ for x 6∈ dom f .) We start with some simple examples, and then describe some rules for conjugating functions. This allows us to derive an analytical expression for the conjugate of many common convex functions. Example 3.21 We derive the conjugates of some convex functions on R. • Affine function. f (x) = ax + b. As a function of x, yx − ax − b is bounded if and only if y = a, in which case it is constant. Therefore the domain of the conjugate function f ∗ is the singleton {a}, and f ∗ (a) = −b. • Negative logarithm. f (x) = − log x, with dom f = R++ . The function xy+log x is unbounded above if y ≥ 0 and reaches its maximum at x = −1/y otherwise. Therefore, dom f ∗ = {y | y < 0} = −R++ and f ∗ (y) = − log(−y)−1 for y < 0. • Exponential. f (x) = ex . xy − ex is unbounded if y < 0. For y > 0, xy − ex reaches its maximum at x = log y, so we have f ∗ (y) = y log y − y. For y = 0, 92 3 Convex functions f ∗ (y) = supx −ex = 0. In summary, dom f ∗ = R+ and f ∗ (y) = y log y − y (with the interpretation 0 log 0 = 0). • Negative entropy. f (x) = x log x, with dom f = R+ (and f (0) = 0). The function xy − x log x is bounded above on R+ for all y, hence dom f ∗ = R. It attains its maximum at x = ey−1 , and substituting we find f ∗ (y) = ey−1 . • Inverse. f (x) = 1/x on R++ . For y > 0, yx − 1/x is unbounded above. For y = 0 this function has supremum 0; for y < 0 the supremum is attained at x = (−y)−1/2 . Therefore we have f ∗ (y) = −2(−y)1/2 , with dom f ∗ = −R+ . Example 3.22 Strictly convex quadratic function. Consider f (x) = 12 xT Qx, with T 1 T Q ∈ Sn ++ . The function y x − 2 x Qx is bounded above as a function of x for all y. It attains its maximum at x = Q−1 y, so f ∗ (y) = 1 T −1 y Q y. 2 Example 3.23 Log-determinant. We consider f (X) = log det X −1 on Sn ++ . The conjugate function is defined as f ∗ (Y ) = sup (tr(Y X) + log det X) , X≻0 since tr(Y X) is the standard inner product on Sn . We first show that tr(Y X) + log det X is unbounded above unless Y ≺ 0. If Y 6≺ 0, then Y has an eigenvector v, with kvk2 = 1, and eigenvalue λ ≥ 0. Taking X = I + tvv T we find that tr(Y X) + log det X = tr Y + tλ + log det(I + tvv T ) = tr Y + tλ + log(1 + t), which is unbounded above as t → ∞. Now consider the case Y ≺ 0. We can find the maximizing X by setting the gradient with respect to X equal to zero: ∇X (tr(Y X) + log det X) = Y + X −1 = 0 (see §A.4.1), which yields X = −Y −1 (which is, indeed, positive definite). Therefore we have f ∗ (Y ) = log det(−Y )−1 − n, with dom f ∗ = −Sn ++ . Example 3.24 Indicator function. Let IS be the indicator function of a (not necessarily convex) set S ⊆ Rn , i.e., IS (x) = 0 on dom IS = S. Its conjugate is IS∗ (y) = sup y T x, x∈S which is the support function of the set S. 3.3 The conjugate function 93 Example 3.25 Log-sum-exp function. To derive the conjugate of the log-sum-exp Pn function f (x) = log( i=1 exi ), we first determine the values of y for which the maximum over x of y T x − f (x) is attained. By setting the gradient with respect to x equal to zero, we obtain the condition exi , yi = Pn exj j=1 i = 1, . . . , n. These equations are solvable for x if and only if y ≻ 0 P and 1T y = 1. By substituting n T ∗ the expression for yi into y x−f (x) we obtain f (y) = i=1 yi log yi . This expression ∗ for f is still correct if some components of y are zero, as long as y 0 and 1T y = 1, and we interpret 0 log 0 as 0. In fact the domain of f ∗ is exactly given by 1T y = 1, y 0. To show this, suppose that a component of y is negative, say, yk < 0. Then we can show that y T x − f (x) is unbounded above by choosing xk = −t, and xi = 0, i 6= k, and letting t go to infinity. If y 0 but 1T y 6= 1, we choose x = t1, so that y T x − f (x) = t1T y − t − log n. If 1T y > 1, this grows unboundedly as t → ∞; if 1T y < 1, it grows unboundedly as t → −∞. In summary, ∗ f (y) = Pn ∞ i=1 yi log yi if y 0 and 1T y = 1 otherwise. In other words, the conjugate of the log-sum-exp function is the negative entropy function, restricted to the probability simplex. Example 3.26 Norm. Let k · k be a norm on Rn , with dual norm k · k∗ . We will show that the conjugate of f (x) = kxk is f ∗ (y) = 0 ∞ kyk∗ ≤ 1 otherwise, i.e., the conjugate of a norm is the indicator function of the dual norm unit ball. If kyk∗ > 1, then by definition of the dual norm, there is a z ∈ Rn with kzk ≤ 1 and y T z > 1. Taking x = tz and letting t → ∞, we have y T x − kxk = t(y T z − kzk) → ∞, which shows that f ∗ (y) = ∞. Conversely, if kyk∗ ≤ 1, then we have y T x ≤ kxkkyk∗ for all x, which implies for all x, y T x − kxk ≤ 0. Therefore x = 0 is the value that maximizes y T x − kxk, with maximum value 0. Example 3.27 Norm squared. Now consider the function f (x) = (1/2)kxk2 , where k·k is a norm, with dual norm k·k∗ . We will show that its conjugate is f ∗ (y) = (1/2)kyk2∗ . From y T x ≤ kyk∗ kxk, we conclude y T x − (1/2)kxk2 ≤ kyk∗ kxk − (1/2)kxk2 94 3 Convex functions for all x. The righthand side is a quadratic function of kxk, which has maximum value (1/2)kyk2∗ . Therefore for all x, we have y T x − (1/2)kxk2 ≤ (1/2)kyk2∗ , which shows that f ∗ (y) ≤ (1/2)kyk2∗ . To show the other inequality, let x be any vector with y T x = kyk∗ kxk, scaled so that kxk = kyk∗ . Then we have, for this x, y T x − (1/2)kxk2 = (1/2)kyk2∗ , which shows that f ∗ (y) ≥ (1/2)kyk2∗ . Example 3.28 Revenue and profit functions. We consider a business or enterprise that consumes n resources and produces a product that can be sold. We let r = (r1 , . . . , rn ) denote the vector of resource quantities consumed, and S(r) denote the sales revenue derived from the product produced (as a function of the resources consumed). Now let pi denote the price (per unit) of resource i, so the total amount paid for resources by the enterprise is pT r. The profit derived by the firm is then S(r) − pT r. Let us fix the prices of the resources, and ask what is the maximum profit that can be made, by wisely choosing the quantities of resources consumed. This maximum profit is given by M (p) = sup S(r) − pT r . r The function M (p) gives the maximum profit attainable, as a function of the resource prices. In terms of conjugate functions, we can express M as M (p) = (−S)∗ (−p). Thus the maximum profit (as a function of resource prices) is closely related to the conjugate of gross sales (as a function of resources consumed). 3.3.2 Basic properties Fenchel’s inequality From the definition of conjugate function, we immediately obtain the inequality f (x) + f ∗ (y) ≥ xT y for all x, y. This is called Fenchel’s inequality (or Young’s inequality when f is differentiable). For example with f (x) = (1/2)xT Qx, where Q ∈ Sn++ , we obtain the inequality xT y ≤ (1/2)xT Qx + (1/2)y T Q−1 y. Conjugate of the conjugate The examples above, and the name ‘conjugate’, suggest that the conjugate of the conjugate of a convex function is the original function. This is the case provided a technical condition holds: if f is convex, and f is closed (i.e., epi f is a closed set; see §A.3.3), then f ∗∗ = f . For example, if dom f = Rn , then we have f ∗∗ = f , i.e., the conjugate of the conjugate of f is f again (see exercise 3.39). 3.4 Quasiconvex functions Differentiable functions The conjugate of a differentiable function f is also called the Legendre transform of f . (To distinguish the general definition from the differentiable case, the term Fenchel conjugate is sometimes used instead of conjugate.) Suppose f is convex and differentiable, with dom f = Rn . Any maximizer x∗ of y T x − f (x) satisfies y = ∇f (x∗ ), and conversely, if x∗ satisfies y = ∇f (x∗ ), then x∗ maximizes y T x − f (x). Therefore, if y = ∇f (x∗ ), we have f ∗ (y) = x∗T ∇f (x∗ ) − f (x∗ ). This allows us to determine f ∗ (y) for any y for which we can solve the gradient equation y = ∇f (z) for z. We can express this another way. Let z ∈ Rn be arbitrary and define y = ∇f (z). Then we have f ∗ (y) = z T ∇f (z) − f (z). Scaling and composition with affine transformation For a > 0 and b ∈ R, the conjugate of g(x) = af (x) + b is g ∗ (y) = af ∗ (y/a) − b. Suppose A ∈ Rn×n is nonsingular and b ∈ Rn . Then the conjugate of g(x) = f (Ax + b) is g ∗ (y) = f ∗ (A−T y) − bT A−T y, with dom g ∗ = AT dom f ∗ . Sums of independent functions If f (u, v) = f1 (u) + f2 (v), where f1 and f2 are convex functions with conjugates f1∗ and f2∗ , respectively, then f ∗ (w, z) = f1∗ (w) + f2∗ (z). In other words, the conjugate of the sum of independent convex functions is the sum of the conjugates. (‘Independent’ means they are functions of different variables.) 3.4 Quasiconvex functions 3.4.1 Definition and examples A function f : Rn → R is called quasiconvex (or unimodal ) if its domain and all its sublevel sets Sα = {x ∈ dom f | f (x) ≤ α}, for α ∈ R, are convex. A function is quasiconcave if −f is quasiconvex, i.e., every superlevel set {x | f (x) ≥ α} is convex. A function that is both quasiconvex and quasiconcave is called quasilinear. If a function f is quasilinear, then its domain, and every level set {x | f (x) = α} is convex. 95 96 3 Convex functions β α a b c Figure 3.9 A quasiconvex function on R. For each α, the α-sublevel set Sα is convex, i.e., an interval. The sublevel set Sα is the interval [a, b]. The sublevel set Sβ is the interval (−∞, c]. For a function on R, quasiconvexity requires that each sublevel set be an interval (including, possibly, an infinite interval). An example of a quasiconvex function on R is shown in figure 3.9. Convex functions have convex sublevel sets, and so are quasiconvex. But simple examples, such as the one shown in figure 3.9, show that the converse is not true. Example 3.29 Some examples on R: • Logarithm. log x on R++ is quasiconvex (and quasiconcave, hence quasilinear). • Ceiling function. ceil(x) = inf{z ∈ Z | z ≥ x} is quasiconvex (and quasiconcave). These examples show that quasiconvex functions can be concave, or discontinuous. We now give some examples on Rn . Example 3.30 Length of a vector. We define the length of x ∈ Rn as the largest index of a nonzero component, i.e., f (x) = max{i | xi 6= 0}. (We define the length of the zero vector to be zero.) This function is quasiconvex on Rn , since its sublevel sets are subspaces: f (x) ≤ α ⇐⇒ xi = 0 for i = ⌊α⌋ + 1, . . . , n. Example 3.31 Consider f : R2 → R, with dom f = R2+ and f (x1 , x2 ) = x1 x2 . This function is neither convex nor concave since its Hessian 2 ∇ f (x) = 0 1 1 0 3.4 Quasiconvex functions 97 is indefinite; it has one positive and one negative eigenvalue. The function f is quasiconcave, however, since the superlevel sets {x ∈ R2+ | x1 x2 ≥ α} are convex sets for all α. (Note, however, that f is not quasiconcave on R2 .) Example 3.32 Linear-fractional function. The function f (x) = aT x + b , cT x + d with dom f = {x | cT x + d > 0}, is quasiconvex, and quasiconcave, i.e., quasilinear. Its α-sublevel set is Sα = = {x | cT x + d > 0, (aT x + b)/(cT x + d) ≤ α} {x | cT x + d > 0, aT x + b ≤ α(cT x + d)}, which is convex, since it is the intersection of an open halfspace and a closed halfspace. (The same method can be used to show its superlevel sets are convex.) Example 3.33 Distance ratio function. Suppose a, b ∈ Rn , and define f (x) = kx − ak2 , kx − bk2 i.e., the ratio of the Euclidean distance to a to the distance to b. Then f is quasiconvex on the halfspace {x | kx − ak2 ≤ kx − bk2 }. To see this, we consider the α-sublevel set of f , with α ≤ 1 since f (x) ≤ 1 on the halfspace {x | kx − ak2 ≤ kx − bk2 }. This sublevel set is the set of points satisfying kx − ak2 ≤ αkx − bk2 . Squaring both sides, and rearranging terms, we see that this is equivalent to (1 − α2 )xT x − 2(a − α2 b)T x + aT a − α2 bT b ≤ 0. This describes a convex set (in fact a Euclidean ball) if α ≤ 1. Example 3.34 Internal rate of return. Let x = (x0 , x1 , . . . , xn ) denote a cash flow sequence over n periods, where xi > 0 means a payment to us in period i, and xi < 0 means a payment by us in period i. We define the present value of a cash flow, with interest rate r ≥ 0, to be PV(x, r) = n X (1 + r)−i xi . i=0 (The factor (1 + r)−i is a discount factor for a payment by or to us in period i.) Now we consider cash flows for which x0 < 0 and x0 + x1 + · · · + xn > 0. This means that we start with an investment of |x0 | in period 0, and that the total of the 98 3 Convex functions remaining cash flow, x1 + · · · + xn , (not taking any discount factors into account) exceeds our initial investment. For such a cash flow, PV(x, 0) > 0 and PV(x, r) → x0 < 0 as r → ∞, so it follows that for at least one r ≥ 0, we have PV(x, r) = 0. We define the internal rate of return of the cash flow as the smallest interest rate r ≥ 0 for which the present value is zero: IRR(x) = inf{r ≥ 0 | PV(x, r) = 0}. Internal rate of return is a quasiconcave function of x (restricted to x0 < 0, x1 + · · · + xn > 0). To see this, we note that IRR(x) ≥ R ⇐⇒ PV(x, r) > 0 for 0 ≤ r < R. The lefthand side defines the R-superlevel set of IRR. The righthand side is the intersection of the sets {x | PV(x, r) > 0}, indexed by r, over the range 0 ≤ r < R. For each r, PV(x, r) > 0 defines an open halfspace, so the righthand side defines a convex set. 3.4.2 Basic properties The examples above show that quasiconvexity is a considerable generalization of convexity. Still, many of the properties of convex functions hold, or have analogs, for quasiconvex functions. For example, there is a variation on Jensen’s inequality that characterizes quasiconvexity: A function f is quasiconvex if and only if dom f is convex and for any x, y ∈ dom f and 0 ≤ θ ≤ 1, f (θx + (1 − θ)y) ≤ max{f (x), f (y)}, (3.19) i.e., the value of the function on a segment does not exceed the maximum of its values at the endpoints. The inequality (3.19) is sometimes called Jensen’s inequality for quasiconvex functions, and is illustrated in figure 3.10. Example 3.35 Cardinality of a nonnegative vector. The cardinality or size of a vector x ∈ Rn is the number of nonzero components, and denoted card(x). The n function card is quasiconcave on Rn + (but not R ). This follows immediately from the modified Jensen inequality card(x + y) ≥ min{card(x), card(y)}, which holds for x, y 0. Example 3.36 Rank of positive semidefinite matrix. The function rank X is quasiconcave on Sn + . This follows from the modified Jensen inequality (3.19), rank(X + Y ) ≥ min{rank X, rank Y } which holds for X, Y ∈ Sn + . (This can be considered an extension of the previous example, since rank(diag(x)) = card(x) for x 0.) 3.4 Quasiconvex functions 99 max{f (x), f (y)} (y, f (y)) (x, f (x)) Figure 3.10 A quasiconvex function on R. The value of f between x and y is no more than max{f (x), f (y)}. Like convexity, quasiconvexity is characterized by the behavior of a function f on lines: f is quasiconvex if and only if its restriction to any line intersecting its domain is quasiconvex. In particular, quasiconvexity of a function can be verified by restricting it to an arbitrary line, and then checking quasiconvexity of the resulting function on R. Quasiconvex functions on R We can give a simple characterization of quasiconvex functions on R. We consider continuous functions, since stating the conditions in the general case is cumbersome. A continuous function f : R → R is quasiconvex if and only if at least one of the following conditions holds: • f is nondecreasing • f is nonincreasing • there is a point c ∈ dom f such that for t ≤ c (and t ∈ dom f ), f is nonincreasing, and for t ≥ c (and t ∈ dom f ), f is nondecreasing. The point c can be chosen as any point which is a global minimizer of f . Figure 3.11 illustrates this. 3.4.3 Differentiable quasiconvex functions First-order conditions Suppose f : Rn → R is differentiable. Then f is quasiconvex if and only if dom f is convex and for all x, y ∈ dom f f (y) ≤ f (x) =⇒ ∇f (x)T (y − x) ≤ 0. (3.20) 100 3 c Convex functions t Figure 3.11 A quasiconvex function on R. The function is nonincreasing for t ≤ c and nondecreasing for t ≥ c. x ∇f (x) Figure 3.12 Three level curves of a quasiconvex function f are shown. The vector ∇f (x) defines a supporting hyperplane to the sublevel set {z | f (z) ≤ f (x)} at x. This is the analog of inequality (3.2), for quasiconvex functions. We leave the proof as an exercise (exercise 3.43). The condition (3.20) has a simple geometric interpretation when ∇f (x) 6= 0. It states that ∇f (x) defines a supporting hyperplane to the sublevel set {y | f (y) ≤ f (x)}, at the point x, as illustrated in figure 3.12. While the first-order condition for convexity (3.2), and the first-order condition for quasiconvexity (3.20) are similar, there are some important differences. For example, if f is convex and ∇f (x) = 0, then x is a global minimizer of f . But this statement is false for quasiconvex functions: it is possible that ∇f (x) = 0, but x is not a global minimizer of f . 3.4 Quasiconvex functions 101 Second-order conditions Now suppose f is twice differentiable. If f is quasiconvex, then for all x ∈ dom f , and all y ∈ Rn , we have y T ∇f (x) = 0 =⇒ y T ∇2 f (x)y ≥ 0. (3.21) For a quasiconvex function on R, this reduces to the simple condition f ′ (x) = 0 =⇒ f ′′ (x) ≥ 0, i.e., at any point with zero slope, the second derivative is nonnegative. For a quasiconvex function on Rn , the interpretation of the condition (3.21) is a bit more complicated. As in the case n = 1, we conclude that whenever ∇f (x) = 0, we must have ∇2 f (x) 0. When ∇f (x) 6= 0, the condition (3.21) means that ∇2 f (x) is positive semidefinite on the (n − 1)-dimensional subspace ∇f (x)⊥ . This implies that ∇2 f (x) can have at most one negative eigenvalue. As a (partial) converse, if f satisfies y T ∇f (x) = 0 =⇒ y T ∇2 f (x)y > 0 (3.22) for all x ∈ dom f and all y ∈ Rn , y 6= 0, then f is quasiconvex. This condition is the same as requiring ∇2 f (x) to be positive definite for any point with ∇f (x) = 0, and for all other points, requiring ∇2 f (x) to be positive definite on the (n − 1)dimensional subspace ∇f (x)⊥ . Proof of second-order conditions for quasiconvexity By restricting the function to an arbitrary line, it suffices to consider the case in which f : R → R. We first show that if f : R → R is quasiconvex on an interval (a, b), then it must satisfy (3.21), i.e., if f ′ (c) = 0 with c ∈ (a, b), then we must have f ′′ (c) ≥ 0. If f ′ (c) = 0 with c ∈ (a, b), f ′′ (c) < 0, then for small positive ǫ we have f (c−ǫ) < f (c) and f (c + ǫ) < f (c). It follows that the sublevel set {x | f (x) ≤ f (c) − ǫ} is disconnected for small positive ǫ, and therefore not convex, which contradicts our assumption that f is quasiconvex. Now we show that if the condition (3.22) holds, then f is quasiconvex. Assume that (3.22) holds, i.e., for each c ∈ (a, b) with f ′ (c) = 0, we have f ′′ (c) > 0. This means that whenever the function f ′ crosses the value 0, it is strictly increasing. Therefore it can cross the value 0 at most once. If f ′ does not cross the value 0 at all, then f is either nonincreasing or nondecreasing on (a, b), and therefore quasiconvex. Otherwise it must cross the value 0 exactly once, say at c ∈ (a, b). Since f ′′ (c) > 0, it follows that f ′ (t) ≤ 0 for a < t ≤ c, and f ′ (t) ≥ 0 for c ≤ t < b. This shows that f is quasiconvex. 3.4.4 Operations that preserve quasiconvexity Nonnegative weighted maximum A nonnegative weighted maximum of quasiconvex functions, i.e., f = max{w1 f1 , . . . , wm fm }, 102 3 Convex functions with wi ≥ 0 and fi quasiconvex, is quasiconvex. The property extends to the general pointwise supremum f (x) = sup (w(y)g(x, y)) y∈C where w(y) ≥ 0 and g(x, y) is quasiconvex in x for each y. This fact can be easily verified: f (x) ≤ α if and only if w(y)g(x, y) ≤ α for all y ∈ C, i.e., the α-sublevel set of f is the intersection of the α-sublevel sets of the functions w(y)g(x, y) in the variable x. Example 3.37 Generalized eigenvalue. The maximum generalized eigenvalue of a pair of symmetric matrices (X, Y ), with Y ≻ 0, is defined as uT Xu = sup{λ | det(λY − X) = 0}. T u6=0 u Y u λmax (X, Y ) = sup (See §A.5.3). This function is quasiconvex on dom f = Sn × Sn ++ . To see this we consider the expression uT Xu . T u6=0 u Y u λmax (X, Y ) = sup For each u 6= 0, the function uT Xu/uT Y u is linear-fractional in (X, Y ), hence a quasiconvex function of (X, Y ). We conclude that λmax is quasiconvex, since it is the supremum of a family of quasiconvex functions. Composition If g : Rn → R is quasiconvex and h : R → R is nondecreasing, then f = h ◦ g is quasiconvex. The composition of a quasiconvex function with an affine or linear-fractional transformation yields a quasiconvex function. If f is quasiconvex, then g(x) = f (Ax + b) is quasiconvex, and g̃(x) = f ((Ax + b)/(cT x + d)) is quasiconvex on the set {x | cT x + d > 0, (Ax + b)/(cT x + d) ∈ dom f }. Minimization If f (x, y) is quasiconvex jointly in x and y and C is a convex set, then the function g(x) = inf f (x, y) y∈C is quasiconvex. To show this, we need to show that {x | g(x) ≤ α} is convex, where α ∈ R is arbitrary. From the definition of g, g(x) ≤ α if and only if for any ǫ > 0 there exists 3.4 Quasiconvex functions 103 a y ∈ C with f (x, y) ≤ α + ǫ. Now let x1 and x2 be two points in the α-sublevel set of g. Then for any ǫ > 0, there exists y1 , y2 ∈ C with f (x1 , y1 ) ≤ α + ǫ, f (x2 , y2 ) ≤ α + ǫ, and since f is quasiconvex in x and y, we also have f (θx1 + (1 − θ)x2 , θy1 + (1 − θ)y2 ) ≤ α + ǫ, for 0 ≤ θ ≤ 1. Hence g(θx1 + (1 − θ)x2 ) ≤ α, which proves that {x | g(x) ≤ α} is convex. 3.4.5 Representation via family of convex functions In the sequel, it will be convenient to represent the sublevel sets of a quasiconvex function f (which are convex) via inequalities of convex functions. We seek a family of convex functions φt : Rn → R, indexed by t ∈ R, with f (x) ≤ t ⇐⇒ φt (x) ≤ 0, (3.23) i.e., the t-sublevel set of the quasiconvex function f is the 0-sublevel set of the convex function φt . Evidently φt must satisfy the property that for all x ∈ Rn , φt (x) ≤ 0 =⇒ φs (x) ≤ 0 for s ≥ t. This is satisfied if for each x, φt (x) is a nonincreasing function of t, i.e., φs (x) ≤ φt (x) whenever s ≥ t. To see that such a representation always exists, we can take φt (x) = 0 ∞ f (x) ≤ t otherwise, i.e., φt is the indicator function of the t-sublevel of f . Obviously this representation is not unique; for example if the sublevel sets of f are closed, we can take φt (x) = dist (x, {z | f (z) ≤ t}) . We are usually interested in a family φt with nice properties, such as differentiability. Example 3.38 Convex over concave function. Suppose p is a convex function, q is a concave function, with p(x) ≥ 0 and q(x) > 0 on a convex set C. Then the function f defined by f (x) = p(x)/q(x), on C, is quasiconvex. Here we have f (x) ≤ t ⇐⇒ p(x) − tq(x) ≤ 0, so we can take φt (x) = p(x) − tq(x) for t ≥ 0. For each t, φt is convex and for each x, φt (x) is decreasing in t. 104 3 3.5 Log-concave and log-convex functions 3.5.1 Definition Convex functions A function f : Rn → R is logarithmically concave or log-concave if f (x) > 0 for all x ∈ dom f and log f is concave. It is said to be logarithmically convex or log-convex if log f is convex. Thus f is log-convex if and only if 1/f is logconcave. It is convenient to allow f to take on the value zero, in which case we take log f (x) = −∞. In this case we say f is log-concave if the extended-value function log f is concave. We can express log-concavity directly, without logarithms: a function f : Rn → R, with convex domain and f (x) > 0 for all x ∈ dom f , is log-concave if and only if for all x, y ∈ dom f and 0 ≤ θ ≤ 1, we have f (θx + (1 − θ)y) ≥ f (x)θ f (y)1−θ . In particular, the value of a log-concave function at the average of two points is at least the geometric mean of the values at the two points. From the composition rules we know that eh is convex if h is convex, so a logconvex function is convex. Similarly, a nonnegative concave function is log-concave. It is also clear that a log-convex function is quasiconvex and a log-concave function is quasiconcave, since the logarithm is monotone increasing. Example 3.39 Some simple examples of log-concave and log-convex functions. • Affine function. f (x) = aT x + b is log-concave on {x | aT x + b > 0}. • Powers. f (x) = xa , on R++ , is log-convex for a ≤ 0, and log-concave for a ≥ 0. • Exponentials. f (x) = eax is log-convex and log-concave. • The cumulative distribution function of a Gaussian density, 1 Φ(x) = √ 2π Z x 2 e−u /2 du, −∞ is log-concave (see exercise 3.54). • Gamma function. The Gamma function, Γ(x) = Z ∞ ux−1 e−u du, 0 is log-convex for x ≥ 1 (see exercise 3.52). • Determinant. det X is log concave on Sn ++ . • Determinant over trace. det X/ tr X is log concave on Sn ++ (see exercise 3.49). Example 3.40 Log-concave density functions. Many common probability density functions are log-concave. Two examples are the multivariate normal distribution, f (x) = p 1 (2π)n det Σ 1 T e− 2 (x−x̄) Σ −1 (x−x̄) 3.5 Log-concave and log-convex functions 105 n (where x̄ ∈ Rn and Σ ∈ Sn ++ ), and the exponential distribution on R+ , n Y f (x) = λi i=1 ! T e−λ x (where λ ≻ 0). Another example is the uniform distribution over a convex set C, f (x) = 1/α 0 x∈C x 6∈ C where α = vol(C) is the volume (Lebesgue measure) of C. In this case log f takes on the value −∞ outside C, and − log α on C, hence is concave. As a more exotic example consider the Wishart distribution, defined as follows. Let x1 , . . . , xp ∈ Rn be independent Gaussian random vectors Pp with zero mean and covariance Σ ∈ Sn , with p > n. The random matrix X = i=1 xi xTi has the Wishart density −1 1 f (X) = a (det X)(p−n−1)/2 e− 2 tr(Σ X) , with dom f = Sn ++ , and a is a positive constant. The Wishart density is log-concave, since p−n−1 1 log f (X) = log a + log det X − tr(Σ−1 X), 2 2 which is a concave function of X. 3.5.2 Properties Twice differentiable log-convex/concave functions Suppose f is twice differentiable, with dom f convex, so ∇2 log f (x) = 1 1 ∇2 f (x) − ∇f (x)∇f (x)T . f (x) f (x)2 We conclude that f is log-convex if and only if for all x ∈ dom f , f (x)∇2 f (x) ∇f (x)∇f (x)T , and log-concave if and only if for all x ∈ dom f , f (x)∇2 f (x) ∇f (x)∇f (x)T . Multiplication, addition, and integration Log-convexity and log-concavity are closed under multiplication and positive scaling. For example, if f and g are log-concave, then so is the pointwise product h(x) = f (x)g(x), since log h(x) = log f (x) + log g(x), and log f (x) and log g(x) are concave functions of x. Simple examples show that the sum of log-concave functions is not, in general, log-concave. Log-convexity, however, is preserved under sums. Let f and g be logconvex functions, i.e., F = log f and G = log g are convex. From the composition rules for convex functions, it follows that log (exp F + exp G) = log(f + g) 106 3 Convex functions is convex. Therefore the sum of two log-convex functions is log-convex. More generally, if f (x, y) is log-convex in x for each y ∈ C then Z g(x) = f (x, y) dy C is log-convex. Example 3.41 Laplace transform of a nonnegative function and the moment and cumulant generating functions. Suppose p : Rn → R satisfies p(x) ≥ 0 for all x. The Laplace transform of p, Z P (z) = T p(x)e−z x dx, is log-convex on Rn . (Here dom P is, naturally, {z | P (z) < ∞}.) R Now suppose p is a density, i.e., satisfies p(x) dx = 1. The function M (z) = P (−z) is called the moment generating function of the density. It gets its name from the fact that the moments of the density can be found from the derivatives of the moment generating function, evaluated at z = 0, e.g., ∇M (0) = E v, ∇2 M (0) = E vv T , where v is a random variable with density p. The function log M (z), which is convex, is called the cumulant generating function for p, since its derivatives give the cumulants of the density. For example, the first and second derivatives of the cumulant generating function, evaluated at zero, are the mean and covariance of the associated random variable: ∇ log M (0) = E v, ∇2 log M (0) = E(v − E v)(v − E v)T . Integration of log-concave functions In some special cases log-concavity is preserved by integration. If f : Rn ×Rm → R is log-concave, then Z g(x) = f (x, y) dy is a log-concave function of x (on Rn ). (The integration here is over Rm .) A proof of this result is not simple; see the references. This result has many important consequences, some of which we describe in the rest of this section. It implies, for example, that marginal distributions of logconcave probability densities are log-concave. It also implies that log-concavity is closed under convolution, i.e., if f and g are log-concave on Rn , then so is the convolution Z (f ∗ g)(x) = f (x − y)g(y) dy. (To see this, note that g(y) and f (x−y) are log-concave in (x, y), hence the product f (x − y)g(y) is; then the integration result applies.) 3.5 Log-concave and log-convex functions 107 Suppose C ⊆ Rn is a convex set and w is a random vector in Rn with logconcave probability density p. Then the function f (x) = prob(x + w ∈ C) is log-concave in x. To see this, express f as Z f (x) = g(x + w)p(w) dw, where g is defined as g(u) = 1 u∈C 0 u∈ 6 C, (which is log-concave) and apply the integration result. Example 3.42 The cumulative distribution function of a probability density function f : Rn → R is defined as F (x) = prob(w x) = Z xn −∞ ··· Z x1 −∞ f (z) dz1 · · · dzn , where w is a random variable with density f . If f is log-concave, then F is logconcave. We have already encountered a special case: the cumulative distribution function of a Gaussian random variable, 1 f (x) = √ 2π Z x 2 e−t /2 dt, −∞ is log-concave. (See example 3.39 and exercise 3.54.) Example 3.43 Yield function. Let x ∈ Rn denote the nominal or target value of a set of parameters of a product that is manufactured. Variation in the manufacturing process causes the parameters of the product, when manufactured, to have the value x + w, where w ∈ Rn is a random vector that represents manufacturing variation, and is usually assumed to have zero mean. The yield of the manufacturing process, as a function of the nominal parameter values, is given by Y (x) = prob(x + w ∈ S), where S ⊆ Rn denotes the set of acceptable parameter values for the product, i.e., the product specifications. If the density of the manufacturing error w is log-concave (for example, Gaussian) and the set S of product specifications is convex, then the yield function Y is log-concave. This implies that the α-yield region, defined as the set of nominal parameters for which the yield exceeds α, is convex. For example, the 95% yield region {x | Y (x) ≥ 0.95} = {x | log Y (x) ≥ log 0.95} is convex, since it is a superlevel set of the concave function log Y . 108 3 Convex functions Example 3.44 Volume of polyhedron. Let A ∈ Rm×n . Define Pu = {x ∈ Rn | Ax u}. Then its volume vol Pu is a log-concave function of u. To prove this, note that the function Ψ(x, u) = 1 0 Ax u otherwise, is log-concave. By the integration result, we conclude that Z Ψ(x, u) dx = vol Pu is log-concave. 3.6 Convexity with respect to generalized inequalities We now consider generalizations of the notions of monotonicity and convexity, using generalized inequalities instead of the usual ordering on R. 3.6.1 Monotonicity with respect to a generalized inequality Suppose K ⊆ Rn is a proper cone with associated generalized inequality K . A function f : Rn → R is called K-nondecreasing if x K y =⇒ f (x) ≤ f (y), and K-increasing if x K y, x 6= y =⇒ f (x) < f (y). We define K-nonincreasing and K-decreasing functions in a similar way. Example 3.45 Monotone vector functions. A function f : Rn → R is nondecreasing with respect to Rn + if and only if x1 ≤ y1 , . . . , xn ≤ yn =⇒ f (x) ≤ f (y) for all x, y. This is the same as saying that f , when restricted to any component xi (i.e., xi is considered the variable while xj for j 6= i are fixed), is nondecreasing. Example 3.46 Matrix monotone functions. A function f : Sn → R is called matrix monotone (increasing, decreasing) if it is monotone with respect to the positive semidefinite cone. Some examples of matrix monotone functions of the variable X ∈ Sn : 3.6 Convexity with respect to generalized inequalities • tr(W X), where W ∈ Sn , is matrix nondecreasing if W 0, and matrix increasing if W ≻ 0 (it is matrix nonincreasing if W 0, and matrix decreasing if W ≺ 0). • tr(X −1 ) is matrix decreasing on Sn ++ . n • det X is matrix increasing on S++ , and matrix nondecreasing on Sn +. Gradient conditions for monotonicity Recall that a differentiable function f : R → R, with convex (i.e., interval) domain, is nondecreasing if and only if f ′ (x) ≥ 0 for all x ∈ dom f , and increasing if f ′ (x) > 0 for all x ∈ dom f (but the converse is not true). These conditions are readily extended to the case of monotonicity with respect to a generalized inequality. A differentiable function f , with convex domain, is K-nondecreasing if and only if ∇f (x) K ∗ 0 (3.24) for all x ∈ dom f . Note the difference with the simple scalar case: the gradient must be nonnegative in the dual inequality. For the strict case, we have the following: If (3.25) ∇f (x) ≻K ∗ 0 for all x ∈ dom f , then f is K-increasing. As in the scalar case, the converse is not true. Let us prove these first-order conditions for monotonicity. First, assume that f satisfies (3.24) for all x, but is not K-nondecreasing, i.e., there exist x, y with x K y and f (y) < f (x). By differentiability of f there exists a t ∈ [0, 1] with d f (x + t(y − x)) = ∇f (x + t(y − x))T (y − x) < 0. dt Since y − x ∈ K this means ∇f (x + t(y − x)) 6∈ K ∗ , which contradicts our assumption that (3.24) is satisfied everywhere. In a similar way it can be shown that (3.25) implies f is K-increasing. It is also straightforward to see that it is necessary that (3.24) hold everywhere. Assume (3.24) does not hold for x = z. By the definition of dual cone this means there exists a v ∈ K with ∇f (z)T v < 0. Now consider h(t) = f (z + tv) as a function of t. We have h′ (0) = ∇f (z)T v < 0, and therefore there exists t > 0 with h(t) = f (z + tv) < h(0) = f (z), which means f is not K-nondecreasing. 3.6.2 Convexity with respect to a generalized inequality Suppose K ⊆ Rm is a proper cone with associated generalized inequality K . We say f : Rn → Rm is K-convex if for all x, y, and 0 ≤ θ ≤ 1, f (θx + (1 − θ)y) K θf (x) + (1 − θ)f (y). 109 110 3 Convex functions The function is strictly K-convex if f (θx + (1 − θ)y) ≺K θf (x) + (1 − θ)f (y) for all x 6= y and 0 < θ < 1. These definitions reduce to ordinary convexity and strict convexity when m = 1 (and K = R+ ). Example 3.47 Convexity with respect to componentwise inequality. A function f : Rn → Rm is convex with respect to componentwise inequality (i.e., the generalized inequality induced by Rm + ) if and only if for all x, y and 0 ≤ θ ≤ 1, f (θx + (1 − θ)y) θf (x) + (1 − θ)f (y), i.e., each component fi is a convex function. The function f is strictly convex with respect to componentwise inequality if and only if each component fi is strictly convex. Example 3.48 Matrix convexity. Suppose f is a symmetric matrix valued function, i.e., f : Rn → Sm . The function f is convex with respect to matrix inequality if f (θx + (1 − θ)y) θf (x) + (1 − θ)f (y) for any x and y, and for θ ∈ [0, 1]. This is sometimes called matrix convexity. An equivalent definition is that the scalar function z T f (x)z is convex for all vectors z. (This is often a good way to prove matrix convexity). A matrix function is strictly matrix convex if f (θx + (1 − θ)y) ≺ θf (x) + (1 − θ)f (y) when x 6= y and 0 < θ < 1, or, equivalently, if z T f z is strictly convex for every z 6= 0. Some examples: • The function f (X) = XX T where X ∈ Rn×m is matrix convex, since for fixed z the function z T XX T z = kX T zk22 is a convex quadratic function of (the components of) X. For the same reason, f (X) = X 2 is matrix convex on Sn . • The function X p is matrix convex on Sn ++ for 1 ≤ p ≤ 2 or −1 ≤ p ≤ 0, and matrix concave for 0 ≤ p ≤ 1. • The function f (X) = eX is not matrix convex on Sn , for n ≥ 2. Many of the results for convex functions have extensions to K-convex functions. As a simple example, a function is K-convex if and only if its restriction to any line in its domain is K-convex. In the rest of this section we list a few results for K-convexity that we will use later; more results are explored in the exercises. Dual characterization of K-convexity A function f is K-convex if and only if for every w K ∗ 0, the (real-valued) function wT f is convex (in the ordinary sense); f is strictly K-convex if and only if for every nonzero w K ∗ 0 the function wT f is strictly convex. (These follow directly from the definitions and properties of dual inequality.) 3.6 Convexity with respect to generalized inequalities Differentiable K-convex functions A differentiable function f is K-convex if and only if its domain is convex, and for all x, y ∈ dom f , f (y) K f (x) + Df (x)(y − x). (Here Df (x) ∈ Rm×n is the derivative or Jacobian matrix of f at x; see §A.4.1.) The function f is strictly K-convex if and only if for all x, y ∈ dom f with x 6= y, f (y) ≻K f (x) + Df (x)(y − x). Composition theorem Many of the results on composition can be generalized to K-convexity. For example, if g : Rn → Rp is K-convex, h : Rp → R is convex, and h̃ (the extended-value extension of h) is K-nondecreasing, then h ◦ g is convex. This generalizes the fact that a nondecreasing convex function of a convex function is convex. The condition that h̃ be K-nondecreasing implies that dom h − K = dom h. Example 3.49 The quadratic matrix function g : Rm×n → Sn defined by g(X) = X T AX + B T X + X T B + C, where A ∈ Sm , B ∈ Rm×n , and C ∈ Sn , is convex when A 0. The function h : Sn → R defined by h(Y ) = − log det(−Y ) is convex and increasing on dom h = −Sn ++ . By the composition theorem, we conclude that f (X) = − log det(−(X T AX + B T X + X T B + C)) is convex on dom f = {X ∈ Rm×n | X T AX + B T X + X T B + C ≺ 0}. This generalizes the fact that − log(−(ax2 + bx + c)) is convex on provided a ≥ 0. {x ∈ R | ax2 + bx + c < 0}, 111 112 3 Convex functions Bibliography The standard reference on convex analysis is Rockafellar [Roc70]. Other books on convex functions are Stoer and Witzgall [SW70], Roberts and Varberg [RV73], Van Tiel [vT84], Hiriart-Urruty and Lemaréchal [HUL93], Ekeland and Témam [ET99], Borwein and Lewis [BL00], Florenzano and Le Van [FL01], Barvinok [Bar02], and Bertsekas, Nedić, and Ozdaglar [Ber03]. Most nonlinear programming texts also include chapters on convex functions (see, for example, Mangasarian [Man94], Bazaraa, Sherali, and Shetty [BSS93], Bertsekas [Ber99], Polyak [Pol87], and Peressini, Sullivan, and Uhl [PSU88]). Jensen’s inequality appears in [Jen06]. A general study of inequalities, in which Jensen’s inequality plays a central role, is presented by Hardy, Littlewood, and Pólya [HLP52], and Beckenbach and Bellman [BB65]. The term perspective function is from Hiriart-Urruty and Lemaréchal [HUL93, volume 1, page 100]. For the definitions in example 3.19 (relative entropy and Kullback-Leibler divergence), and the related exercise 3.13, see Cover and Thomas [CT91]. Some important early references on quasiconvex functions (as well as other extensions of convexity) are Nikaidô [Nik54], Mangasarian [Man94, chapter 9], Arrow and Enthoven [AE61], Ponstein [Pon67], and Luenberger [Lue68]. For a more comprehensive reference list, we refer to Bazaraa, Sherali, and Shetty [BSS93, page 126]. Prékopa [Pré80] gives a survey of log-concave functions. Log-convexity of the Laplace transform is mentioned in Barndorff-Nielsen [BN78, §7]. For a proof of the integration result of log-concave functions, see Prékopa [Pré71, Pré73]. Generalized inequalities are used extensively in the recent literature on cone programming, starting with Nesterov and Nemirovski [NN94, page 156]; see also Ben-Tal and Nemirovski [BTN01] and the references at the end of chapter 4. Convexity with respect to generalized inequalities also appears in the work of Luenberger [Lue69, §8.2] and Isii [Isi64]. Matrix monotonicity and matrix convexity are attributed to Löwner [Löw34], and are discussed in detail by Davis [Dav63], Roberts and Varberg [RV73, page 216] and Marshall and Olkin [MO79, §16E]. For the result on convexity and concavity of the function X p in example 3.48, see Bondar [Bon94, theorem 16.1]. For a simple example that demonstrates that eX is not matrix convex, see Marshall and Olkin [MO79, page 474]. Exercises 113 Exercises Definition of convexity 3.1 Suppose f : R → R is convex, and a, b ∈ dom f with a < b. (a) Show that f (x) ≤ for all x ∈ [a, b]. b−x x−a f (a) + f (b) b−a b−a (b) Show that f (b) − f (a) f (b) − f (x) f (x) − f (a) ≤ ≤ x−a b−a b−x for all x ∈ (a, b). Draw a sketch that illustrates this inequality. (c) Suppose f is differentiable. Use the result in (b) to show that f ′ (a) ≤ f (b) − f (a) ≤ f ′ (b). b−a Note that these inequalities also follow from (3.2): f (b) ≥ f (a) + f ′ (a)(b − a), f (a) ≥ f (b) + f ′ (b)(a − b). (d) Suppose f is twice differentiable. Use the result in (c) to show that f ′′ (a) ≥ 0 and f ′′ (b) ≥ 0. 3.2 Level sets of convex, concave, quasiconvex, and quasiconcave functions. Some level sets of a function f are shown below. The curve labeled 1 shows {x | f (x) = 1}, etc. 3 2 1 Could f be convex (concave, quasiconvex, quasiconcave)? Explain your answer. Repeat for the level curves shown below. 1 2 3 4 5 6 114 3 Convex functions 3.3 Inverse of an increasing convex function. Suppose f : R → R is increasing and convex on its domain (a, b). Let g denote its inverse, i.e., the function with domain (f (a), f (b)) and g(f (x)) = x for a < x < b. What can you say about convexity or concavity of g? 3.4 [RV73, page 15] Show that a continuous function f : Rn → R is convex if and only if for every line segment, its average value on the segment is less than or equal to the average of its values at the endpoints of the segment: For every x, y ∈ Rn , Z 1 0 f (x + λ(y − x)) dλ ≤ f (x) + f (y) . 2 3.5 [RV73, page 22] Running average of a convex function. Suppose f : R → R is convex, with R+ ⊆ dom f . Show that its running average F , defined as 1 F (x) = x Z x f (t) dt, dom F = R++ , 0 is convex. Hint. For each s, f (sx) is convex in x, so R1 0 f (sx) ds is convex. 3.6 Functions and epigraphs. When is the epigraph of a function a halfspace? When is the epigraph of a function a convex cone? When is the epigraph of a function a polyhedron? 3.7 Suppose f : Rn → R is convex with dom f = Rn , and bounded above on Rn . Show that f is constant. 3.8 Second-order condition for convexity. Prove that a twice differentiable function f is convex if and only if its domain is convex and ∇2 f (x) 0 for all x ∈ dom f . Hint. First consider the case f : R → R. You can use the first-order condition for convexity (which was proved on page 70). 3.9 Second-order conditions for convexity on an affine set. Let F ∈ Rn×m , x̂ ∈ Rn . The restriction of f : Rn → R to the affine set {F z + x̂ | z ∈ Rm } is defined as the function f˜ : Rm → R with f˜(z) = f (F z + x̂), dom f˜ = {z | F z + x̂ ∈ dom f }. Suppose f is twice differentiable with a convex domain. (a) Show that f˜ is convex if and only if for all z ∈ dom f˜ F T ∇2 f (F z + x̂)F 0. (b) Suppose A ∈ Rp×n is a matrix whose nullspace is equal to the range of F , i.e., AF = 0 and rank A = n − rank F . Show that f˜ is convex if and only if for all z ∈ dom f˜ there exists a λ ∈ R such that ∇2 f (F z + x̂) + λAT A 0. Hint. Use the following result: If B ∈ Sn and A ∈ Rp×n , then xT Bx ≥ 0 for all x ∈ N (A) if and only if there exists a λ such that B + λAT A 0. 3.10 An extension of Jensen’s inequality. One interpretation of Jensen’s inequality is that randomization or dithering hurts, i.e., raises the average value of a convex function: For f convex and v a zero mean random variable, we have E f (x0 + v) ≥ f (x0 ). This leads to the following conjecture. If f0 is convex, then the larger the variance of v, the larger E f (x0 + v). (a) Give a counterexample that shows that this conjecture is false. Find zero mean random variables v and w, with var(v) > var(w), a convex function f , and a point x0 , such that E f (x0 + v) < E f (x0 + w). Exercises 115 (b) The conjecture is true when v and w are scaled versions of each other. Show that E f (x0 + tv) is monotone increasing in t ≥ 0, when f is convex and v is zero mean. 3.11 Monotone mappings. A function ψ : Rn → Rn is called monotone if for all x, y ∈ dom ψ, (ψ(x) − ψ(y))T (x − y) ≥ 0. (Note that ‘monotone’ as defined here is not the same as the definition given in §3.6.1. Both definitions are widely used.) Suppose f : Rn → R is a differentiable convex function. Show that its gradient ∇f is monotone. Is the converse true, i.e., is every monotone mapping the gradient of a convex function? 3.12 Suppose f : Rn → R is convex, g : Rn → R is concave, dom f = dom g = Rn , and for all x, g(x) ≤ f (x). Show that there exists an affine function h such that for all x, g(x) ≤ h(x) ≤ f (x). In other words, if a concave function g is an underestimator of a convex function f , then we can fit an affine function between f and g. 3.13 Kullback-Leibler divergence and the information inequality. Let Dkl be the KullbackLeibler divergence, as defined in (3.17). Prove the information inequality: Dkl (u, v) ≥ 0 for all u, v ∈ Rn ++ . Also show that Dkl (u, v) = 0 if and only if u = v. Hint. The Kullback-Leibler divergence can be expressed as where f (v) = Pn i=1 Dkl (u, v) = f (u) − f (v) − ∇f (v)T (u − v), vi log vi is the negative entropy of v. 3.14 Convex-concave functions and saddle-points. We say the function f : Rn × Rm → R is convex-concave if f (x, z) is a concave function of z, for each fixed x, and a convex function of x, for each fixed z. We also require its domain to have the product form dom f = A × B, where A ⊆ Rn and B ⊆ Rm are convex. (a) Give a second-order condition for a twice differentiable function f : Rn × Rm → R to be convex-concave, in terms of its Hessian ∇2 f (x, z). (b) Suppose that f : Rn ×Rm → R is convex-concave and differentiable, with ∇f (x̃, z̃) = 0. Show that the saddle-point property holds: for all x, z, we have f (x̃, z) ≤ f (x̃, z̃) ≤ f (x, z̃). Show that this implies that f satisfies the strong max-min property: sup inf f (x, z) = inf sup f (x, z) z x x z (and their common value is f (x̃, z̃)). (c) Now suppose that f : Rn × Rm → R is differentiable, but not necessarily convexconcave, and the saddle-point property holds at x̃, z̃: f (x̃, z) ≤ f (x̃, z̃) ≤ f (x, z̃) for all x, z. Show that ∇f (x̃, z̃) = 0. Examples 3.15 A family of concave utility functions. For 0 < α ≤ 1 let uα (x) = xα − 1 , α with dom uα = R+ . We also define u0 (x) = log x (with dom u0 = R++ ). (a) Show that for x > 0, u0 (x) = limα→0 uα (x). 116 3 Convex functions (b) Show that uα are concave, monotone increasing, and all satisfy uα (1) = 0. These functions are often used in economics to model the benefit or utility of some quantity of goods or money. Concavity of uα means that the marginal utility (i.e., the increase in utility obtained for a fixed increase in the goods) decreases as the amount of goods increases. In other words, concavity models the effect of satiation. 3.16 For each of the following functions determine whether it is convex, concave, quasiconvex, or quasiconcave. (a) f (x) = ex − 1 on R. (b) f (x1 , x2 ) = x1 x2 on R2++ . (c) f (x1 , x2 ) = 1/(x1 x2 ) on R2++ . (d) f (x1 , x2 ) = x1 /x2 on R2++ . (e) f (x1 , x2 ) = x21 /x2 on R × R++ . 1−α , where 0 ≤ α ≤ 1, on R2++ . (f) f (x1 , x2 ) = xα 1 x2 3.17 Suppose p < 1, p 6= 0. Show that the function f (x) = n X p xi i=1 !1/p Pn 1/2 with dom f = Rn x )2 and ++ is concave. This includes as special cases f (x) = ( i=1 i Pn −1 the harmonic mean f (x) = ( i=1 1/xi ) . Hint. Adapt the proofs for the log-sum-exp function and the geometric mean in §3.1.5. 3.18 Adapt the proof of concavity of the log-determinant function in §3.1.5 to show the following. (a) f (X) = tr X −1 is convex on dom f = Sn ++ . (b) f (X) = (det X)1/n is concave on dom f = Sn ++ . 3.19 Nonnegative weighted sums and integrals. Pr (a) Show that f (x) = α x is a convex function of x, where α1 ≥ α2 ≥ · · · ≥ i=1 i [i] αr ≥ 0, and x[i] denotes the ith largest component of x. (You can use the fact that Pk f (x) = i=1 x[i] is convex on Rn .) (b) Let T (x, ω) denote the trigonometric polynomial T (x, ω) = x1 + x2 cos ω + x3 cos 2ω + · · · + xn cos(n − 1)ω. Show that the function f (x) = − Z 2π log T (x, ω) dω 0 is convex on {x ∈ Rn | T (x, ω) > 0, 0 ≤ ω ≤ 2π}. 3.20 Composition with an affine function. Show that the following functions f : Rn → R are convex. (a) f (x) = kAx − bk, where A ∈ Rm×n , b ∈ Rm , and k · k is a norm on Rm . (b) f (x) = − (det(A0 + x1 A1 + · · · + xn An ))1/m , on {x | A0 + x1 A1 + · · · + xn An ≻ 0}, where Ai ∈ Sm . (c) f (X) = tr (A0 + x1 A1 + · · · + xn An )−1 , on {x | A0 +x1 A1 +· · · +xn An ≻ 0}, where Ai ∈ Sm . (Use the fact that tr(X −1 ) is convex on Sm ++ ; see exercise 3.18.) Exercises 117 3.21 Pointwise maximum and supremum. Show that the following functions f : Rn → R are convex. (a) f (x) = maxi=1,...,k kA(i) x − b(i) k, where A(i) ∈ Rm×n , b(i) ∈ Rm and k · k is a norm on Rm . Pr (b) f (x) = |x|[i] on Rn , where |x| denotes the vector with |x|i = |xi | (i.e., |x| is i=1 the absolute value of x, componentwise), and |x|[i] is the ith largest component of |x|. In other words, |x|[1] , |x|[2] , . . . , |x|[n] are the absolute values of the components of x, sorted in nonincreasing order. 3.22 Composition rules. Show that the following functions are convex. Pm Pm T T (a) f (x) = − log(− log( P eai x+bi )) on dom f = {x | eai x+bi < 1}. You can i=1 i=1 n yi use the fact that log( i=1 e ) is convex. √ (b) f (x, u, v) = − uv − xT x on dom f = {(x, u, v) | uv > xT x, u, v > 0}. Use the √ fact that xT x/u is convex in (x, u) for u > 0, and that − x1 x2 is convex on R2++ . (c) f (x, u, v) = − log(uv − xT x) on dom f = {(x, u, v) | uv > xT x, u, v > 0}. (d) f (x, t) = −(tp − kxkpp )1/p where p > 1 and dom f = {(x, t) | t ≥ kxkp }. You can use the fact that kxkpp /up−1 is convex in (x, u) for u > 0 (see exercise 3.23), and that −x1/p y 1−1/p is convex on R2+ (see exercise 3.16). (e) f (x, t) = − log(tp − kxkpp ) where p > 1 and dom f = {(x, t) | t > kxkp }. You can use the fact that kxkpp /up−1 is convex in (x, u) for u > 0 (see exercise 3.23). 3.23 Perspective of a function. (a) Show that for p > 1, f (x, t) = kxkpp |x1 |p + · · · + |xn |p = tp−1 tp−1 is convex on {(x, t) | t > 0}. (b) Show that kAx + bk22 cT x + d is convex on {x | cT x + d > 0}, where A ∈ Rm×n , b ∈ Rm , c ∈ Rn and d ∈ R. f (x) = 3.24 Some functions on the probability simplex. Let x be a real-valued random variable which takes values in {a1 , . . . , an } where a1 < a2 < · · · < an , with prob(x = ai ) = pi , i = 1, . . . , n. For each of the following functions of p (on the probability simplex {p ∈ T Rn + | 1 p = 1}), determine if the function is convex, concave, quasiconvex, or quasiconcave. (a) E x. (b) prob(x ≥ α). (c) prob(α ≤ x ≤ β). (d) Pn i=1 pi log pi , the negative entropy of the distribution. (e) var x = E(x − E x)2 . (f) quartile(x) = inf{β | prob(x ≤ β) ≥ 0.25}. (g) The cardinality of the smallest set A ⊆ {a1 , . . . , an } with probability ≥ 90%. (By cardinality we mean the number of elements in A.) (h) The minimum width interval that contains 90% of the probability, i.e., inf {β − α | prob(α ≤ x ≤ β) ≥ 0.9} . 118 3 Convex functions 3.25 Maximum probability distance between distributions. Let p, q ∈ Rn represent two probability distributions on {1, . . . , n} (so p, q 0, 1T p = 1T q = 1). We define the maximum probability distance dmp (p, q) between p and q as the maximum difference in probability assigned by p and q, over all events: dmp (p, q) = max{| prob(p, C) − prob(q, C)| | C ⊆ {1, . . . , n}}. Here prob(p, C) is the probability of C, under the distribution p, i.e., prob(p, C) = P p. i∈C i Pn Find a simple expression for dmp , involving kp − qk1 = i=1 |pi − qi |, and show that dmp is a convex function on Rn × Rn . (Its domain is {(p, q) | p, q 0, 1T p = 1T q = 1}, but it has a natural extension to all of Rn × Rn .) 3.26 More functions of eigenvalues. Let λ1 (X) ≥ λ2 (X) ≥ · · · ≥ λn (X) denote the eigenvalues of a matrix X ∈ Sn . We have already seen several functions of the eigenvalues that are convex or concave functions of X. • The maximum eigenvalue λ1 (X) is convex (example 3.10). The minimum eigenvalue λn (X) is concave. • The sum of the eigenvalues (or trace), tr X = λ1 (X) + · · · + λn (X), is linear. • The eigenvalues (or trace of the inverse), tr(X −1 ) = Pn sum of the inverses of the n 1/λi (X), is convex on S++ (exercise 3.18). i=1 Qn • The geometric mean of the eigenvalues, (det X)1/n =P( i=1 λi (X))1/n , and the n logarithm of the product of the eigenvalues, log det X = i=1 log λi (X), are concave n on X ∈ S++ (exercise 3.18 and page 74). In this problem we explore some more functions of eigenvalues, by exploiting variational characterizations. Pk (a) Sum of k largest eigenvalues. Show that i=1 λi (X) is convex on Sn . Hint. [HJ85, page 191] Use the variational characterization k X i=1 λi (X) = sup{tr(V T XV ) | V ∈ Rn×k , V T V = I}. Qn (b) Geometric mean of k smallest eigenvalues. Show that ( i=n−k+1 λi (X))1/k is concave on Sn ++ . Hint. [MO79, page 513] For X ≻ 0, we have n Y !1/k λi (X) i=n−k+1 = 1 inf{tr(V T XV ) | V ∈ Rn×k , det V T V = 1}. k (c) Log of product of k smallest eigenvalues. Show that on Sn ++ . Hint. [MO79, page 513] For X ≻ 0, n Y λi (X) = inf i=n−k+1 ( k Y (V XV )ii z w T i=1 Pn i=n−k+1 V ∈R n×k log λi (X) is concave T , V V =I ) . 3.27 Diagonal elements of Cholesky factor. Each X ∈ Sn ++ has a unique Cholesky factorization X = LLT , where L is lower triangular, with Lii > 0. Show that Lii is a concave function of X (with domain Sn ++ ). Hint. Lii can be expressed as Lii = (w − z T Y −1 z)1/2 , where is the leading i × i submatrix of X. Y zT Exercises 119 Operations that preserve convexity 3.28 Expressing a convex function as the pointwise supremum of a family of affine functions. In this problem we extend the result proved on page 83 to the case where dom f 6= Rn . Let f : Rn → R be a convex function. Define f˜ : Rn → R as the pointwise supremum of all affine functions that are global underestimators of f : f˜(x) = sup{g(x) | g affine, g(z) ≤ f (z) for all z}. (a) Show that f (x) = f˜(x) for x ∈ int dom f . (b) Show that f = f˜ if f is closed (i.e., epi f is a closed set; see §A.3.3). 3.29 Representation of piecewise-linear convex functions. A function f : Rn → R, with dom f = Rn , is called piecewise-linear if there exists a partition of Rn as Rn = X1 ∪ X2 ∪ · · · ∪ XL , where int Xi 6= ∅ and int Xi ∩ int Xj = ∅ for i 6= j, and a family of affine functions aT1 x + b1 , . . . , aTL x + bL such that f (x) = aTi x + bi for x ∈ Xi . Show that this means that f (x) = max{aT1 x + b1 , . . . , aTL x + bL }. 3.30 Convex hull or envelope of a function. The convex hull or convex envelope of a function f : Rn → R is defined as g(x) = inf{t | (x, t) ∈ conv epi f }. Geometrically, the epigraph of g is the convex hull of the epigraph of f . Show that g is the largest convex underestimator of f . In other words, show that if h is convex and satisfies h(x) ≤ f (x) for all x, then h(x) ≤ g(x) for all x. 3.31 [Roc70, page 35] Largest homogeneous underestimator. Let f be a convex function. Define the function g as f (αx) g(x) = inf . α>0 α (a) Show that g is homogeneous (g(tx) = tg(x) for all t ≥ 0). (b) Show that g is the largest homogeneous underestimator of f : If h is homogeneous and h(x) ≤ f (x) for all x, then we have h(x) ≤ g(x) for all x. (c) Show that g is convex. 3.32 Products and ratios of convex functions. In general the product or ratio of two convex functions is not convex. However, there are some results that apply to functions on R. Prove the following. (a) If f and g are convex, both nondecreasing (or nonincreasing), and positive functions on an interval, then f g is convex. (b) If f , g are concave, positive, with one nondecreasing and the other nonincreasing, then f g is concave. (c) If f is convex, nondecreasing, and positive, and g is concave, nonincreasing, and positive, then f /g is convex. 3.33 Direct proof of perspective theorem. Give a direct proof that the perspective function g, as defined in §3.2.6, of a convex function f is convex: Show that dom g is a convex set, and that for (x, t), (y, s) ∈ dom g, and 0 ≤ θ ≤ 1, we have g(θx + (1 − θ)y, θt + (1 − θ)s) ≤ θg(x, t) + (1 − θ)g(y, s). 3.34 The Minkowski function. The Minkowski function of a convex set C is defined as MC (x) = inf{t > 0 | t−1 x ∈ C}. 120 3 Convex functions (a) Draw a picture giving a geometric interpretation of how to find MC (x). (b) Show that MC is homogeneous, i.e., MC (αx) = αMC (x) for α ≥ 0. (c) What is dom MC ? (d) Show that MC is a convex function. (e) Suppose C is also closed, bounded, symmetric (if x ∈ C then −x ∈ C), and has nonempty interior. Show that MC is a norm. What is the corresponding unit ball? 3.35 Support function calculus. Recall that the support function of a set C ⊆ Rn is defined as SC (y) = sup{y T x | x ∈ C}. On page 81 we showed that SC is a convex function. (a) Show that SB = Sconv B . (b) Show that SA+B = SA + SB . (c) Show that SA∪B = max{SA , SB }. (d) Let B be closed and convex. Show that A ⊆ B if and only if SA (y) ≤ SB (y) for all y. Conjugate functions 3.36 Derive the conjugates of the following functions. (a) Max function. f (x) = maxi=1,...,n xi on Rn . (b) Sum of largest elements. f (x) = Pr i=1 x[i] on Rn . (c) Piecewise-linear function on R. f (x) = maxi=1,...,m (ai x + bi ) on R. You can assume that the ai are sorted in increasing order, i.e., a1 ≤ · · · ≤ am , and that none of the functions ai x + bi is redundant, i.e., for each k there is at least one x with f (x) = ak x + bk . (d) Power function. f (x) = xp on R++ , where p > 1. Repeat for p < 0. Q (e) Negative geometric mean. f (x) = −( xi )1/n on Rn ++ . (f) Negative generalized logarithm for second-order cone. f (x, t) = − log(t2 − xT x) on {(x, t) ∈ Rn × R | kxk2 < t}. 3.37 Show that the conjugate of f (X) = tr(X −1 ) with dom f = Sn ++ is given by f ∗ (Y ) = −2 tr(−Y )1/2 , dom f ∗ = −Sn +. Hint. The gradient of f is ∇f (X) = −X −2 . 3.38 Young’s inequality. Let f : R → R be an increasing function, with f (0) = 0, and let g be its inverse. Define F and G as F (x) = Z x 0 f (a) da, G(y) = Z y g(a) da. 0 Show that F and G are conjugates. Give a simple graphical interpretation of Young’s inequality, xy ≤ F (x) + G(y). 3.39 Properties of conjugate functions. (a) Conjugate of convex plus affine function. Define g(x) = f (x) + cT x + d, where f is convex. Express g ∗ in terms of f ∗ (and c, d). (b) Conjugate of perspective. Express the conjugate of the perspective of a convex function f in terms of f ∗ . Exercises 121 (c) Conjugate and minimization. Let f (x, z) be convex in (x, z) and define g(x) = inf z f (x, z). Express the conjugate g ∗ in terms of f ∗ . As an application, express the conjugate of g(x) = inf z {h(z) | Az + b = x}, where h is convex, in terms of h∗ , A, and b. (d) Conjugate of conjugate. Show that the conjugate of the conjugate of a closed convex function is itself: f = f ∗∗ if f is closed and convex. (A function is closed if its epigraph is closed; see §A.3.3.) Hint. Show that f ∗∗ is the pointwise supremum of all affine global underestimators of f . Then apply the result of exercise 3.28. 3.40 Gradient and Hessian of conjugate function. Suppose f : Rn → R is convex and twice continuously differentiable. Suppose ȳ and x̄ are related by ȳ = ∇f (x̄), and that ∇2 f (x̄) ≻ 0. (a) Show that ∇f ∗ (ȳ) = x̄. (b) Show that ∇2 f ∗ (ȳ) = ∇2 f (x̄)−1 . 3.41 Conjugate of negative normalized entropy. Show that the conjugate of the negative normalized entropy f (x) = n X xi log(xi /1T x), i=1 with dom f = Rn ++ , is given by ∗ f (y) = 0 +∞ Pn eyi ≤ 1 otherwise. i=1 Quasiconvex functions 3.42 Approximation width. Let f0 , . . . , fn : R → R be given continuous functions. We consider the problem of approximating f0 as a linear combination of f1 , . . . , fn . For x ∈ Rn , we say that f = x1 f1 + · · · + xn fn approximates f0 with tolerance ǫ > 0 over the interval [0, T ] if |f (t) − f0 (t)| ≤ ǫ for 0 ≤ t ≤ T . Now we choose a fixed tolerance ǫ > 0 and define the approximation width as the largest T such that f approximates f0 over the interval [0, T ]: W (x) = sup{T | |x1 f1 (t) + · · · + xn fn (t) − f0 (t)| ≤ ǫ for 0 ≤ t ≤ T }. Show that W is quasiconcave. 3.43 First-order condition for quasiconvexity. Prove the first-order condition for quasiconvexity given in §3.4.3: A differentiable function f : Rn → R, with dom f convex, is quasiconvex if and only if for all x, y ∈ dom f , f (y) ≤ f (x) =⇒ ∇f (x)T (y − x) ≤ 0. Hint. It suffices to prove the result for a function on R; the general result follows by restriction to an arbitrary line. 3.44 Second-order conditions for quasiconvexity. In this problem we derive alternate representations of the second-order conditions for quasiconvexity given in §3.4.3. Prove the following. (a) A point x ∈ dom f satisfies (3.21) if and only if there exists a σ such that ∇2 f (x) + σ∇f (x)∇f (x)T 0. (3.26) It satisfies (3.22) for all y 6= 0 if and only if there exists a σ such ∇2 f (x) + σ∇f (x)∇f (x)T ≻ 0. 2 Hint. We can assume without loss of generality that ∇ f (x) is diagonal. (3.27) 122 3 Convex functions (b) A point x ∈ dom f satisfies (3.21) if and only if either ∇f (x) = 0 and ∇2 f (x) 0, or ∇f (x) 6= 0 and the matrix H(x) = ∇2 f (x) ∇f (x)T ∇f (x) 0 has exactly one negative eigenvalue. It satisfies (3.22) for all y 6= 0 if and only if H(x) has exactly one nonpositive eigenvalue. Hint. You can use the result of part (a). The following result, which follows from the eigenvalue interlacing theorem in linear algebra, may also be useful: If B ∈ Sn and a ∈ Rn , then B a λn ≥ λn (B). aT 0 3.45 Use the first and second-order conditions for quasiconvexity given in §3.4.3 to verify quasiconvexity of the function f (x) = −x1 x2 , with dom f = R2++ . 3.46 Quasilinear functions with domain Rn . A function on R that is quasilinear (i.e., quasiconvex and quasiconcave) is monotone, i.e., either nondecreasing or nonincreasing. In this problem we consider a generalization of this result to functions on Rn . Suppose the function f : Rn → R is quasilinear and continuous with dom f = Rn . Show that it can be expressed as f (x) = g(aT x), where g : R → R is monotone and a ∈ Rn . In other words, a quasilinear function with domain Rn must be a monotone function of a linear function. (The converse is also true.) Log-concave and log-convex functions 3.47 Suppose f : Rn → R is differentiable, dom f is convex, and f (x) > 0 for all x ∈ dom f . Show that f is log-concave if and only if for all x, y ∈ dom f , f (y) ≤ exp f (x) ∇f (x)T (y − x) f (x) . 3.48 Show that if f : Rn → R is log-concave and a ≥ 0, then the function g = f − a is log-concave, where dom g = {x ∈ dom f | f (x) > a}. 3.49 Show that the following functions are log-concave. (a) Logistic function: f (x) = ex /(1 + ex ) with dom f = R. (b) Harmonic mean: f (x) = (c) Product over sum: 1 , 1/x1 + · · · + 1/xn Qn xi , f (x) = Pi=1 n x i=1 i dom f = Rn ++ . dom f = Rn ++ . (d) Determinant over trace: f (X) = det X , tr X dom f = Sn ++ . Exercises 123 3.50 Coefficients of a polynomial as a function of the roots. Show that the coefficients of a polynomial with real negative roots are log-concave functions of the roots. In other words, the functions ai : Rn → R, defined by the identity sn + a1 (λ)sn−1 + · · · + an−1 (λ)s + an (λ) = (s − λ1 )(s − λ2 ) · · · (s − λn ), are log-concave on −Rn ++ . Hint. The function Sk (x) = X 1≤i1