Tremendous amounts of data are generated and stored in many complex engineering and social systems every day. It is both important and feasible to exploit such big data to make better decisions through machine learning techniques. In this paper, we focus on batch reinforcement learning (RL) algorithms for discounted Markov decision processes (MDPs) with large discrete or continuous state spaces, aiming to learn the best possible policy given a fixed amount of training data. Batch RL algorithms with handcrafted feature representations work well for low-dimensional MDPs. However, many real-world RL tasks involve high-dimensional state spaces, for which it is difficult or even infeasible to design features for value function approximation by hand. To cope with high-dimensional RL problems, the desire to obtain data-driven features has motivated many works that incorporate feature selection and feature learning into traditional batch RL algorithms. In this paper, we provide a comprehensive survey of automatic feature selection and unsupervised feature learning for high-dimensional batch RL. Moreover, we present recent theoretical developments that apply statistical learning to establish finite-sample error bounds for batch RL algorithms based on weighted Lp norms. Finally, we discuss some future directions in the research of RL algorithms, theory and applications.
Since many real-world RL tasks involve high-dimensional state spaces, it is difficult to design features for function approximators by hand. To cope with high-dimensional RL problems, the desire for data-driven features has motivated many works that incorporate feature selection and feature learning into traditional batch RL algorithms. Automatic feature selection picks suitable features from a given candidate set, using techniques such as regularization, matching pursuit and random projection. Automatic feature learning instead learns features from data by capturing the structure of the state space with unsupervised learning methods such as manifold learning, spectral learning and deep learning. In this section, we present a comprehensive survey of these promising research works.
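As a concrete illustration of one of the selection techniques named above, the sketch below compresses a high-dimensional feature vector with a fixed random Gaussian matrix (the Johnson-Lindenstrauss idea behind random projection). The function name, the target dimension and the toy sparse features are illustrative assumptions, not taken from any specific surveyed algorithm.

```python
import random

def random_projection(features, d, seed=0):
    """Project an n-dimensional feature vector down to d dimensions
    using a fixed random Gaussian matrix shared across all states."""
    rng = random.Random(seed)
    n = len(features)
    # Entries scaled by 1/sqrt(d) so expected norms are roughly preserved.
    matrix = [[rng.gauss(0, 1 / d ** 0.5) for _ in range(n)] for _ in range(d)]
    return [sum(m * f for m, f in zip(row, features)) for row in matrix]

phi = [0.0] * 1000
phi[3] = 1.0                      # a sparse high-dimensional state feature
low = random_projection(phi, d=10)
```

Because the matrix is fixed by the seed, every state is mapped consistently, so the projected features can be fed to any linear batch RL method unchanged.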
A CNN is a multilayer neural network that reduces the number of weight parameters by sharing weights between local receptive fields; a pretraining phase is usually not required. Mnih et al. [101] presented a deep Q-learning algorithm that successfully plays Atari 2600 games. The algorithm learns control policies directly from high-dimensional raw video data without hand-designed features, using a CNN as the action-value function approximator. To scale to large data sets, stochastic gradient descent was used instead of batch updates to adapt the weights, and an experience replay mechanism was used to deal with correlated data and non-stationary distributions. This algorithm outperformed all previous approaches on six of the games and even surpassed a human expert on three of them.
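The experience replay idea can be sketched as a fixed-capacity buffer of transitions from which uncorrelated minibatches are drawn for Q-learning updates. This is a minimal illustration under assumed names, not the exact implementation of [101].

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s') transitions; old ones are evicted when full.
    Sampling at random breaks the temporal correlation of trajectories."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform minibatch over the whole stored history.
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=100)
for t in range(200):              # the first 100 transitions get evicted
    buf.add(t, t % 4, 1.0, t + 1)
batch = buf.sample(32)
```

Each gradient step then trains the Q-network on `batch` rather than on the most recent, highly correlated transitions.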
Batch RL is a model-free and data-efficient technique that can learn to make decisions from large amounts of data. For high-dimensional RL problems, it is necessary to develop algorithms that select or learn features automatically from data. In this paper, we have surveyed recent progress in feature selection and feature learning for high-dimensional batch RL problems. Automatic feature selection techniques such as regularization, matching pursuit and random projection can select suitable features for batch RL algorithms from a set of features given by the designer. Unsupervised feature learning methods, such as manifold learning, spectral learning and deep learning, can learn representations or features directly from data, and thus hold great promise for high-dimensional RL. Combining unsupervised and supervised learning with RL promises an advanced intelligent control methodology. Furthermore, we have also surveyed recent theoretical progress in applying statistical machine learning to establish rigorous convergence and performance analysis for batch RL algorithms with function approximation architectures.
To further promote the development of RL, we believe the following directions need to be considered in the near future. Most existing batch RL methods assume that the action space is finite, but many real-world systems have continuous action spaces; when the action space is large or continuous, it is difficult to compute the greedy policy at each iteration. Therefore, it is important to develop RL algorithms that can solve MDPs with large or continuous action spaces. RL has a strong relationship with supervised and unsupervised learning, so it is quite appealing to bring more machine learning methods to RL problems. For example, there has been some research on combining transfer learning with RL [114], aiming to solve different tasks with transferred knowledge. When the training data set is large, the computational cost of batch RL algorithms becomes a serious problem, so it will be quite promising to parallelize existing RL algorithms in the framework of parallel or distributed computing to deal with large-scale problems; for example, the MapReduce framework [115] has been used to design parallel RL algorithms. Last but not least, it is important to apply batch RL algorithms based on feature selection or feature learning to real-world problems in power grids, transportation, health care, etc.
Historically, scientific research expeditions starting in the 19th century have provided occasional sections measuring deep ocean properties (Roemmich et al., 2012). Greater spatial and temporal coverage of temperatures down to about 700 m was obtained using expendable bathythermographs along commercial shipping tracks starting in the 1970s (Abraham et al., 2013). Since the early 2000s, thousands of autonomous profiling floats (Argo floats) have provided high-quality temperature and salinity profiles of the upper 2000 m in ice-free regions of the ocean (Abraham et al., 2013; Riser et al., 2016). Further advances in autonomous floats now allow them to operate in seasonally ice-covered oceans (Wong and Riser, 2011; Wong and Riser, 2013), and more recently to profile the entire depth of the water column down to 4000 or 6000 m (Johnson et al., 2015; Zilberman, 2017) and to include biogeochemical properties (Johnson et al., 2017). Autonomous floats have revolutionised the sampling and accuracy of the global ocean temperature and salinity records and increased certainty and confidence in global estimates of the earth heat (temperature) budget, particularly since 2004 (Von Schuckmann et al., 2014; Roemmich et al., 2015; Riser et al., 2016), as demonstrated by the convergence of observational estimates of the changes in the heat budget of the upper 2000 m (Figure 5.1). New findings using data collected from such observing platforms mark significant progress since AR5.
Knowledge limitations can include a lack of data (Sutton-Grier et al., 2015; Wigand et al., 2017; Romañach et al., 2018), for example, when an absence of baseline data may undermine coastline management (Perkins et al., 2015). Scale-relevant information may be required for local decision making (Robins et al., 2016; Thorne et al., 2017) and to comply with localised design requirements (Vikolainen et al., 2017). Other knowledge barriers include inherent uncertainties in models (Schaeffer-Novelli et al., 2016) and complexity of coastal systems (Wigand et al., 2017). A more nuanced knowledge barrier is the disconnect between scientific, community and decision making processes (Romañach et al., 2018).
DL, a subset of ML (Fig. 2), is inspired by the information processing patterns found in the human brain. DL does not require any human-designed rules to operate; rather, it uses a large amount of data to map the given input to specific labels. DL is designed using numerous layers of algorithms (artificial neural networks, or ANNs), each of which provides a different interpretation of the data that has been fed to them [18, 25].
Robustness: In general, precisely designed features are not required in DL techniques. Instead, the optimized features are learned in an automated fashion related to the task under consideration. Thus, robustness to the usual changes of the input data is attained.
In this technique, the learning process is based on semi-labeled datasets. Occasionally, generative adversarial networks (GANs) and DRL are employed in the same way as this technique. In addition, RNNs, including GRUs and LSTMs, are also employed for partially supervised learning. One advantage of this technique is that it minimizes the amount of labeled data needed. On the other hand, one disadvantage is that irrelevant input features present in the training data could lead to incorrect decisions. Text document classification is one of the most popular applications of semi-supervised learning: because obtaining a large amount of labeled text documents is difficult, semi-supervised learning is well suited to this task.
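One classic semi-supervised recipe consistent with the description above is self-training (pseudo-labeling): fit a classifier on the labeled documents, label the unlabeled pool with it, and refit on the enlarged set. The sketch below uses a toy nearest-centroid bag-of-words classifier; the function names, the tiny corpus and the choice of self-training are all illustrative assumptions, since the text does not prescribe a specific algorithm.

```python
from collections import Counter

def centroid(docs):
    # Normalized word-frequency profile of a class.
    c = Counter()
    for d in docs:
        c.update(d.split())
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def score(doc, cent):
    # Similarity of a document to a class centroid.
    return sum(cent.get(w, 0.0) for w in doc.split())

def self_train(labeled, unlabeled, rounds=2):
    """labeled: dict label -> list of documents. Each round pseudo-labels
    the unlabeled pool with the current centroids, then refits."""
    for _ in range(rounds):
        cents = {y: centroid(ds) for y, ds in labeled.items()}
        pseudo = {y: list(ds) for y, ds in labeled.items()}
        for doc in unlabeled:
            best = max(cents, key=lambda y: score(doc, cents[y]))
            pseudo[best].append(doc)
        labeled = pseudo
    return cents

cents = self_train(
    {"sport": ["goal match team"], "tech": ["gpu code model"]},
    ["team goal win", "train model code"],
)
```

After one round the pseudo-labeled documents sharpen the centroids, so far fewer hand-labeled documents are needed than in a fully supervised setup.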
An RvNN can make predictions in a hierarchical structure and classify outputs using compositional vectors [57]. Recursive auto-associative memory (RAAM) [58] was the primary inspiration for the development of the RvNN. The RvNN architecture is designed for processing objects with arbitrarily shaped structures, such as graphs or trees: it generates a fixed-width distributed representation from a variable-size recursive data structure. The network is trained using the back-propagation through structure (BTS) learning scheme [58]. BTS follows the same technique as the general back-propagation algorithm and is able to support tree-like structures; auto-association trains the network to regenerate the input-layer pattern at the output layer. RvNN is highly effective in the NLP context. Socher et al. [59] introduced an RvNN architecture designed to process inputs from a variety of modalities, and demonstrated two applications: classifying natural language sentences, where each sentence is split into words, and classifying natural images, where each image is separated into segments of interest. The RvNN calculates a plausibility score for merging every pair of units and constructs a syntactic tree by merging the pair with the largest score into a composition vector. After every merge, the RvNN generates (a) a larger region spanning multiple units, (b) a compositional vector for that region, and (c) a class label (for instance, a noun phrase becomes the class label of the new region if two units are noun words). The compositional vector for the entire region is the root of the RvNN tree structure. An example RvNN tree is shown in Fig. 5. RvNN has been employed in several applications [60,61,62].
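The greedy tree construction described above can be sketched as repeatedly merging the adjacent pair with the highest plausibility score until a single root remains. Here `score_fn` stands in for the learned scorer of [59]; the toy scorer and all names are purely illustrative.

```python
def build_tree(leaves, score_fn):
    """Greedy RvNN-style parse: at each step, merge the adjacent pair
    of units with the largest score into one composite node."""
    nodes = list(leaves)
    while len(nodes) > 1:
        i = max(range(len(nodes) - 1),
                key=lambda j: score_fn(nodes[j], nodes[j + 1]))
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]  # composite region
    return nodes[0]

# Toy scorer: prefer merging the shortest (word-like) units first.
tree = build_tree(["the", "cat", "sat"],
                  lambda a, b: -(len(str(a)) + len(str(b))))
```

In the real architecture the merge also produces a compositional vector and a class label for the new region; this sketch keeps only the tree-building control flow.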