
Asynchronous training of quantum reinforcement learning

by   Samuel Yen-Chi Chen, et al.

The development of quantum machine learning (QML) has received a lot of interest recently thanks to developments in both quantum computing (QC) and machine learning (ML). One of the ML paradigms that can be utilized to address challenging sequential decision-making problems is reinforcement learning (RL). It has been demonstrated that classical RL can successfully complete many difficult tasks. A leading method of building quantum RL (QRL) agents relies on variational quantum circuits (VQCs). However, training QRL algorithms with VQCs requires a significant amount of computational resources. This issue hinders the exploration of various QRL applications. In this paper, we approach this challenge through asynchronous training of QRL agents. Specifically, we choose the asynchronous training of advantage actor-critic variational quantum policies. We demonstrate via numerical simulations that, within the tasks considered, the asynchronous training of QRL agents can reach performance comparable to or superior to that of classical agents with similar model sizes and architectures.



I Introduction

Quantum computing (QC) has been posited as a means of achieving computational superiority for certain tasks that classical computers struggle to solve Nielsen and Chuang (2010). Despite this potential, the lack of error-correction in current quantum computers has made it challenging to effectively implement complex quantum circuits on these “noisy intermediate-scale quantum” (NISQ) devices Preskill (2018). To harness the quantum advantages offered by NISQ devices, the development of specialized quantum circuit architectures is necessary.

Recent advances in the hybrid quantum-classical computing framework Bharti et al. (2022) utilize both classical and quantum computing. Under this approach, certain computational tasks that are expected to benefit from quantum processing are executed on a quantum computer, while others, such as gradient calculations, are performed on classical computers. This hybrid approach aims to take advantage of the strengths of both types of computing to address a wide range of tasks. Hybrid algorithms that utilize variational quantum circuits (VQCs) have proven to be effective in a variety of machine learning tasks. VQCs are a subclass of quantum circuits that possess tunable parameters, and their incorporation into QML models has demonstrated success in a wide range of tasks Bharti et al. (2022); Cerezo et al. (2021).

Reinforcement learning (RL) is a branch of machine learning that deals with sequential decision-making tasks. Deep neural network-based RL has achieved remarkable results in complicated tasks, with human-level Mnih et al. (2015) or super-human performance Silver et al. (2017). However, quantum RL is a developing field with many unresolved issues and challenges. The majority of existing quantum RL models are based on VQCs Chen et al. (2020); Lockwood and Si (2020); Skolik et al. (2022); Jerbi et al. (2021); Hsiao et al. (2022). Although these models have been shown to perform well in a variety of benchmark tasks, training them requires a significant amount of computational resources. The long training time limits the exploration of quantum RL’s broad application possibilities. We propose an asynchronous training framework for quantum RL agents in this paper. We focus on the asynchronous training of advantage actor-critic quantum policies using multiple instances of agents running in parallel.

We show, using numerical simulations, that quantum models may outperform or be similar to classical models in the various benchmark tasks considered. Furthermore, the suggested training approach has the practical advantage of requiring significantly less time for training, allowing for more quantum RL applications.

The structure of this paper is as follows: In Section II, we provide an overview of relevant prior work and compare our proposal to these approaches. In Section III, we provide a brief overview of the necessary background in reinforcement learning. In Section IV, we introduce the concept of variational quantum circuits (VQCs), which serve as the building blocks of our quantum reinforcement learning agents. In Section V, we present our proposed quantum A3C framework. In Section VI, we describe our experimental setup and present our results. Finally, in Section VII, we offer some concluding remarks.

II Relevant Works

The work that gave rise to quantum reinforcement learning (QRL) Meyer et al. (2022b) may be traced back to Dong et al. (2008). However, that framework demands a quantum environment, a requirement that most real-world situations may not meet. Further studies utilizing Grover-like methods include Wiedemann et al. (2022); Sannia et al. (2022). Quantum linear system solvers are also used to implement quantum policy iteration Cherrat et al. (2022). We will concentrate on recent advancements in VQC-based QRL dealing with classical environments. The first VQC-based QRL Chen et al. (2020), which is the quantum version of deep $Q$-learning (DQN), considers discrete observation and action spaces in testing environments such as Frozen-Lake and Cognitive-Radio. Later, more sophisticated efforts in the area of quantum DQN take into account continuous observation spaces like Cart-Pole Lockwood and Si (2020); Skolik et al. (2022). A further development along this direction is the use of quantum recurrent neural networks such as QLSTM as the value function approximator Chen (2022) to tackle challenges such as partial observability or environments requiring longer memory of previous steps. Various methods, such as a hybrid quantum-classical linear solver, have been developed to find value functions CHEN et al. (2020). Improvements to DQN that aid agent convergence, such as Double DQN (DDQN), have also been implemented within the VQC framework in Heimann et al. (2022), where the authors apply QRL to a robot navigation task.

Recent advances in QRL have led to the development of frameworks that aim to learn policy functions, denoted as $\pi$, directly. These frameworks are able to learn the optimal policy for a given problem, in addition to learning value functions such as the $Q$-function. For example, Jerbi et al. (2021) describes quantum policy gradient RL through the use of the REINFORCE algorithm. Hsiao et al. (2022) then consider an improved policy gradient algorithm called PPO with VQCs and show that even with a small number of parameters, quantum models can outperform their classical counterparts. Provable quantum advantages of policy gradient are shown in Jerbi et al. (2022). Additional research, such as Meyer et al. (2022a), has explored the impact of various post-processing methods for VQCs on the performance of quantum policy gradients. Several improved quantum policy gradient algorithms have been proposed in recent years, including actor-critic Schenk et al. (2022) and soft actor-critic (SAC) Lan (2021); Acuto et al. (2022). These modifications seek to further improve the efficiency and effectiveness of QRL methods. QRL has also been applied to the field of quantum control Sequeira et al. (2022) and has been extended to the multi-agent setting Yun et al. (2022a); Yan et al. (2022); Yun et al. (2022b). Chen et al. (2022b) were the first to explore the use of evolutionary optimization for QRL. In their work, multiple agents were initialized and run in parallel, with the top-performing agents being selected as parents to generate the next generation of agents. In Wu et al. (2020), the authors studied the use of advanced quantum policy gradient methods, such as the deep deterministic policy gradient (DDPG) algorithm, for QRL in continuous action spaces.

In this work, we build upon previous research on quantum policy gradient Jerbi et al. (2021); Hsiao et al. (2022); Schenk et al. (2022) by introducing an asynchronous training method for quantum policy learning. While previous approaches have employed single-threaded training, our method utilizes an asynchronous approach, which may offer practical benefits such as reduced training time through the use of multi-core CPU computing resources and the potential for utilizing multiple quantum processing units (QPUs) in the future. Our approach shares some similarities with the evolutionary QRL method presented in Chen et al. (2022b), which also utilizes parallel computing resources. However, our approach differs in that individual agents asynchronously upload their gradients directly to the shared global model, rather than waiting for all agents to finish before calculating fitness and creating the next generation of agents. This characteristic may further improve the efficiency of the training process. These contributions represent a novel advancement in the field of quantum reinforcement learning.

III Reinforcement Learning

Reinforcement learning (RL) is a machine learning framework in which an agent learns to accomplish a given goal by interacting with an environment in discrete time steps Sutton and Barto (2018). The agent observes a state $s_t$ at each time step $t$ and then chooses an action $a_t$ from the action space $\mathcal{A}$ based on its current policy $\pi$. The policy is a mapping from a specific state $s_t$ to the probabilities of choosing one of the actions in $\mathcal{A}$. After performing the action $a_t$, the agent gets a scalar reward $r_t$ and the state of the following time step $s_{t+1}$ from the environment. For episodic tasks, the procedure is repeated across a number of time steps until the agent reaches the terminal state or the maximum number of steps permitted. Seeing the state $s_t$ along the training process, the agent aims to maximize the expected return, which can be expressed as the value function at state $s$ under policy $\pi$, $V^\pi(s) = \mathbb{E}[R_t \mid s_t = s]$, where $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the return, the total discounted reward from time step $t$, and $\gamma$ is the discount factor. The value function can be further expressed as $V^\pi(s) = \sum_{a \in \mathcal{A}} Q^\pi(s, a)\, \pi(a \mid s)$, where the action-value function or Q-value function $Q^\pi(s, a) = \mathbb{E}[R_t \mid s_t = s, a]$ is the expected return of choosing an action $a$ in state $s$ according to the policy $\pi$. $Q$-learning is an RL algorithm that optimizes $Q$ via the following update, with learning rate $\alpha$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
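As a rough illustration of the $Q$-learning rule, the sketch below applies one standard tabular update, $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$. The table layout, the function name `q_learning_update`, and the numbers are hypothetical and chosen only for the example; they are not taken from the paper.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning step to q[state][action] and return the new value."""
    # Greedy estimate of the value of the next state.
    best_next = max(q[next_state])
    # Temporal-difference error between the bootstrapped target and the current estimate.
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Hypothetical two-state, two-action table used only for illustration.
q_table = {0: [0.0, 0.0], 1: [1.0, 2.0]}
updated = q_learning_update(
    q_table, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9
)
```

With these illustrative numbers the target is $1 + 0.9 \times 2 = 2.8$, so the entry moves halfway from 0 toward 2.8.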


In contrast to value-based reinforcement learning techniques, such as $Q$-learning, which rely on learning a value function and using it to guide decision-making at each time step, policy gradient methods focus on directly optimizing a policy function, denoted as $\pi(a \mid s; \theta)$, parametrized by $\theta$. The parameters $\theta$ are updated through a gradient ascent procedure on the expected total return, $\mathbb{E}[R_t]$. A notable example of a policy gradient algorithm is the REINFORCE algorithm, introduced in Williams (1992). In the standard REINFORCE algorithm, the parameters $\theta$ are updated along the direction $\nabla_\theta \log \pi(a_t \mid s_t; \theta) R_t$, which is an unbiased estimate of $\nabla_\theta \mathbb{E}[R_t]$. However, this policy gradient estimate often suffers from high variance, making training difficult. To reduce the variance of this estimate while maintaining its unbiasedness, a term known as the baseline can be subtracted from the return. This baseline, denoted as $b_t(s_t)$, is a learned function of the state $s_t$. The resulting update becomes $\nabla_\theta \log \pi(a_t \mid s_t; \theta) (R_t - b_t(s_t))$. A common choice for the baseline $b_t(s_t)$ in RL is an estimate of the value function $V^\pi(s_t)$. Using this choice for the baseline often results in a lower variance estimate of the policy gradient Sutton and Barto (2018). The quantity $R_t - b_t = A(s_t, a_t)$ can be interpreted as the advantage of action $a_t$ at state $s_t$. Intuitively, the advantage can be thought of as the “goodness or badness” of action $a_t$ relative to the average value at state $s_t$. This approach is known as the advantage actor-critic (A2C) method, where the policy $\pi$ is the actor and the baseline, which is the value function $V$, is the critic Sutton and Barto (2018).
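The return and advantage used by the actor-critic update can be sketched in a few lines. This is a minimal illustration, assuming the return is the discounted sum of rewards and the advantage is the return minus a value estimate used as the baseline; the function names and the sample numbers are hypothetical.

```python
def discounted_returns(rewards, gamma, bootstrap=0.0):
    """Return R_t for every step, accumulating rewards backwards in time."""
    returns = []
    running = bootstrap  # value of the state after the last step (0 if terminal)
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage estimate: return minus the critic's value (the baseline)."""
    return [ret - val for ret, val in zip(returns, values)]


# Illustrative three-step episode with a made-up constant critic output.
rets = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
advs = advantages(rets, [1.0, 1.0, 1.0])
```

A positive advantage marks an action that did better than the critic expected, which is what the actor's gradient step reinforces.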

The asynchronous advantage actor-critic (A3C) algorithm Mnih et al. (2016) is a variant of the A2C method that employs multiple concurrent actors to learn the policy through parallelization. Asynchronous training of RL agents involves executing multiple agents on multiple instances of the environment, allowing the agents to encounter diverse states at any given time step. This diminished correlation between states or observations enhances the numerical stability of on-policy RL algorithms such as actor-critic Mnih et al. (2016). Furthermore, asynchronous training does not require the maintenance of a large replay memory, thus reducing memory requirements Mnih et al. (2016). By harnessing the advantages and gradients computed by a pool of actors, A3C exhibits impressive sample efficiency and robust learning performance, making it a prevalent choice in the realm of reinforcement learning.

IV Variational Quantum Circuit

Variational quantum circuits (VQCs), also referred to as parameterized quantum circuits (PQCs), are a class of quantum circuits that contain tunable parameters. These parameters can be optimized using various techniques from classical machine learning, including gradient-based and non-gradient-based methods. A generic illustration of a VQC is in the central part of Figure 1.

The three primary components of a VQC are the encoding circuit, the variational circuit, and the quantum measurement layer. The encoding circuit, denoted as $U(\mathbf{x})$, transforms classical values into a quantum state, while the variational circuit, denoted as $V(\boldsymbol{\theta})$, serves as the learnable part of the VQC. The quantum measurement layer, on the other hand, is utilized to extract information from the circuit. It is common practice to execute the circuit repeatedly (each execution is called a “shot”) in order to obtain the expectation value of each qubit. A common choice is to use the Pauli-$Z$ expectation values. These values are floats rather than binary integers. Additionally, other components, such as further VQCs or classical components such as deep neural networks (DNNs), can process the values obtained from the circuit.
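To make the role of shots concrete, the sketch below estimates a Pauli-$Z$ expectation value from repeated single-qubit measurements and compares it with the exact value. It assumes a one-qubit state $R_y(\theta)|0\rangle$, for which the probability of measuring 0 is $\cos^2(\theta/2)$ and $\langle Z \rangle = \cos\theta$; the function names, the seed, and the shot count are hypothetical, and no quantum library is used.

```python
import math
import random


def exact_z_expectation(theta):
    """Exact <Z> for the state R_y(theta)|0>."""
    return math.cos(theta)


def sampled_z_expectation(theta, shots, seed=0):
    """Estimate <Z> by simulating `shots` measurements of R_y(theta)|0>."""
    rng = random.Random(seed)
    prob_zero = math.cos(theta / 2) ** 2
    total = 0
    for _ in range(shots):
        # Outcome 0 contributes eigenvalue +1, outcome 1 contributes -1.
        total += 1 if rng.random() < prob_zero else -1
    return total / shots


exact = exact_z_expectation(1.0)
estimate = sampled_z_expectation(1.0, shots=20000)
```

Each shot yields a binary outcome, but the average over many shots is a real number in $[-1, 1]$, which is why the measured values are floats.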

The VQC can operate with other classical components such as tensor networks (TN) Chen et al. (2022b, 2021); Qi et al. (2021) and deep neural networks (NN) to perform data pre-processing such as dimensional reduction or post-processing such as scaling. We call such a VQC a dressed VQC, as shown in Figure 1. The whole model can be trained in an end-to-end manner via gradient-based Chen et al. (2021); Qi et al. (2021) or gradient-free methods Chen et al. (2022b). For the gradient-based methods, the whole model can be represented as a directed acyclic graph (DAG) and then back-propagation can be applied. The success of such end-to-end optimization relies on the capability of calculating quantum gradients, for example with the parameter-shift rule Mitarai et al. (2018). VQC-based QML models have shown success in areas such as classification Mitarai et al. (2018); Qi et al. (2021); Chehimi and Saad (2022); Chen and Yoo (2021); Chen et al. (2021), natural language processing Yang et al. (2021, 2022); Di Sipio et al. (2022) and sequence modeling Chen et al. (2022d, a).
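The parameter-shift rule can be illustrated on the simplest possible case. The sketch below assumes a single rotation $R_y(\theta)$ acting on $|0\rangle$, whose Pauli-$Z$ expectation is $\cos\theta$, and evaluates the gradient from two shifted circuit evaluations at $\theta \pm \pi/2$. The function names are hypothetical and the expectation is computed analytically rather than on a simulator or device.

```python
import math


def expectation(theta):
    """<Z> for R_y(theta)|0>, standing in for one circuit evaluation."""
    return math.cos(theta)


def parameter_shift_gradient(circuit, theta):
    """Gradient of a rotation-gate expectation from two evaluations shifted by pi/2."""
    shift = math.pi / 2
    return (circuit(theta + shift) - circuit(theta - shift)) / 2.0


grad = parameter_shift_gradient(expectation, 0.3)
```

Unlike a finite-difference estimate, the two evaluations are far apart and the result is exact for this gate, here $-\sin\theta$.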

Figure 1: Hybrid variational quantum circuit (VQC) architecture. The hybrid VQC architecture includes a VQC and classical neural networks (NN) before and after the VQC. NN can be used to reduce the dimensionality of the input data and refine the outputs from the VQC.

V Quantum A3C

The proposed quantum asynchronous advantage actor-critic (QA3C) framework consists of two main components: a global shared memory and process-specific memories for each agent. The global shared memory maintains the dressed VQC policy and value parameters, which are modified when an individual process uploads its own gradients for parameter updates. Each agent has its own process-specific memory that maintains local dressed VQC policy and value parameters. These local models are used to generate actions during an episode within individual processes. When certain criteria are met, the gradients of the local model parameters are uploaded to the global shared memory, and the global model parameters are modified accordingly. The updated global model parameters are then immediately downloaded to the local agent that just uploaded its own gradients. The overall concept of QA3C is depicted in Figure 2.
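The upload and download cycle described above can be sketched with a toy example. This is a simplified illustration under several assumptions that are not from the paper: Python threads stand in for the parallel processes, a scalar parameter with the quadratic loss $(\theta - 3)^2$ stands in for the dressed VQC policy and value models, and plain gradient descent replaces the optimizer. All names are hypothetical.

```python
import threading

TARGET = 3.0         # minimiser of the toy loss (theta - TARGET)^2
LEARNING_RATE = 0.1  # illustrative step size


class GlobalMemory:
    """Global shared parameter, updated by whichever worker uploads a gradient."""

    def __init__(self, theta):
        self.theta = theta
        self.lock = threading.Lock()

    def upload_and_download(self, grad):
        # Apply the uploaded gradient, then hand the fresh parameter back to the uploader.
        with self.lock:
            self.theta -= LEARNING_RATE * grad
            return self.theta


def toy_gradient(theta):
    return 2.0 * (theta - TARGET)


def worker(memory, steps):
    with memory.lock:
        local_theta = memory.theta  # process-specific copy
    for _ in range(steps):
        # The gradient is computed on the local copy, which may be stale.
        grad = toy_gradient(local_theta)
        local_theta = memory.upload_and_download(grad)


def train(num_workers=4, steps=300):
    memory = GlobalMemory(theta=0.0)
    threads = [
        threading.Thread(target=worker, args=(memory, steps))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return memory.theta


final_theta = train()
```

No worker waits for the others: each one uploads as soon as its gradient is ready and continues from the refreshed parameter, which is the property that distinguishes this scheme from generation-based parallel training.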

Figure 2: Quantum asynchronous advantage actor-critic (A3C) learner. The proposed quantum A3C includes global shared parameters and multiple parallel workers. The action generation process within each local agent is performed using the dressed VQC policy and value functions stored in the process-specific memories. Upon meeting certain criteria, the gradients of the local model parameters are uploaded to the global shared memory, where the global model parameters are updated. The updated global model parameters are then immediately downloaded to the local agent that just uploaded its own gradients.

We construct the quantum policy and value function with the dressed VQC as shown in Figure 1, in which the VQC follows the architecture shown in Figure 3. This VQC architecture has been studied in works such as quantum recurrent neural networks (QRNN) Chen et al. (2022d), quantum recurrent RL Chen (2022), quantum convolutional neural networks Chen et al. (2022c), and federated quantum classification Chen and Yoo (2021), and has demonstrated superior performance over its classical counterparts under certain conditions. In addition, we employ classical DNNs before and after the VQC to reduce the dimensionality of the data and to fine-tune the outputs from the VQC, respectively. The neural network components in this hybrid architecture consist of single-layer networks for dimensionality conversion. Specifically, the network preceding the VQC is a linear layer with an input dimension equal to the size of the observation vector and an output dimension equal to the number of qubits in the VQC. The networks following the VQC are linear layers with input dimensions equal to the number of qubits in the VQC and output dimensions equal to the number of actions (for the actor function $\pi$) or 1 (for the critic function $V$). These layers serve to convert the output of the VQC for use in the actor-critic algorithm. The policy and value function are updated after a fixed number of steps (the global update parameter in Algorithm 1) or when the agent reaches the terminal state. The details of the algorithm, such as the gradient update formulas, are presented in Algorithm 1.

[Circuit diagram: each of the four qubits starts in $|0\rangle$ and passes through $H$, $R_y(\arctan(x_i))$ and $R_z(\arctan(x_i^2))$, followed by CNOT gates that entangle the qubits, a general rotation $R(\alpha_i, \beta_i, \gamma_i)$, and a measurement.]
Figure 3: VQC architecture for quantum A3C. The VQC used here includes $R_y$ and $R_z$ rotations for encoding classical values $x$, multiple CNOT gates to entangle qubits, general unitary rotations $R$ and the final measurement. The output of the VQC consists of Pauli-$Z$ expectation values, which are obtained through multiple runs (shots) of the circuit. These values are then processed by classical neural networks for further use. We use a 4-qubit system as an example here; however, it can be enlarged or shrunk based on the problem of interest. In this work, the number of qubits is 8.
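The encoding stage of this circuit can be traced by hand for a single qubit. The sketch below applies $H$, $R_y(\arctan(x))$ and $R_z(\arctan(x^2))$ to $|0\rangle$ and reads out the Pauli-$Z$ expectation. It assumes the common conventions $R_y(\phi) = e^{-i\phi Y/2}$ and $R_z(\phi) = e^{-i\phi Z/2}$, omits the entangling and variational layers, and uses hypothetical function names.

```python
import cmath
import math


def encoding_angles(x):
    """Rotation angles used to encode one classical value x."""
    return math.atan(x), math.atan(x * x)


def encode_single_qubit(x):
    """Amplitudes of R_z(arctan(x^2)) R_y(arctan(x)) H |0>."""
    ry_angle, rz_angle = encoding_angles(x)
    # Hadamard on |0> gives an equal superposition.
    amp0 = amp1 = 1.0 / math.sqrt(2.0)
    # R_y rotation (real 2x2 matrix).
    cos_half, sin_half = math.cos(ry_angle / 2), math.sin(ry_angle / 2)
    amp0, amp1 = cos_half * amp0 - sin_half * amp1, sin_half * amp0 + cos_half * amp1
    # R_z rotation (opposite phases on the two amplitudes).
    amp0 = cmath.exp(-1j * rz_angle / 2) * amp0
    amp1 = cmath.exp(1j * rz_angle / 2) * amp1
    return amp0, amp1


def pauli_z_expectation(state):
    amp0, amp1 = state
    return abs(amp0) ** 2 - abs(amp1) ** 2


state = encode_single_qubit(1.0)
z_value = pauli_z_expectation(state)
```

The arctan maps any real input to a bounded angle, so unbounded observation values still produce valid rotations.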

VI Experiments and Results

VI.1 Testing Environments

VI.1.1 Acrobot

The Acrobot environment from OpenAI Gym Brockman et al. (2016) consists of a system with two linearly connected links, with one end fixed. The joint connecting the two links can be actuated by applying torques. The goal is to swing the free end of the chain over a predetermined height, starting from a downward hanging position, using as few steps as possible. The observation in this environment is a six-dimensional vector comprising the sine and cosine values of the two rotational joint angles, as well as their angular velocities. The agents are able to take one of three actions: applying $-1$, $0$, or $+1$ torque to the actuated joint. An action resulting in the free end reaching the target height receives a reward of $0$ and terminates the episode. Any action that does not lead to the desired height receives a reward of $-1$. The reward threshold is $-100$.

Figure 4: The Acrobot environment from OpenAI Gym.

VI.1.2 Cart-Pole

Cart-Pole is a commonly used evaluation environment for simple RL models and is a standard example in OpenAI Gym Brockman et al. (2016) (see Figure 5). In this environment, a fixed junction connects a pole to a cart traveling horizontally over a frictionless track. The pendulum initially stands upright, and the aim is to keep it as near to its starting position as possible by moving the cart left and right. At each time step, the RL agent learns to produce the right action according to the observation it gets. The observation in this environment is a four-dimensional vector containing the cart position, cart velocity, pole angle, and pole velocity at the tip. Every time step where the pole is near to being upright results in a $+1$ reward. An episode ends if the pole is inclined more than $12$ degrees from vertical or the cart moves more than $2.4$ units away from the center.

Figure 5: The Cart-Pole environment from OpenAI Gym.

VI.1.3 MiniGrid-SimpleCrossing

The MiniGrid-SimpleCrossing environment Chevalier-Boisvert et al. (2018) is more sophisticated, with a much larger observation input for the RL agent. In this scenario, the RL agent receives a $147$-dimensional vector through observation and must choose an action from the action space $\mathcal{A}$, which offers six options. It is important to note that the $147$-dimensional vector is a compact and efficient representation of the environment rather than the real pixels. The six actions $0, \dots, 5$ in the action space are turn left, turn right, move forward, pick up an object, drop the object being carried, and toggle. Only the first three have an actual effect in this case, and the agent is expected to learn this fact. In this environment, the agent receives a reward of 1 upon reaching the goal. A penalty is subtracted from this reward based on the formula $0.9 \times (\text{number of steps} / \text{maximum number of steps})$, where the maximum number of steps allowed is defined as $4 \times n \times n$, and $n$ is the grid size Chevalier-Boisvert et al. (2018). In this work, $n$ is set to 9. This reward scheme presents a challenge because it is sparse, meaning that the agent does not receive rewards until it reaches the goal. As shown in Figure 6, the agent (shown as a red triangle) is expected to find the shortest path from the starting point to the goal (shown in green). We consider three cases in this environment: MiniGrid-SimpleCrossingS9N1-v0, MiniGrid-SimpleCrossingS9N2-v0 and MiniGrid-SimpleCrossingS9N3-v0. Here $N$ represents the number of valid crossings across walls from the starting position to the goal.
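The sparse reward can be written out as a small function. This sketch assumes the penalty used by the MiniGrid reference implementation, $0.9 \times (\text{steps taken} / \text{maximum steps})$ with a maximum of $4n^2$ steps for grid size $n$; the function names are hypothetical and the code does not call the MiniGrid package.

```python
def max_steps(grid_size):
    """Step limit per episode, assumed to be 4 * n * n for grid size n."""
    return 4 * grid_size * grid_size


def success_reward(steps_taken, grid_size):
    """Reward received on reaching the goal after `steps_taken` steps."""
    return 1.0 - 0.9 * (steps_taken / max_steps(grid_size))


limit = max_steps(9)
fast_reward = success_reward(36, grid_size=9)
slow_reward = success_reward(limit, grid_size=9)
```

Under these assumptions a 9 by 9 grid allows 324 steps, a goal reached in 36 steps earns about 0.9, and a goal reached on the last permitted step still earns about 0.1, so any success is distinguishable from failure.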

Figure 6: The SimpleCrossing environment from MiniGrid, showing the three MiniGrid-SimpleCrossing environments we consider in this work. In each environment, there are also walls which span one unit on each side (not shown in the figure). (a), (b) and (c) represent examples from the MiniGrid-SimpleCrossingS9N1-v0, MiniGrid-SimpleCrossingS9N2-v0 and MiniGrid-SimpleCrossingS9N3-v0 environments, respectively.

VI.2 Hyperparameters and Model Size

In the proposed QA3C, we use the Adam optimizer with learning rate , and . The local agents will update the parameters with the global shared memory every steps. The discount factor is set to be . For the VQC, we set the number of qubits to 8 and use two variational layers. Each general rotation carries three parameters, so each VQC has $8 \times 3 \times 2 = 48$ quantum parameters. The actor and the critic each have their own VQC, so the total number of quantum parameters is 96. The VQC architecture is the same across all testing environments considered in this work. As described in Section V, single-layer networks are used before and after the VQC to convert the dimensions of the data. The networks preceding the VQC have input dimensions determined by the environment that the agent is to solve. For the classical benchmarks, we consider models that are very similar to the dressed VQC model. Specifically, we keep the architecture of the classical model similar to the one presented in Figure 1 but replace the 8-qubit VQC with a single layer whose input and output dimensions both equal 8. This makes the architecture very similar to the quantum model, and the numbers of parameters are also very close. We summarize the number of parameters in Table 1.

                 QA3C                            Classical
Environment      Classical   Quantum   Total     Total
Acrobot          148         96        244       292
Cart-Pole        107         96        203       251
SimpleCrossing   2431        96        2527      2575
Table 1: Number of parameters. We provide details on the number of parameters in the proposed QA3C model, which includes both quantum and classical components. The classical benchmarks were designed with architectures similar to the quantum model, resulting in similar model sizes.
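The entries of Table 1 can be reproduced with a short calculation. The sketch below rests on assumptions inferred from the text rather than stated in it: every linear layer carries a bias, the actor and the critic each have their own input layer, and the observation and action dimensions are 6 and 3 for Acrobot, 4 and 2 for Cart-Pole, and 147 and 6 for SimpleCrossing. The function names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def classical_params(obs_dim, n_actions, n_qubits=8):
    """Classical layers around the two VQCs (actor and critic)."""
    actor = linear_params(obs_dim, n_qubits) + linear_params(n_qubits, n_actions)
    critic = linear_params(obs_dim, n_qubits) + linear_params(n_qubits, 1)
    return actor + critic


def quantum_params(n_qubits=8, n_layers=2, params_per_rotation=3):
    """Rotation angles in the actor VQC plus the critic VQC."""
    return 2 * n_qubits * n_layers * params_per_rotation


def table_row(obs_dim, n_actions, n_qubits=8):
    classical = classical_params(obs_dim, n_actions, n_qubits)
    quantum = quantum_params(n_qubits)
    # The classical baseline swaps each VQC for an 8-to-8 linear layer.
    baseline_total = classical + 2 * linear_params(n_qubits, n_qubits)
    return classical, quantum, classical + quantum, baseline_total


# (observation dimension, number of actions), assumed per environment.
ENVIRONMENTS = {"Acrobot": (6, 3), "Cart-Pole": (4, 2), "SimpleCrossing": (147, 6)}
rows = {name: table_row(*dims) for name, dims in ENVIRONMENTS.items()}
```

Under these assumptions the computed rows agree with the table, and the gap between the two totals is always 48: two 8-to-8 linear layers hold 144 parameters, against 96 rotation angles in the two VQCs.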

We utilize the open-source PennyLane package Bergholm et al. (2018) to construct the quantum circuit models and PyTorch as the overall machine learning framework. The number of CPU cores, and hence the number of parallel agents, is 80 in this work. We present simulation results in which the scores from the past 100 episodes are averaged.

VI.3 Results

VI.3.1 Acrobot

We begin by evaluating the performance of our models on the Acrobot environment. The simulation results of this experiment are presented in Figure 7. The total number of episodes was 100,000. As shown in the figure, the quantum model exhibits a gradual improvement during the early training episodes, while the classical model struggles to improve its policy. In terms of average score, the quantum model demonstrates superior performance compared to the classical model. Furthermore, the quantum model exhibits a more stable convergence pattern, without significant fluctuations or collapses after reaching optimal scores. These results suggest that the quantum model may be more robust and reliable in this environment.

Figure 7: Results: Quantum A3C in the Acrobot environment.

VI.3.2 Cart-Pole

The next experiment was conducted in the Cart-Pole environment. The total number of episodes was 100,000. As illustrated in Figure 8, the quantum model achieved significantly higher scores than the classical model. While the classical model demonstrated faster learning in the early training episodes, the quantum model eventually surpassed it and reached superior scores. These results suggest that the quantum model may be more effective in this environment.

Figure 8: Results: Quantum A3C in the Cart-Pole environment.

VI.3.3 MiniGrid-SimpleCrossing

The final experiment was conducted in the MiniGrid-SimpleCrossing environment, with a total of 100,000 episodes. As depicted in Figure 9, the quantum model outperformed the classical model in two of the three scenarios, MiniGrid-SimpleCrossingS9N2-v0 and MiniGrid-SimpleCrossingS9N3-v0, demonstrating faster convergence and higher scores. In the remaining scenario, MiniGrid-SimpleCrossingS9N1-v0, the difference in performance between the two models was minor.

Figure 9: Results: Quantum A3C in the MiniGrid-SimpleCrossing environment.

VII Conclusion

In this study, we demonstrate the effectiveness of an asynchronous training framework for quantum RL agents. Through numerical simulations, we show that in the benchmark tasks considered, advantage actor-critic quantum policies trained asynchronously can outperform or match the performance of classical models with similar architecture and sizes. This technique affords a strategy for expediting the training of quantum RL agents through parallelization, and may have potential applications in various real-world scenarios.

The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.

Appendix A Algorithms

A.1 Quantum-A3C

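Algorithm 1 Quantum asynchronous advantage actor-critic learning (algorithm for each actor-learner process)

Define the global update parameter $L$
Assume global shared hybrid VQC policy parameter $\theta$
Assume global shared hybrid VQC value parameter $\theta_v$
Assume global shared episode counter $T = 0$
Assume process-specific hybrid VQC policy parameter $\theta'$
Assume process-specific hybrid VQC value parameter $\theta'_v$
Initialize process-specific counter $t = 1$
while $T < T_{\max}$ do
    Reset gradients $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$
    Set $t_{\text{start}} = t$
    Reset the environment and get state $s_t$
    while $s_t$ is non-terminal and $t - t_{\text{start}}$ is below the maximum number of steps per episode do
        Perform $a_t$ according to policy $\pi(a_t \mid s_t; \theta')$
        Receive reward $r_t$ and the new state $s_{t+1}$
        Update process-specific counter $t \leftarrow t + 1$
        if $t \bmod L = 0$ or reach terminal state then
            Set $R = 0$ for terminal $s_t$, or $R = V(s_t; \theta'_v)$ for non-terminal $s_t$
            for each step $i$ since the last update, from the most recent to the earliest, do
                $R \leftarrow r_i + \gamma R$
                Accumulate gradients wrt $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta') (R - V(s_i; \theta'_v))$
                Accumulate gradients wrt $\theta'_v$: $d\theta_v \leftarrow d\theta_v + \partial (R - V(s_i; \theta'_v))^2 / \partial \theta'_v$
            end for
            Perform asynchronous update of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$
            Update process-specific parameters from global parameters: $\theta' \leftarrow \theta$ and $\theta'_v \leftarrow \theta_v$
        end if
    end while
end while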


  • A. Acuto, P. Barillà, L. Bozzolo, M. Conterno, M. Pavese, and A. Policicchio (2022) Variational quantum soft actor-critic for robotic arm control. arXiv preprint arXiv:2212.11681. Cited by: §II.
  • V. Bergholm, J. Izaac, M. Schuld, C. Gogolin, C. Blank, K. McKiernan, and N. Killoran (2018) Pennylane: automatic differentiation of hybrid quantum-classical computations. arXiv preprint arXiv:1811.04968. Cited by: §VI.2.
  • K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke, et al. (2022) Noisy intermediate-scale quantum algorithms. Reviews of Modern Physics 94 (1), pp. 015004. Cited by: §I.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §VI.1.1, §VI.1.2.
  • M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, et al. (2021) Variational quantum algorithms. Nature Reviews Physics 3 (9), pp. 625–644. Cited by: §I.
  • M. Chehimi and W. Saad (2022) Quantum federated learning with quantum data. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8617–8621. Cited by: §IV.
  • C. CHEN, K. SHIBA, M. SOGABE, K. SAKAMOTO, and T. SOGABE (2020) Hybrid quantum-classical ulam-von neumann linear solver-based quantum dynamic programing algorithm. Proceedings of the Annual Conference of JSAI JSAI2020, pp. 2K6ES203–2K6ES203. Cited by: §II.
  • S. Y. Chen, D. Fry, A. Deshmukh, V. Rastunkov, and C. Stefanski (2022a) Reservoir computing via quantum recurrent neural networks. arXiv preprint arXiv:2211.02612. Cited by: §IV.
  • S. Y. Chen, C. Huang, C. Hsing, H. Goan, and Y. Kao (2022b) Variational quantum reinforcement learning via evolutionary optimization. Machine Learning: Science and Technology 3 (1), pp. 015025. Cited by: §II, §II, §IV.
  • S. Y. Chen, C. Huang, C. Hsing, and Y. Kao (2021) An end-to-end trainable hybrid classical-quantum classifier. Machine Learning: Science and Technology 2 (4), pp. 045021. Cited by: §IV.
  • S. Y. Chen, T. Wei, C. Zhang, H. Yu, and S. Yoo (2022c) Quantum convolutional neural networks for high energy physics data analysis. Physical Review Research 4 (1), pp. 013231. Cited by: §V.
  • S. Y. Chen, C. H. Yang, J. Qi, P. Chen, X. Ma, and H. Goan (2020) Variational quantum circuits for deep reinforcement learning. IEEE Access 8, pp. 141007–141024. Cited by: §I, §II.
  • S. Y. Chen, S. Yoo, and Y. L. Fang (2022d) Quantum long short-term memory. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8622–8626. Cited by: §IV, §V.
  • S. Y. Chen and S. Yoo (2021) Federated quantum machine learning. Entropy 23 (4), pp. 460. Cited by: §IV, §V.
  • S. Y. Chen (2022) Quantum deep recurrent reinforcement learning. arXiv preprint arXiv:2210.14876. Cited by: §II, §V.
  • E. A. Cherrat, I. Kerenidis, and A. Prakash (2022) Quantum reinforcement learning via policy iteration. arXiv preprint arXiv:2203.01889. Cited by: §II.
  • M. Chevalier-Boisvert, L. Willems, and S. Pal (2018) Minimalistic gridworld environment for openai gym. GitHub. Cited by: §VI.1.3.
  • R. Di Sipio, J. Huang, S. Y. Chen, S. Mangini, and M. Worring (2022) The dawn of quantum natural language processing. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8612–8616. Cited by: §IV.
  • D. Dong, C. Chen, H. Li, and T. Tarn (2008) Quantum reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38 (5), pp. 1207–1220. Cited by: §II.
  • D. Heimann, H. Hohenfeld, F. Wiebe, and F. Kirchner (2022) Quantum deep reinforcement learning for robot navigation tasks. arXiv preprint arXiv:2202.12180. Cited by: §II.
  • J. Hsiao, Y. Du, W. Chiang, M. Hsieh, and H. Goan (2022) Unentangled quantum reinforcement learning agents in the openai gym. arXiv preprint arXiv:2203.14348. Cited by: §I, §II, §II.
  • S. Jerbi, A. Cornelissen, M. Ozols, and V. Dunjko (2022) Quantum policy gradient algorithms. arXiv preprint arXiv:2212.09328. Cited by: §II.
  • S. Jerbi, C. Gyurik, S. Marshall, H. J. Briegel, and V. Dunjko (2021) Variational quantum policies for reinforcement learning. arXiv preprint arXiv:2103.05577. Cited by: §I, §II, §II.
  • Q. Lan (2021) Variational quantum soft actor-critic. arXiv preprint arXiv:2112.11921. Cited by: §II.
  • O. Lockwood and M. Si (2020) Reinforcement learning with quantum variational circuit. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 16, pp. 245–251. Cited by: §I, §II.
  • N. Meyer, D. D. Scherer, A. Plinge, C. Mutschler, and M. J. Hartmann (2022a) Quantum policy gradient algorithm with optimized action decoding. arXiv preprint arXiv:2212.06663. Cited by: §II.
  • N. Meyer, C. Ufrecht, M. Periyasamy, D. D. Scherer, A. Plinge, and C. Mutschler (2022b) A survey on quantum reinforcement learning. arXiv preprint arXiv:2211.03464. Cited by: §II.
  • K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii (2018) Quantum circuit learning. Physical Review A 98 (3), pp. 032309. Cited by: §IV.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §III.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §I.
  • M. A. Nielsen and I. L. Chuang (2010) Quantum computation and quantum information. Cited by: §I.
  • J. Preskill (2018) Quantum computing in the nisq era and beyond. Quantum 2, pp. 79. Cited by: §I.
  • J. Qi, C. H. Yang, and P. Chen (2021) Qtn-vqc: an end-to-end learning framework for quantum neural networks. arXiv preprint arXiv:2110.03861. Cited by: §IV.
  • A. Sannia, A. Giordano, N. L. Gullo, C. Mastroianni, and F. Plastina (2022) A hybrid classical-quantum approach to speed-up q-learning. arXiv preprint arXiv:2205.07730. Cited by: §II.
  • M. Schenk, E. F. Combarro, M. Grossi, V. Kain, K. S. B. Li, M. Popa, and S. Vallecorsa (2022) Hybrid actor-critic algorithm for quantum reinforcement learning at cern beam lines. arXiv preprint arXiv:2209.11044. Cited by: §II, §II.
  • A. Sequeira, L. P. Santos, and L. S. Barbosa (2022) Variational quantum policy gradients with an application to quantum control. arXiv preprint arXiv:2203.10591. Cited by: §II.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. nature 550 (7676), pp. 354–359. Cited by: §I.
  • A. Skolik, S. Jerbi, and V. Dunjko (2022) Quantum agents in the gym: a variational quantum algorithm for deep q-learning. Quantum 6, pp. 720. Cited by: §I, §II.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §III, §III.
  • S. Wiedemann, D. Hein, S. Udluft, and C. Mendl (2022) Quantum policy iteration via amplitude estimation and grover search–towards quantum advantage for reinforcement learning. arXiv preprint arXiv:2206.04741. Cited by: §II.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §III.
  • S. Wu, S. Jin, D. Wen, and X. Wang (2020) Quantum reinforcement learning in continuous action space. arXiv preprint arXiv:2012.10711. Cited by: §II.
  • R. Yan, Y. Wang, Y. Xu, and J. Dai (2022) A multiagent quantum deep reinforcement learning method for distributed frequency control of islanded microgrids. IEEE Transactions on Control of Network Systems 9 (4), pp. 1622–1632. Cited by: §II.
  • C. H. Yang, J. Qi, S. Y. Chen, P. Chen, S. M. Siniscalchi, X. Ma, and C. Lee (2021) Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6523–6527. Cited by: §IV.
  • C. H. Yang, J. Qi, S. Y. Chen, Y. Tsao, and P. Chen (2022) When bert meets quantum temporal convolution learning for text classification in heterogeneous computing. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8602–8606. Cited by: §IV.
  • W. J. Yun, Y. Kwak, J. P. Kim, H. Cho, S. Jung, J. Park, and J. Kim (2022a) Quantum multi-agent reinforcement learning via variational quantum circuit design. arXiv preprint arXiv:2203.10443. Cited by: §II.
  • W. J. Yun, J. Park, and J. Kim (2022b) Quantum multi-agent meta reinforcement learning. arXiv preprint arXiv:2208.11510. Cited by: §II.