Exploring OpenAI Gym: A Platform for Reinforcement Learning Algorithms

Vipul Vaibhaw

Artificial Intelligence / Machine Learning

Introduction 

According to the OpenAI Gym GitHub repository, “OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. This is the gym open-source library, which gives you access to a standardized set of environments.”

OpenAI Gym follows an agent-environment arrangement: Gym gives you access to an “environment” in which an “agent” can perform specific actions. In return, the agent receives an observation and a reward as a consequence of performing a particular action in that environment.

OpenAI Gym Architecture

The environment returns four values for every “step” taken by the agent (the short snippet after this list shows how they are unpacked):

  1. Observation (object): an environment-specific object representing your observation of the environment. For example, the board state in a board game.
  2. Reward (float): the amount of reward/score achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward/score.
  3. Done (boolean): whether it is time to reset the environment again, e.g., you lost your last life in the game.
  4. Info (dict): diagnostic information useful for debugging. However, official evaluations of your agent are not allowed to use this for learning.
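
As a quick illustration, here is a minimal sketch of a single interaction step that unpacks these four values. It assumes the classic Gym API used throughout this post, where env.step() returns a 4-tuple:

import gym

env = gym.make("CartPole-v0")
observation = env.reset()
# take one random action and unpack the four values returned by the environment
observation, reward, done, info = env.step(env.action_space.sample())
print(observation, reward, done, info)
env.close()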

The following categories of environments are available in Gym:

  1. Classic control and toy text
  2. Algorithmic
  3. Atari
  4. 2D and 3D robots

Here you can find a full list of environments.
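
If you would rather inspect the list programmatically, a small sketch along the following lines should work (it assumes the registry object exposed by the classic gym 0.x package):

import gym

# print how many environments are registered and show a few of their ids
all_ids = sorted(spec.id for spec in gym.envs.registry.all())
print(len(all_ids), "environments registered")
print(all_ids[:10])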

Cart-Pole Problem

Here we will try to solve a classic control problem from the Reinforcement Learning literature, the “Cart-Pole Problem”.

The Cart-pole problem is defined as follows:
“A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.”
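
Before rendering anything, it can also help to inspect the environment’s spaces. This short sketch (relying on CartPole-v0’s documented spaces) shows that the observation is a 4-dimensional vector and the action is a binary choice:

import gym

env = gym.make('CartPole-v0')
# Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.observation_space)
# Discrete(2): 0 pushes the cart to the left, 1 pushes it to the right
print(env.action_space)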

The following code will quickly let you see what the problem looks like on your computer.

import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())

This is what the output will look like:

Cart Pole Problem

Coding the neural network 

#We first import the necessary libraries and define hyperparameters -
import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from statistics import median, mean
from collections import Counter
LR = 2.33e-4
env = gym.make("CartPole-v0")
observation = env.reset()
goal_steps = 500
score_requirement = 50
initial_games = 10000
#Now we will define a function to generate training data -
def initial_population():
    # [OBS, MOVES]
    training_data = []
    # all scores:
    scores = []
    # scores above our threshold:
    accepted_scores = []
    # number of episodes
    for _ in range(initial_games):
        score = 0
        # moves specifically from this episode:
        episode_memory = []
        # previous observation that we saw
        prev_observation = []
        for _ in range(goal_steps):
            # choose a random action, left or right, i.e. (0 or 1)
            action = random.randrange(0, 2)
            observation, reward, done, info = env.step(action)
            # since the observation is returned FROM the action,
            # we store the previous observation and the corresponding action
            if len(prev_observation) > 0:
                episode_memory.append([prev_observation, action])
            prev_observation = observation
            score += reward
            if done:
                break
        # reinforcement methodology: IF our score is higher than our threshold,
        # we save the episode. All we are doing is reinforcing the score; we are
        # not trying to influence the machine in any way as to HOW that score is reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in episode_memory:
                # convert the action to one-hot (this is the output layer of our neural network)
                if data[1] == 1:
                    output = [0, 1]
                elif data[1] == 0:
                    output = [1, 0]
                # saving our training data
                training_data.append([data[0], output])
        # reset env to play again
        env.reset()
        # save overall scores
        scores.append(score)
    # the collected (observation, one-hot action) pairs are needed later for training,
    # so the function must return them
    return training_data
# Now using tflearn we will define our neural network
def neural_network_model(input_size):
    network = input_data(shape=[None, input_size, 1], name='input')
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR,
                         loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')
    return model
#It is time to train the model now -
def train_model(training_data, model=False):
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]), 1)
    y = [i[1] for i in training_data]
    if not model:
        model = neural_network_model(input_size=len(X[0]))
    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500,
              show_metric=True, run_id='openai_CartPole')
    return model
training_data = initial_population()
model = train_model(training_data)

#Training complete, now we play the game to see what the output looks like
scores = []
choices = []
for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()
        # take a random action on the first step, then let the model choose
        if len(prev_obs) == 0:
            action = random.randrange(0, 2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1, len(prev_obs), 1))[0])
        choices.append(action)
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score += reward
        if done:
            break
    scores.append(score)
print('Average Score:', sum(scores) / len(scores))
print('choice 1:{} choice 0:{}'.format(
    float(choices.count(1)) / len(choices) * 100,
    float(choices.count(0)) / len(choices) * 100))
print(score_requirement)

This is what the result will look like:

Reinforcement Learning Literature

Conclusion

Though we haven’t used a Reinforcement Learning model in this blog, a plain fully connected neural network gave us a satisfactory accuracy of 60%. We used tflearn, a higher-level API on top of TensorFlow, to speed up experimentation. We hope this blog gives you a head start with OpenAI Gym.

We look forward to seeing exciting implementations built with Gym and Reinforcement Learning. Happy Coding!
