Posted by: John Bresnahan | June 29, 2013

tempest, devstack, and bugs, oh my!

One of the best things about OpenStack development is the rich testing and review framework around it, the most important parts of which are tempest and devstack.  In this post I will discuss how I recently squashed a particularly painful bug (read: I make dumb mistakes) in my latest patch to Glance.

The Bug

My patch seemed like a fairly minor change.  Functionally it added little, but it touched quite a few files.  My first patch set passed all the automated tests and received some thoughtful reviews from the Glance community.  I made the requested changes, which seemed innocuous but caused the patch to fail the gate tests.  This was perplexing because it still passed all the unit tests and functional tests just fine, as well as my own manual tests.  Walking through the code with a suspicious eyeball also gave me no hints as to the problem.

The output from tempest on review.openstack.org was fantastic.  Not only did I get the output from all of the nosetests under python 2.6/2.7 and the output of the gate tests, I could also access the logs of the devstack screen sessions (more on this later).  This output was found on the gerrit review page for the patch I submitted.  Jenkins added comments to it that look like the following:

[Image: Jenkins comment on the gerrit review, with a failure link circled]

By clicking on one of the failure links (like the one circled above) I was taken to quite a bit of information, specifically this page where I found the console log (the output from the tempest tests) and the screen session logs (which are basically the logs from each OpenStack service).

This was amazingly helpful.  Unfortunately in my case all I found out was that the bug was indeed my fault, and that I still had no idea why.  At this point I knew that I needed to step through the code.

This was a problem for me.  I had a good development environment for stepping through Glance’s unit and functional tests, but not the gate tests.  I needed a way to run these failing tests in my own debug environment.  After many fruitless attempts (which included writing my own client to simulate the test which failed — working harder not smarter) here is what I ultimately did.

Devstack

I had a Fedora 18 devstack-enabled VM ready for use (if you do not have a devstack VM, make one).  I originally created it using virt-manager and ran it with 2 vCPUs and 4GB of RAM.  To ‘devstack enable’ it I cloned devstack from github and then ran stack.sh (it is pretty much that easy, see http://devstack.org).  I then learned how easy it is to run the tempest gate tests on a devstack VM.  This was literally all I did:

nosetests tempest

boom! The gate tests were running.  Not with my faulty code, and not in my debug environment, but it was on my machine and under my control.  I was halfway there.
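
For anyone building the same environment, the devstack bootstrap itself was roughly the following (a sketch; stack.sh will prompt for the handful of passwords it needs if you do not give it a localrc):

git clone https://github.com/openstack-dev/devstack.git
cd devstack
./stack.sh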

My Patch

To get my devstack VM to run my comically brain-dead and buggy code I had to do the following:

cd /opt/stack/glance
git fetch https://review.openstack.org/openstack/glance refs/changes/92//8 && git checkout FETCH_HEAD
git checkout -b squish_this_bug

note that the second command can be copied from any gerrit review by clicking on the button shown below.

[Image: gerrit review page, with the patch download button highlighted]

Now that I had the code in the right place I just needed to restart Glance in devstack.  To do this I simply attached to the screen session with:

screen -r

Then I found the glance-api and glance-registry sessions by hitting <ctrl+a+space_bar> until I saw g-reg/g-api in the bottom toolbar:

[Image: devstack screen session showing g-api/g-reg in the bottom toolbar]

At this point I was at a command line just as though I had typed the command myself.  So all I had to do was hit <ctrl+c> to kill it, and then press the up arrow and enter to restart it.  Now devstack was running my troubled code.

Stepping Through The Code

I am partial to pycharm for my IDE and debugger (hey jetbrains, wanna hook a bruddah up with a free open source license?).  In a previous blog post I talked about how to get that running.  I will review and expand on that a bit here.

NFS and Remote Debugging

Even though my devstack VM was running on my laptop, just like pycharm, it was still a remote process.  The first thing I had to do was give pycharm and devstack access to the same files.  I did this by running NFS inside of the devstack VM and exporting /opt/stack.  I then used my host laptop as an NFS client and mounted the VM’s /opt/stack file system on my laptop in the same place (note: it must be mounted at /opt/stack).  At this point both the laptop and the VM had access to the devstack source code.
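
Concretely, the wiring was along these lines (a sketch; the export options and service names vary by distro):

# on the devstack VM: export /opt/stack
echo "/opt/stack *(rw,no_root_squash)" >> /etc/exports
exportfs -ra
systemctl start nfs-server.service

# on the laptop: mount it at the same path
mkdir -p /opt/stack
mount -t nfs <devstack VM IP>:/opt/stack /opt/stack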

pycharm

Next I needed to create a pycharm project for glance under /opt/stack/glance.  I started pycharm, clicked on File->Open Directory, and selected the directory /opt/stack/glance.  From there pycharm did the rest of the work, modulo a few questions which had easy answers (click next).  Finally I had to set up remote debugging as outlined in my previous post.

Once pycharm was configured to accept remote connections I only needed to tell Glance to connect to it on start up.  To do this I opened /etc/glance/glance-api.conf (or /etc/glance/glance-registry.conf depending upon which service I was fighting with at the time) and added the following line:

pydev_worker_debug_host = <IP of host machine>

Then I just killed the process in the screen session and restarted it (as described above with <ctrl+c> and up arrow) and everything was all connected.

From there I was able to finally determine that in python there is in fact a very big difference between None and [].  Does anyone want to fund my first trip to PyCon?!

Posted by: John Bresnahan | April 25, 2013

A Picture Can Beat 1000 Dead Horses

Unless this is your first time reading my blog, you are probably aware that I am beginning to become obsessed with the idea of a data transfer service.  In this post I continue the topic from my previous post by introducing a couple of diagrams.

[Diagram: the wild west: a client on the left, a swift deployment on the right, with unmanaged network in between]

A diagram of a possible swift deployment is on the right side.  On the left is a client of that service.  The swift deployment is very well managed, redundant, and highly available.  The client speaks to swift via a well-defined REST API, using supported client-side software to interpret the protocol.  However, between the server-side network protocol interpreter and the client-side network protocol interpreter is the wild west.

The wild west is completely unprotected and unmanaged. Many things can occur that cause a lost, slow, or disruptive transfer.  For example:

  • Dropped connections
  • Congestion events
  • Network partitions

Such problems make data transfer expensive.  Ideally there would be a service to oversee the transfer.  Transfers could be check-pointed as they progress so that if a connection is dropped the transfer could be restarted with minimal loss.  The service could also try to maximize the efficiency of the pathway between the source and the destination by tuning the protocols in use (like setting a good value for the TCP window), using multicast protocols where appropriate (like bittorrent), or scheduling transfers so as to not shoot itself in the foot.

A safer architecture would look like this:

[Diagram: the tamed west: the same client and swift deployment, with a transfer service managing the pathway between them]

The transfer service is now in a position to manage the transfer, which allows for the following:

  • A fire-and-forget asynchronous transfer request from the client.
  • Baby-sitting and checkpointing the transfer; if it fails, restarting it from the last checkpoint.
  • Scheduling transfers for optimal times.
  • Prioritizing transfers and their use of the network.
  • Coalescing transfer requests and scheduling them appropriately, including into multicast sessions.
  • Negotiating the best possible protocol between the two endpoints.
  • Verifying that the data is successfully written to the destination storage system and verifying its integrity.

Posted by: John Bresnahan | April 24, 2013

Storage != Transfer

In this post I argue that the concepts of data transfer and data storage should not be conflated into a single solution.  Like many problems in computer science, these two can be more easily solved by abstracting each into its own solution space.  I believe that OpenStack can benefit from a new component that offloads the burden of optimally transferring images from existing components like nova-compute and swift.

Storage Systems

Within the OpenStack world there are a few interesting storage systems.  Swift, Gluster, and Ceph are just three that immediately come to mind.  These systems do amazing things like data redundancy, distribution, high availability, parallel access, and consistency, to name just a few.  As such systems get more complex they can become aware of caching levels and tertiary storage.  Storage systems also need to be concerned with the integrity of the physical media used to store the data, which quickly leads to a system of checksums and forward error correction.  One can imagine how complex that can become.

I have probably missed many other challenges, and yet that list alone is nearly daunting.  In addition to it, storage systems need an access protocol that enables reading and writing data.  The access protocol is used in many ways, including random access, block-level IO, small chunks, large chunks, and parallel IO.

With the access protocol users can also stream large data sets from the storage system to a client (and thereby to another storage system), even across a WAN.  However, I argue that such actions are often best left to a service dedicated to that job (as I described in a previous post).  The storage system’s control domain ends at its API.  After that, all bytes coming and going are in the wild west.

Transfer Service

A transfer service’s primary responsibility is moving data from one place to another in the most efficient, safe, and effective way.  GridFTP and Globus Online provide good examples of transfer services.  The transfer service’s job is to make the lawless land between two storage systems safer.  Its duty is to make sure that all bytes (or bytes that look just like them) make it across the network to the destination, safely and quickly and without disruption to other travellers.

When dealing with large data set transfers the following must be considered:

  • Restart transfers that fail after partial completion without having to retransmit large amounts of data.
  • Negotiate the fastest/best protocol between endpoints.
  • Set protocol-specific parameters for optimal performance (eg: TCP window size).
  • Schedule transfers for an optimal time (which can prevent thrashing).
  • Manage the resources being used (network, CPU, etc) on both the source and destination and prevent overheating.
  • Allow for third-party transfers (do not force the end user to speak every complex protocol).

Just as the transfer service is not concerned with data once it safely hits a storage system, the storage system should not be concerned with the above list.  Yet both services are needed in an offering like OpenStack.

Summary

When data is written to storage it should be kept safe and available.  When it is read, the exact same data should be immediately available and correct.  That is the charge placed on the storage system, and that is where its charge should reasonably end.  The storage system cannot be responsible for making sure the data crosses networks to other storage systems, which are often outside of its control, safely and in the most efficient manner.  That is asking too much of one logical component.  That is the job of a transfer service.

Posted by: John Bresnahan | April 22, 2013

Portland OpenStack Summit

I joyfully attended the OpenStack Summit in Portland last week.  The sessions and conversations educated and inspired me.  I finally saw the faces of the talented developers with whom I have worked for months.  In particular I would like to thank Nikhil Komawar, Alex Meade, Brian Rosmaita, and Iccha Sethi for helping hash through the ideas in our unconference session “Image Transfer Service”.

There are quite a few topics that I want to explore further in future blogs (hopefully soon) including:

  • Image transfer as first class functionality (not just a side effect of storage)
  • Use cases for an image transfer service
  • Image replica management
  • Centralized image repository for HPC use cases

But for now I wanted to thank Red Hat for sending me to the Summit, and thank all of the great minds that tossed around such interesting conversation during the Summit.  It was truly a gift for me to be there.

Posted by: John Bresnahan | April 9, 2013

Free Cache

The Need For A Cache In Glance

Glance can be configured with a variety of back-end storage modules.  It can be backed by swift, Ceph, a local file system, S3, another HTTP service, etc.  When a client goes to download an image, Glance first finds the back-end store where the image resides and connects to it.  The image is streamed from that back-end store to Glance, where it is then routed to the client.  If the back-end store is a local file system then this is fast and easy.  If, however, it is S3, or a remote Swift service, or HTTP, or anything else that could potentially be across a WAN, a significant price is paid for the extra network hop needed to route it through Glance.

In an effort to alleviate this problem, caching functionality exists in Glance.  When an image is transferred from a back-end store through glance on its way to the client, it is written to a local file system as well as being sent to the client.  In this way, if that image is ever requested in the future, glance does not need to contact the back-end store; instead it can open up the local file and send it without that potentially expensive WAN hop.
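
For reference, turning the cache on is a small glance-api.conf change, roughly the following (a sketch from memory; exact option names and paths vary by release and distro):

flavor = keystone+cachemanagement
image_cache_dir = /var/lib/glance/image-cache/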

The Problem

The cache described above does its job well.  However, a small problem is introduced when Glance API services are scaled out horizontally over many nodes.  Say you have several Glance API servers running on different nodes that are all backed by the same Swift service.  The pool of glance servers are all behind the same VIP or DNSRR, so to the client they all appear to be the same endpoint.  However, because they have different disks, each will have its own cache.  If a client downloads an image, it may be routed to Glance-API server A.  The image will be retrieved from Swift to A, then cached on A‘s disk as it is sent to the client.  Now if another client goes to download that same image, DNSRR may send them to Glance-API server B.  B does not have this image in its cache, so it will have to go back to Swift and download it.

Such is life with cache, sometimes you hit and sometimes you miss.  Not a problem.  What is a problem is how administrators are currently able to manage the cache of each node.

Currently within Glance there are some cache management API calls that an administrator can use to do things like see what is in the cache, delete images from the cache, etc.  Obviously this is a problem if these calls go through DNSRR.  One call may be directed to host A, while the next is directed to host B.  Each host has an entirely different dataset, and thus there is no consistency between calls.  Further, the caller has no control over which endpoint they actually contact.
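
For reference, the current per-node tooling looks something like this (each invocation lands on whichever host the service name resolves to):

glance-cache-manage list-cached
glance-cache-manage delete-cached-image <image-id>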

Solutions

One proposed solution to this problem is to separate out the administrator interface to cache management from the rest of the API calls and into its own service.  This service would then be run on each node and there would be no DNSRR in front of it.  The administrator would contact and manage each cache separately and thus consistency would be achieved.  This certainly solves the problem and is a reasonable thing to do.

However, I would like to look at this problem from a different angle….

Multiple Locations

There is an active blueprint in the works for Glance called Multiple Image Locations.  This describes a means for glance to return to a client a list of locations where the image is stored.  From there, the client may be able to access those locations directly, instead of routing the data through glance.  This is a potentially large optimization.

Glance has always been an image registry (as well as a transfer service).  With this enhancement it also becomes a replication service.  Not only can clients find images, but they can also discover specifically where those images are and from there select the best location for their needs.

For example, say that an image is downloaded from a third-party anonymous HTTP site and put into a Glance service which is backed by Swift.  With Multiple Locations the metadata for that image can now present the client with 3 options for download: a Swift URL, a Glance URL, and the original HTTP URL.  The client knows where it sits in the network topology, and the workload of its own system, much better than Glance does, thus it is in the best place to choose the ideal source for the image.  Pushing on this a bit more, let’s say that Glance was backed by a file system instead of Swift and that the client wishing to get the image has access to the same file system.  In that case a file system copy could be done, which would avoid a lot of extra cycles.  (Note this is not a far-fetched case; options for this exist in nova-compute right now.)

The cached copy may be the best location from which a client can download, but it may not be.  The client may be in a position on the network where direct access to Swift is faster, or a copy from a shared file system like Gluster is possible.  There are many reasons why the cached copy may not be the best place, and it will always be the client (or more specifically the downloader of that image) that is in the best position to make that choice.

What is Cache Anyway?

Ultimately a cached copy of an image is simply another replicated location of that image.  It may be a more transient location, but that is all defined by the SLAs of any given store and is a policy outside of the control of Glance.  There is little difference between the cached copy and a copy on some remote Swift service or some HTTP server.  So why treat it any differently?

Is the cache going to need tools to manage consistency that other replicas will not? If we have a service designed specifically for managing cached copies (verifying, removing, and generally maintaining consistency), won’t we also need similar tools to accomplish the same thing for the other replicated copies of the image held in its metadata?  Ideally we could generalize this into a single solution for both.

A Proposal

I suggest that we fold the idea of caching into the idea of multiple locations.  When an image is routed through the Glance service it should (or could) still be written to a local disk cache.  Once the copy is made, Glance would update the metadata for that image with an HTTP URL that points to this host’s IP address (not the VIP or the DNS name).  Thus when a client looks up all of the locations of the image, that cached copy is listed with the others and the client is empowered to pick the best location for itself.  When downloading an image from Glance in the traditional way, Glance could check the list of locations in a similar manner.  If there is a cached copy that is deemed the best location, Glance would open it and stream it to the client, and business would be conducted as it is today.
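
To make this concrete, a location lookup by the client might then return something like the following (a hypothetical sketch; the exact request and field layout depend on how the multiple-locations work lands, and every URL here is made up):

curl -s -H "X-Auth-Token: $TOKEN" http://glance.example.com:9292/v2/images/<image-id>

with the image metadata containing, among other things:

"locations": [
    {"url": "swift+http://swift.example.com/v1/images/<image-id>"},
    {"url": "http://10.0.0.7:9292/cache/<image-id>"},
    {"url": "http://mirror.example.com/f16-x86_64-openstack-sda.qcow2"}
]

where 10.0.0.7 would be the actual IP of the node holding the cached copy.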

Replica Management

Let’s get back to the problem at hand: the administrator’s service for managing the image cache.  I argue that this should be generalized into a set of tools for managing replicas.  Let’s deprecate the current service and instead create tools which will solve the problems of the future as well.  An admin is going to need a way to verify the consistency of all the locations listed for a given image, not just the ones on local disk.  The details of how this tool (or service) would work should be hashed out by the community, but in general it would contact Glance, get a list of replicated locations for a given image, and then perform the needed operations on them based upon their URL.  It would ensure the consistency of all the points of replication, and not just the one special case.

Posted by: John Bresnahan | March 26, 2013

The Cost Of Client Side Image Downloads On the CPU

Introduction

In a previous post I discussed the importance of managing the resource consumption of large file transfers.  Here I illustrate one of the lesser-considered resources involved: the CPU.  The NIC (and the network in general) is always thought of as a consumable resource involved in a data transfer.  To a lesser extent the disk bandwidth is considered, and on occasion the system bus is as well.  However, the effects on the CPU tend to be underestimated.

Nova-compute is co-located with the hypervisor and all of the CPUs running user virtual machines (VMs), thus the effects of the OpenStack services on the CPU are important to consider.  The extent to which nova-compute is itself a noisy neighbor to user VMs should be minimized, or at least understood.  Currently nova-compute downloads images by importing the python-glanceclient module and making calls to it that execute the HTTP/HTTPS protocol.  Let’s take a look at the cost of such a download to the CPU.

The Experiment

I used the following set up to study this problem:

  • A Lenovo T530 laptop with 4 cores and 8GB RAM
  • A VM with 2GB of memory and 2 cores running on the above laptop
  • Fedora 18 installed on the VM
  • DevStack running a Glance server
  • A 2GB image (just a file I made from /dev/urandom)

I uploaded the image to the Glance service.  I then downloaded the image with two different clients:

  1. curl: This shows the results from a client without some of the complicated python inefficiencies and thus provides a baseline for the best-case situation.
  2. python-glanceclient: This provides a much more realistic look at the load that nova-compute introduces on a hypervisor.

The download trials were performed with both HTTP and HTTPS over the loopback interface.  The data was written to /dev/null, thus eliminating any overhead from the file system.  The CPU load was measured in two different ways.  First I used GNU time 1.7, which gave an overall summary of the resources consumed.  To show the load over time I used top (the exact command was top -b -d 1 -p <pid>).
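
The commands were roughly the following (a sketch; the image ID is a placeholder, and the HTTPS trials used the corresponding https:// endpoint):

/usr/bin/time curl -s -H "X-Auth-Token: $TOKEN" http://127.0.0.1:9292/v1/images/<image-id> -o /dev/null
/usr/bin/time glance image-download <image-id> > /dev/null
top -b -d 1 -p <pid of the downloading client>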

Results

CPU Load Measured From GNU Time

Client    Type     Time      CPU Load
curl      HTTP      9.61s    12.00%
curl      HTTPS    35.65s    89.00%
glance    HTTP      9.57s    66.00%
glance    HTTPS    39.98s    90.00%

CPU Load Over Time From top

[Graph: CPU load over time for the curl downloads]

[Graph: CPU load over time for the python-glanceclient downloads]

Conclusions

The above results show that a substantial amount of processing overhead is introduced by downloading images, especially when the expense of SSL is added.  They also show that even without SSL, an HTTP download can be costly to a CPU when using python code.  While this load lasts for only a short period of time, in the SSL case it lasts long enough that it should be managed.  Further, it should be noted that the problem is exacerbated when many images are downloaded at the same time (which is likely to happen as machines like the SM15000 are considered for use with OpenStack).

Posted by: John Bresnahan | March 6, 2013

A Look At Performance When Glance Is Backed By Gluster

In a previous post I described how to configure OpenStack Nova and Glance to be backed by Red Hat Storage (Gluster).  Here I will push on that thought a bit by looking at the performance of Glance when backed by Gluster in a couple of different configurations.

Backing Glance by Gluster has a few advantages:

  1. High Availability.  Gluster can be configured with various levels of redundancy and replication. By having Glance store its data in Gluster these features are passed up to Glance.
  2. Glance Server Horizontal Scaling.  Because Gluster is a shared file system many Glance servers can mount it and have access to the same files and namespace.  Therefore many Glance services can sit behind a virtual IP or a load balancer or DNSRR, and thus be able to handle higher client loads.
  3. When configured in conjunction with nova, the overhead of image propagation can be greatly reduced.

However, before jumping to too many conclusions about the miracles of this dynamic duo I decided to look at some of the potential costs.  Specifically I am looking at the performance effects.

I got my hands on 4 beefy machines (64GB of RAM and 64 processors each!).  I configured two of them to be Gluster storage bricks, another to be a Glance server, and the final one to be a Glance client.

Baseline Numbers

Network

The first thing I wanted to do was get an idea of the performance of the local file system and the network in order to establish a high-water mark.  I used iperf to measure the TCP network speed between all 4 machines.  After several trials the results were consistently between 940 and 942 megabits per second (a common result for untuned gigabit ethernet).
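
The iperf runs were along these lines (a sketch):

iperf -s                      # on the machine acting as the server
iperf -c <server IP> -t 30    # on each of the other machines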

Local Disk

For the local disk I created a 1GB file with the following command:

dd if=/dev/urandom of=1GBtestfile count=1024 bs=1048576

I then measured the time it took to copy that file to another location on the same file system.  Because most of the data stayed in kernel memory buffers, the results were quite fast.  To mitigate the effects of caching I also measured the time it took to do the copy followed by a sync.  The average of 10 trials is shown below:

[Graph: local disk copy speeds, with and without sync]

Without sync the average is almost 6 gigabits/second.  With the sync it is 818 megabits/second.
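
The two timing variants were along these lines (a sketch; the destination path is made up):

time cp 1GBtestfile /data/copytest/1GBtestfile
time sh -c "cp 1GBtestfile /data/copytest/1GBtestfile && sync"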

Gluster FS

On the Gluster storage nodes I set up two volumes, one distributed and the other replicated.  I added the same 1GB file used above to each of the volumes.  Then, for each volume type I performed two tests:

  1. Copy the file from Gluster to local storage
  2. Copy the file from the Gluster volume to the same Gluster volume

I did this both with and without syncing the data.
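
For reference, the volume setup itself was along these lines (a sketch; the host names and brick paths here are made up):

gluster volume create distvol gserver1:/bricks/dist gserver2:/bricks/dist
gluster volume create replvol replica 2 gserver1:/bricks/repl gserver2:/bricks/repl
gluster volume start distvol
gluster volume start replvol

The results are shown below.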

[Graph: Gluster copy speeds for both volume types, with and without sync]

Glance

By looking at the above results we can begin to guess how a Gluster-backed Glance service would perform in this setup.  Glance will copy a file out of Gluster, then stream it over HTTP to the client, where it will be written to local disk.  Thus the performance should be at best the same as the GFS-to-local cases shown above.  In the next experiment Glance was configured with a local disk, and with Gluster (both volume types).  The 1GB file was uploaded to Glance.  The time it took to download the file in each case was then measured.  The average of 10 trials is shown below.

[Graph: Glance download times when backed by local disk and by both Gluster volume types]

As shown, when Glance is backed by Gluster there is a performance hit, but it is fairly small when weighed against the feature set that Gluster offers Glance.  Not to mention the fact that we are not yet looking at replicated Glance services against the same shared Gluster file system (see Future Work).  In the without-sync case a local disk is about 20% faster; in the with-sync case local disk is about 15% faster.

Now let’s look at the overhead added by Glance.  The following graphs compare a copy from Gluster to the local disk with a download from Glance backed by Gluster.  The difference is the overhead added by Glance.

[Graph: Gluster-to-local copy compared with a Glance download backed by Gluster]

I thought some might find it convenient to see all the results in one place, so here is a final graph showing that (note that Local Disk in the without-sync case is off the chart; the same scale was used to better display the more interesting results):

[Graph: all results on a single chart]

Future Work

This is a very simple look at the performance of the integration of the two systems.  It is just a single client against a single Glance service backed by a small Gluster cluster.  In the future it would be useful to study the effects of a heavy load of simultaneous clients all hitting an increasing number of Glance servers backed by a slightly larger Red Hat Storage cluster.  I think those circumstances will show some serious advantages to this setup.  Hopefully I will someday have the resources to study this, and this first post will provide some context to that study.

I also hope to study the effects of copying a file directly out of gluster and into nova, thus eliminating the overhead introduced by Glance.

Posted by: John Bresnahan | January 24, 2013

Debugging OpenStack with pycharm and pydevd

Introduction

For years I was a strict vi user who peered warily at clickety-clack IDEs. I wrote mostly C code and happily used gdb for debugging.  Then I got involved with a large Java project and working on it with anything other than IDEA was an avalanche of lost productivity.  From there I decided to try out pycharm for my python work and I really liked it.  I now look back on my arrogant stance against IDEs and wonder how much more I could have accomplished were I to use tools outside of the Luddite code.  The main feature of interest for me is its integrated debugger.

OpenStack recently accepted two patches to Glance and Keystone which allow for remote debugging via pydevd.  You can run a pycharm debugger on one machine and the Glance/Keystone (and hopefully more to come) services on another.  When the service starts up it will connect back to the pycharm debugger, providing all the joys of local integrated debugging on a remotely running service.

That said, in this post I will only address the case where the debugger and the service run on the same machine (or at least on machines that share the same file system).  I will address any questions about a real remote setup in comments or a later post.

Pycharm Setup

Much of this is explained in this post; however, I will summarize here what worked for me.  The first thing to do is create a pycharm project.

  1. checkout Glance from git
  2. start pycharm (install it first 😉 ) and create the glance project
    • File->Open Directory->Select your directory

Next set up a remote debugger for the project.  Bring up the debug configurations window by clicking on Run -> Edit Configurations.  You should now see something like what is below:

[Screenshot: pycharm debug configurations window]

Click on the “+” in the upper left corner and select Python Remote Debugging.  Make sure Local host name: is set to “localhost” and Port: is set to 5678.  Set the name at the top to Remote.  It should look like the following:

[Screenshot: the remote debug configuration, with localhost and port 5678 set]

Click OK. Finally start up the debugger with Run -> Debug Remote, or by clicking on the icon in the top center that looks like the following:

[Screenshot: the debug icon in the pycharm toolbar]

In the bottom pane of pycharm you should now see the text:

Starting debug server at port 5678
Waiting for connection...

Python Setup

The pydev libraries must be installed into the python environment which will run the OpenStack services.  Find the file pycharm-debug.egg under the directory where you installed pycharm.  It should be in the top-level directory.  Now do the following:

  1. cd <python installation that will run glance>/lib/python2.7/site-packages
  2. unzip <installation of pycharm>/pycharm-debug.egg

pydev should now be installed.
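
A quick way to verify that from the python environment in question (python 2.x era, hence the print statement):

python -c "import pydevd; print pydevd.__file__"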

OpenStack Service Setup

In two earlier posts I explained how to manually set up Glance and Keystone development environments.  Have that setup (or whatever works for you) and make sure that you have code that includes the needed patches reviewed here and here.

Keystone Configuration

Run keystone-all from the command line with the following two options:

  1. --pydev-debug-host localhost
  2. --pydev-debug-port 5678

keystone-all --pydev-debug-host localhost --pydev-debug-port 5678

Once it starts up, the debug console in pycharm should show some messages notifying you that the connection has been made.

Glance Configuration

With Glance you can set up debugging for either the glance-api server or the glance-registry server.

To set up glance-api debugging open up the glance-api.conf file and make sure the following are set:

pydev_worker_debug_host = localhost
workers = 1

The workers line is important because glance can use many worker processes, however the debugger can only handle one connection.

To set up debugging for glance-registry open glance-registry.conf and make sure the following is set:

pydev_worker_debug_host = localhost

Note: you can only debug one of the services at a time, so do not set the pydev_worker_debug_host option in both files at the same time.

Run the service (either glance-api or glance-registry) and notice how it connects back to pycharm for debugging.

Posted by: John Bresnahan | January 16, 2013

OpenStack Glance And Nova Backed By Red Hat Storage

In this post I will explain how to configure Glance and Nova to be backed by Red Hat Storage (and thereby gluster).

The ISOs that I link to in this post require a Red Hat Network subscription.  The information here can be useful without it, but it is less convenient.

Deployment Details

In my deployment I use two virtual machines: one for Red Hat Storage and the other for all of the OpenStack services (Keystone, Glance, and Nova).  Both VMs run on my Lenovo T530 laptop, which runs Fedora 17.  I will not go into detail about how to create VMs because that is covered pretty well in other places, but I will touch on it just to make things as clear as possible.

Red Hat Storage VM

For the base VM I wanted a lot of space so I created a VM with 48GB of storage (10 should be plenty) and configured it to have 1GB of RAM and 1 CPU.

  1. Get Red Hat Storage by going here and clicking on Binary DVD.
  2. Create a VM with at least 8 GB of storage (mine had 48) and install Red Hat Storage by following the instructions here through step 7.1.
  3. Create a gluster volume with the following commands:
gluster volume create testvol myhost:/exp1
gluster volume start testvol
/etc/init.d/glusterd restart

The VM should now be ready to serve a gluster file system.

OpenStack VM

Here we will use a single VM for all of the OpenStack services.  The explanation of how to back either Glance or Nova by gluster will be the same for a distributed environment.

Create a VM with 8GB of storage using RHEL 6.4 (64 bit) by downloading the binary DVD available here.

Install OpenStack by following the instructions here through step 6.

At this point you should have a VM which is running enough OpenStack services to upload VMs, launch them, and ssh into them.  The instructions referenced above should provide you with steps to verify this.

note: with selinux I had to run the following command in order to properly boot images:
setenforce permissive

You may also need to open up /etc/nova/nova.conf and make sure you have the line:

libvirt_type = qemu

Mount Gluster

The OpenStack VM now needs to mount the storage system provided by Red Hat Storage so that it can be used by Glance and Nova.  First the VM must be configured so that the recommended gluster client can be used.  The instructions for this are here, but I will put them in my own words.

Go to https://access.redhat.com/subscriptions/rhntransition and navigate: Subscriptions -> RHN Classic -> Registered Systems.

Find your system and click on it.  Click on “Alter Channel Subscriptions” on the new page.  Find Additional Services Channels for Red Hat Enterprise Linux 6 for x86_64 and expand it.  Select Red Hat Storage Native Client (RHEL Server 6 for x86_64), then click the Update Subscription button.

Now on the OpenStack machine run:

yum install glusterfs-fuse glusterfs
mkdir -p /mnt/gluster/
mount -t glusterfs <storage VM IP>:/testvol /mnt/gluster

At this point the OpenStack VM should be able to access the gluster file system.
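
A quick sanity check that the mount is live:

mount | grep gluster
df -h /mnt/gluster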

Configure Glance

In order to change the path that Glance uses for its file system store, only a single line in /etc/glance/glance-api.conf needs to be changed:

filesystem_store_datadir = /mnt/gluster/glance/images

Now run the following commands:

mkdir -p /mnt/gluster/glance/images
chown -R glance:glance /mnt/gluster/glance/
# create the directory for the instance store
mkdir /mnt/gluster/instance/
chown -R nova:nova  /mnt/gluster/instance/
service openstack-glance-api restart

At this point glance should be backed by the gluster file system.  Let’s upload a file to glance and verify this.  A good test image is available here.

glance image-create --name="test" --is-public=true --container-format=ovf --disk-format=qcow2 < f16-x86_64-openstack-sda.qcow2
ls -l /mnt/gluster/glance/images

Configure Nova

The final step is to configure Nova so that nova-compute uses gluster for its instance store.  The instance store is the temporary area to which the VM image is copied and from which it is booted.  Just as it was with Glance, configuring nova to use gluster in this way is a simple one-line file change.  Open the file /etc/nova/nova.conf and find the key instances_path.  Change the line to the following:

instances_path = /mnt/gluster/instance

Now set up the correct paths and permissions and restart nova-compute.

mkdir  -p /mnt/gluster/instance/
chown -R nova:nova  /mnt/gluster/instance/
service openstack-nova-compute restart

That should be all that is needed.

Future Work

The idea behind this is that if Glance and Nova are backed by the same file system that image propagation should be much faster.  In the future I will be looking for a testbed where I can verify this.

Update

You should now be able to automatically mount the volume on boot by adding the following to your /etc/fstab file:

<gluster IP>:/glustervol /mnt/gluster glusterfs defaults,_netdev 0 0

Posted by: John Bresnahan | January 11, 2013

An Image Transfer Service For OpenStack

In this post I am presenting reasons OpenStack could benefit from a new transfer service component.  This component would receive VM images and write them to nova instance storage on behalf of nova-compute, providing the following benefits:

  1. A more flexible architecture
  2. A more predictable quality of service for end users’ VM instances.
  3. A cleaner separation of concerns for nova-compute.

Introduction

IaaS clouds must transfer VM images from the repositories in which they reside to the compute nodes where they are booted.  In the current state of OpenStack, images download via HTTP through a nova-compute client speaking to the Glance image service.  Within the Glance v2 protocol a notion of a direct URL exists.  Looking forward, clients can leverage this URL to bypass Glance as a transfer service, allowing direct access to the backend store.

This feature helps because, depending on the architecture of a given OpenStack deployment, a download using Glance HTTP may perform suboptimally.  In some cases, a direct image pull from swift transfers data faster than any other option.  In other cases a simple file copy excels.  In more complicated, yet real-world, examples a bittorrent swarm propagates the image ideally.  Other IaaS offerings use advanced protocols like GridFTP and LANTorrent for fast image propagation.  Because the best possible option varies, the Glance community created a blueprint to expose multiple locations of a single object to the user.

The idea is that Glance will operate in part as an image replica service, translating logical names (image IDs) into physical locations (URLs).  The client receives the list of physical locations and the responsibility of choosing the best one.  Unfortunately this can be a complicated decision to make.

Advantages Of A Separate Service

Simply being able to speak every protocol puts an unnecessary burden on the client.  As the set of protocols grows, so must the dependencies of the client.  Moreover, the client may need global information in order to select the best URL, for example the network topology between endpoints or the current demand on each endpoint.  Advanced optimizations like queueing and reordering requests could have dramatic performance benefits in certain use cases and thus should be possible.  Such complicated logic burdens components like nova-compute, whose primary concern has nothing to do with data transfer.

Virtual machine images commonly range in size from hundreds of megabytes to gigabytes.  Downloading such files can require substantial resources.  Not only does a download place demands on the network and the system’s NIC, it also taxes memory, bus, and CPU resources, particularly if the data stream is encrypted or highly compressed.  If the system doing the download shares resources with co-located services, or if users expect a certain quality of service, then resource management is important.  This is the case with nova-compute nodes.  In addition to running OpenStack services, nova-compute nodes also run end-user VMs.  These VMs expect a reasonable, predictable, and consistent quality of service.  If a download dominates resources it can disrupt the processing of the nova-compute control messages, or worse, it can act as a noisy neighbor to user VMs.

Creating a separate service to handle the downloading of images would not only help with the above two problems, but it would also add flexibility.  Some OpenStack deployments consist of a rack of compute nodes that mount a SAN for use as an instance store. Instead of provisioning each node with the resources necessary to download images, a separate node could mount the file system and be used strictly to transfer images into the instance store.  As more transfer resources are needed, additional transfer nodes could be added.  The decoupling of transfer services and compute services not only provides the flexibility needed to meet this deployment scenario, but it also allows the transfer of images to be well-managed and provides a more predictable quality of service with horizontal scalability.

Third Party Transfers

This idea centers around the concept of the third-party transfer.  The most familiar transfer case is the two-party transfer, in which a client either uploads a file to, or downloads a file directly from, the server.  All of the bytes stream through the client, but the server handles most of the heavy lifting involved with resource management, security, DOS attacks, and the like.  The client is typically (though not necessarily) thin and simple.  This is the case for GET and PUT operations in Glance.  In a third-party transfer the client contacts two servers, a source and a destination.  It routes enough information between the two of them to negotiate a means of transferring the data.  It then steps out of the way and lets the two servers stream the data.  The client does not perform any of the protocol interpretation needed to stream the bytes, nor does it ever see the bytes; the two endpoint servers handle that part of the transfer entirely.  The cost to the client of every transfer is fixed and does not scale with the size of the data set being transferred.  A detailed description of third-party transfers can be found in the FTP RFC.  An animated gif of this process can be found here.

To put this in OpenStack terms, nova-compute would be a client which makes requests to two transfer services, one with access to the source data store and another with access to the destination data store (the nova instance storage file system).  The nova-compute transfer client would route enough information between the two services to instruct them to perform the transfer.  It could then asynchronously poll for status, or its initial request could block until completion.
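
In rough, entirely hypothetical API terms (no such service exists yet, and every URL and field here is made up), the interaction might look like this:

# ask the transfer service near the destination to pull the image
curl -X POST http://transfer.example.com/v1/transfers \
     -d '{"src": "swift://swift.example.com/images/<image-id>", "dst": "file:///opt/stack/instances/<instance-id>/disk"}'
# => {"transfer_id": "1234", "status": "QUEUED"}

# poll until done
curl http://transfer.example.com/v1/transfers/1234
# => {"transfer_id": "1234", "status": "COMPLETE"}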

With such a system in place, the transfer service is now in a position to provide a more graceful (and more optimal) degradation of service.  Because requests to any given endpoint come to a single (yet possibly replicated) service, the service can queue up requests, making the most efficient use of the network possible.  When compared against the case where every image booted results in a download that has TCP back-off algorithms as the only means of providing fair sharing (rarely do congestion events lead to optimal network usage), the benefits are noteworthy.

New Service Design

Hopefully the OpenStack community will buy this pitch, because this effort will take a village.  For now I will have to mark the most fun part (designing and writing software) as TBD.
