In this post I argue that the concepts of data transfer and data storage should not be conflated into a single solution. Like many problems in computer science, these become easier to solve when each is abstracted into its own solution space. I believe that OpenStack can benefit from a new component that offloads the burden of optimally transferring images from existing components like nova-compute and swift.
Within the OpenStack world there are a few interesting storage systems. Swift, Gluster, and Ceph are just three that immediately come to mind. These systems do amazing things like data redundancy, distribution, high availability, parallel access, and consistency, to name just a few. As such systems get more complex they can become aware of caching levels and tertiary storage. Storage systems also need to be concerned with the integrity of the physical media used to store the data, which quickly leads to a system of checksums and forward error correction. One can imagine how complex that can become.
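To make the integrity concern concrete, here is a minimal sketch of per-chunk checksumming, the building block behind detecting media corruption. The function names and the tiny chunk size are my own for illustration; real systems use much larger chunks and pair this with forward error correction to repair, not just detect, damage.

```python
import hashlib

def chunk_checksums(data: bytes, chunk_size: int = 4) -> list[str]:
    # One SHA-256 digest per fixed-size chunk, so corruption can be
    # localized to a single chunk instead of invalidating the whole object.
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def find_corrupt_chunks(data: bytes, expected: list[str],
                        chunk_size: int = 4) -> list[int]:
    # Indices of chunks whose digest no longer matches the stored one.
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

With per-chunk digests, a single flipped byte only forces the storage system to repair (or re-replicate) one chunk rather than the entire object.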
I have probably missed many other challenges, and yet that list alone is daunting. In addition, storage systems need an access protocol that enables reading and writing data. The access protocol is used in many ways, including random access, block-level IO, small chunks, large chunks, and parallel IO.
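The difference between those access patterns can be sketched with a local file standing in for a storage backend. This is only an illustration of the two ends of the spectrum (random block reads versus sequential streaming); the function names are mine, and a real storage API would dispatch these over its own protocol.

```python
import os

def read_block(path: str, offset: int, length: int) -> bytes:
    # Random, block-level access: fetch `length` bytes at `offset`
    # without reading the rest of the object.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

def stream_chunks(path: str, chunk_size: int = 65536):
    # Sequential streaming access: yield the whole object in
    # fixed-size chunks, the pattern a bulk transfer uses.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```

A database-style client would lean on `read_block`; a transfer service consuming the same API would lean on `stream_chunks`.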
With the access protocol users can also stream large data sets from the storage system to a client (and thereby another storage system), even across a WAN. However, I argue that such actions are often best left to a service dedicated to that job (as I described in a previous post). The storage system's control domain ends at its API. After that, all bytes coming and going are in the wild west.
A transfer service’s primary responsibility is moving data from one place to another in the most efficient, safe, and effective way. GridFTP and Globus Online provide good examples of transfer services. The transfer service’s job is to make the lawless land between two storage systems safer. Its duty is to make sure that all bytes (or bytes that look just like them) make it across the network and to the destination, safely and quickly and without disruption to other travellers.
When dealing with large data set transfers the following must be considered:
- Restart transfers that fail after partial completion without having to retransmit large amounts of data.
- Negotiate the fastest/best protocol between endpoints.
- Set protocol-specific parameters for optimal performance (e.g., TCP window size).
- Schedule transfer for an optimal time (which can prevent thrashing).
- Manage the resources it is using (network, CPU, etc.) on both the source and destination, and prevent overload.
- Allow for 3rd-party transfers (do not force the end user to speak every complex protocol).
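The first point in the list above, restarting a failed transfer without retransmitting everything, can be sketched in a few lines. This is a deliberately simplified local-copy version of the idea: the bytes already present at the destination serve as the checkpoint, so a retry resumes at that offset. A real transfer service (GridFTP, for instance) does this over the network and also verifies the partial data before resuming; the function name here is my own.

```python
import os

def resume_copy(src: str, dst: str, chunk_size: int = 65536) -> int:
    # Resume from however many bytes the destination already holds,
    # so a failed transfer only retransmits the remainder.
    # Returns the number of bytes moved by this invocation.
    done = os.path.getsize(dst) if os.path.exists(dst) else 0
    sent = 0
    with open(src, "rb") as s, open(dst, "ab") as d:
        s.seek(done)  # skip what the destination already has
        while chunk := s.read(chunk_size):
            d.write(chunk)
            sent += len(chunk)
    return sent
```

If a 10 GB transfer dies at 9 GB, a retry moves 1 GB instead of 10; at WAN scale that difference dominates total transfer time.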
Just as the transfer service is not concerned with data once it safely hits a storage system, the storage system should not be concerned with the above list. Yet both services are needed in an offering like OpenStack.
When data is written to storage it should be kept safe and available. When it is read, exactly the same data should be immediately available and correct. That is the charge placed on the storage system, and that is where its charge should reasonably end. The storage system cannot also be responsible for ensuring that the data crosses networks to other storage systems, which are often outside of its control, safely and in the most efficient manner. That is asking too much of one logical component. That is the job of a transfer service.