CloudFFS: Large-scale, high-performance file storage system
CloudFFS
So we built and deployed this new service not long ago at work, and @stelabouras suggested we document some parts of it for internal consumption. Given that I haven't blogged for months, I thought I'd just pour those words here instead.
CloudFFS (yes, it is a funny name) is a file system (though not in the traditional sense: it doesn't hook into the kernel VFS layer or anything) that provides storage for an unbounded number of files and very fast access to them over HTTP. It can manage PB-scale volumes and up to 2^64 files per namespace (see below).
We have hundreds of millions of static files (images, video, text files, you name it) stored across our storage devices; having to deal with that many files is not a pleasant task for our system operators and developers alike. We wanted a solution that frees our developers from having to worry about storage and provides a very simple way to store and retrieve files, while at the same time helping our systems guys deal with backups and management of those files efficiently.
There are many problems associated with keeping that many individual files: wasted inodes/disk blocks, slower access times (walking a path's components is not free, and neither is looking up an entry within a directory), difficulty in making backups, the need for elaborate directory naming schemes in order to deal with large directories, and more. In addition to that, accessing those files over a network filesystem (e.g. NFS) is not efficient by any means. Developers need to be aware of those limitations and of the rules that are in place to deal with them, which places an unnecessary burden on them.
None of the solutions we looked into really seemed all that great for us, so we went ahead and built our own. Though, to be fair, we almost always end up building our own anyway. This practice has worked great for us all these years and, given that we are a technology company, it makes sense for us to disregard the 'not invented here' stigma.
Data Model
Files are uniquely identified by a 64-bit number. They belong to namespaces, for example 'blogs', 'images', or 'mails'. A file can hold up to 1GB of data. Files can be either public or private. Public files can be accessed directly (e.g. http://x.pstatic.gr/me/305896160755729.jpg), whereas private files require HTTP authentication. This makes it possible to, say, make everything accessible over the public Web, except files that should not be accessible in that fashion (e.g. log files, archived content, emails, etc.).
On disk, each namespace is represented as a directory in the 'root' directory. Within each namespace directory there are 2 types of files: data files and indices. They are further subdivided into 'live' and 'immutable' files. Immutable files, once created, are never updated again. In practice, there is almost always a single live datafile/live index pair associated with a namespace (there can be 2 for a few seconds, whenever a live datafile is being converted to an immutable one), and that live datafile holds incoming updates. Whenever a SET or DEL operation is executed, a new record is appended to the live datafile and its respective live index. Once the live datafile's size exceeds a threshold, it turns into an immutable one and a new live datafile is created and used instead. Whenever necessary, a compaction task kicks in that merges immutable files, discards deleted files and duplicates, creates a new set of immutable files out of them and deletes the old ones. This happens fairly infrequently though (it depends on the volume of data, but usually once every few months). A live or immutable datafile can hold thousands, millions, or even billions of files, packed one after the other. Maybe I will get to write more about the structure of those files and how they are used in a future blog post.
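To make the live/immutable lifecycle described above a bit more concrete, here is a minimal in-memory sketch. It is not the actual CloudFFS code: the class names, the byte-count threshold, the use of `None` as a deletion marker and the choice to compact everything into a single immutable file are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """One datafile of a namespace: records appended one after the other."""
    records: list = field(default_factory=list)  # (file_id, payload or None)
    size: int = 0
    immutable: bool = False


class Namespace:
    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical rollover size, in bytes
        self.live = DataFile()
        self.immutable = []         # immutable datafiles, oldest first

    def _append(self, file_id, payload):
        self.live.records.append((file_id, payload))
        self.live.size += len(payload) if payload is not None else 0
        if self.live.size > self.threshold:
            # The live datafile becomes immutable; a fresh one takes over.
            self.live.immutable = True
            self.immutable.append(self.live)
            self.live = DataFile()

    def set(self, file_id, payload):
        self._append(file_id, payload)

    def delete(self, file_id):
        self._append(file_id, None)  # a deletion is just another record

    def compact(self):
        """Merge all immutable datafiles, dropping deleted files and duplicates."""
        latest = {}
        for datafile in self.immutable:           # oldest to newest
            for file_id, payload in datafile.records:
                latest[file_id] = payload         # later records win
        merged = DataFile(immutable=True)
        for file_id, payload in latest.items():
            if payload is not None:
                merged.records.append((file_id, payload))
                merged.size += len(payload)
        self.immutable = [merged] if merged.records else []
```

Because every update is an append, a SET or DEL never rewrites existing data; stale versions and deletion markers only go away when a compaction runs.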
Operations
There are 4 operations, all of which are mapped to HTTP methods/verbs: Get (GET), Set (PUT), Stat (HEAD) and Delete (DELETE). A few things that may be worth noting: whenever you Set a file you can specify whether it is a public or a private file. Whenever you Set or Delete a file, or Get or Stat a file that has been stored as private, you need to specify authentication credentials (username/password). Otherwise, the operation fails (authentication failed). The list of users is defined in the configuration file.
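The authentication rule above boils down to a small decision table. The sketch below is only an illustration: the dictionary, the function name and the use of the standard DELETE method name are assumptions, not the service's actual code.

```python
# Hypothetical mapping of HTTP methods to the four operations.
METHOD_TO_OPERATION = {
    "GET": "get",
    "PUT": "set",
    "HEAD": "stat",
    "DELETE": "delete",
}


def requires_auth(method, file_is_private):
    """Return True if the request must carry username/password credentials."""
    operation = METHOD_TO_OPERATION[method]
    if operation in ("set", "delete"):
        return True            # writes always need credentials
    return file_is_private     # reads only need them for private files
```

For example, `requires_auth("GET", False)` is false for a public image, while `requires_auth("HEAD", True)` is true for a private log file.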
Internals
The service is implemented as a multi-threaded application. A single thread handles network I/O (asynchronous I/O multiplexing with vector I/O). There are also some threads for processing requests and 2 threads for processing system tasks (compactions and live-to-immutable data conversions). The network I/O thread also accepts incoming connections and parses HTTP requests. Whenever such a request is parsed, it is placed in the 'mailbox' queue of the first idle thread for processing. The network I/O thread also accepts RPC connections/requests (for management) and talks to our message queue service; whenever a file is stored or deleted, a new event is published to the queue service so that we can replicate data or do whatever else we choose whenever those events are created.
There are 2 types of caches in place. One is for immutable datafile key ranges (a key range holds a series of file ids and their (offset, size) within their respective immutable datafile), and the other is for compressed content (whenever the agent/browser supports gzip/deflate encoding and the requested file can be sent back to the client in compressed form, we compress the contents and cache them for later use). Both operate in an LRU fashion and are protected by spinlocks; the cached objects are reference counted.
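As a rough illustration of the LRU behaviour described above, here is a minimal sketch. It is an assumption-laden stand-in: the class name is made up, a regular `threading.Lock` replaces the spinlock, and Python's garbage collector takes the place of explicit reference counting.

```python
import threading
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache; the least recently used entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()     # oldest entry first
        self._lock = threading.Lock()   # stand-in for the spinlock

    def get(self, key):
        with self._lock:
            if key not in self._items:
                return None
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]

    def put(self, key, value):
        with self._lock:
            self._items[key] = value
            self._items.move_to_end(key)
            while len(self._items) > self.capacity:
                self._items.popitem(last=False)  # evict the oldest entry
```

The same structure would serve both caches; only the keys and values differ (key ranges in one, compressed bodies in the other).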
Whenever a Get request is executed, we iterate through the list of live datafiles for the namespace (usually 1). If the file is not found there, we walk the list of immutable datafiles (they are always sorted by creation time) until we either get a match or run out of datafiles. A Set operation appends the data to the active live datafile for the namespace, syncs to disk and then lazily appends to the index.
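The lookup order for a Get can be sketched as follows. This is a simplified model, not the real implementation: each datafile is represented by a plain dict mapping a file id to its (offset, size), both lists are assumed to be ordered newest first, and the `DELETED` marker is a hypothetical way to represent a deletion record.

```python
DELETED = object()  # hypothetical marker for a deletion record


def lookup(live_indexes, immutable_indexes, file_id):
    """Return (offset, size) for file_id, or None if it is absent or deleted.

    Both arguments are lists of dicts, assumed to be ordered newest first.
    """
    for index in list(live_indexes) + list(immutable_indexes):
        if file_id in index:
            entry = index[file_id]
            # The first match is the most recent record for this id.
            return None if entry is DELETED else entry
    return None
```

Stopping at the first match is what makes it safe to leave stale copies of a file in older datafiles until the next compaction.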
Files on disk (in both live and immutable datafiles) are stored as {header} data {footer}. The header holds the key/id, last modified timestamp, size and flags (private, public, etc.). The footer currently holds a crc32 checksum, which we consult for integrity checks whenever we pull data from disk. If for whatever reason any file (live or immutable, datafile or index file) is corrupt in any way, the system tries to rebuild it if possible, or to salvage as much as possible from it. XFS is our filesystem of choice for the locally attached storage that holds the CloudFFS datafiles. Each of those datafiles can hold millions of files (each namespace has its own capacity / datafile threshold); typically each immutable datafile is around 1GB in size, but it can be TBs in size - there is no hard limit there. All those datafiles make up the namespace.
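Here is one way the {header} data {footer} layout could look in practice. The field widths, byte order, flag value and function names are assumptions chosen for the example; only the list of header fields and the crc32 footer come from the description above.

```python
import struct
import zlib

# Hypothetical layout: 64-bit id, 64-bit timestamp, 32-bit size, 32-bit flags.
HEADER = struct.Struct("<QQII")
FOOTER = struct.Struct("<I")      # crc32 of the data
FLAG_PRIVATE = 0x1                # hypothetical flag bit


def encode_record(file_id, mtime, flags, data):
    """Pack one file as header + data + footer."""
    header = HEADER.pack(file_id, mtime, len(data), flags)
    footer = FOOTER.pack(zlib.crc32(data) & 0xFFFFFFFF)
    return header + data + footer


def decode_record(buf, offset=0):
    """Unpack the record at offset; verify its checksum.

    Returns (file_id, mtime, flags, data, offset_of_next_record).
    """
    file_id, mtime, size, flags = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    data = bytes(buf[start:start + size])
    if len(data) != size:
        raise ValueError("truncated record")
    (stored_crc,) = FOOTER.unpack_from(buf, start + size)
    if zlib.crc32(data) & 0xFFFFFFFF != stored_crc:
        raise ValueError("checksum mismatch")
    return file_id, mtime, flags, data, start + size + FOOTER.size
```

Since each record carries its own size, a datafile can be scanned record by record using the returned offset, which is also what would make rebuilding a damaged index from its datafile feasible.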
Accessing data
A file/object is accessible at http://domain/namespace/id, e.g. http://x.pstatic.gr/me/305896160755729.jpg. The ID is a 64-bit integer that identifies the file to be retrieved. An alternative way to access a file is via http://domain/namespace/n/string. In this case 'string' is used to generate a 64-bit identifier, e.g. http://x.pstatic.gr/avatars/n/M/96/5/22/markp.jpg. In addition, alphanumeric characters can follow the 64-bit identifier - those are mostly ignored, though a filename extension, if present in that string of characters, is used to look up any rules specified in the configuration file for special treatment of files with said extension. For example, you can specify that the content type of 'css' files will be text/css, that they expire 1 day after they were created, and that they can be compressed if the client supports compressed content. Those kinds of options can be set on a per-namespace basis and on a per-namespace, per-extension basis. The HTTP service honors conditional requests (the If-Modified-Since and If-None-Match request headers, which carry Last-Modified and ETag values) and responds with HTTP 304 Not Modified when appropriate.
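The two URL forms can be resolved along these lines. This is a sketch under stated assumptions: the function names are made up, the whole string after /n/ (extension included) is hashed, and 64-bit FNV-1a merely stands in for whatever function the service really uses to derive an identifier from a string.

```python
import re

FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3


def name_to_id(name):
    """64-bit FNV-1a hash; a stand-in for the real (unspecified) function."""
    h = FNV_OFFSET
    for byte in name.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def parse_path(path):
    """Split a request path into (namespace, file_id, extension or None)."""
    namespace, _, rest = path.lstrip("/").partition("/")
    if not namespace or not rest:
        raise ValueError("expected /namespace/id or /namespace/n/string")
    _, dot, ext = rest.rpartition(".")
    extension = ext.lower() if dot and "/" not in ext else None
    if rest.startswith("n/"):
        return namespace, name_to_id(rest[2:]), extension
    match = re.match(r"\d+", rest)       # leading digits are the id
    if match is None:
        raise ValueError("missing numeric id")
    file_id = int(match.group())
    if file_id >= 2 ** 64:
        raise ValueError("id does not fit in 64 bits")
    return namespace, file_id, extension
```

With this, `parse_path("/me/305896160755729.jpg")` yields the namespace 'me', the id 305896160755729 and the extension 'jpg', which can then be used to look up per-extension rules such as the content type.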
We are still migrating existing static files into CloudFFS, but so far it has worked great for us: very fast access to data (0.005 seconds for a typical file), a few dozen files to manage instead of millions, and easy and efficient access to the files for our developers. Our forthcoming CloudFS project will provide more features (TB-scale files, random read/write access to files, distributed storage and fault tolerance, etc.), but this service is far more suitable for the kind of static files we have and keep creating every second. Maybe someday we will get to talk more about the services that we build and run in house.
Thursday, 27 January 2011 11:59 pm
iPhone 4 home screen
Stelios tagged me so here is my iPhone 4 home screen. I am supposed to pass the torch to someone else, so my brother will take it from here.
Thursday, 18 November 2010 8:33 pm
Developing for Android and Windows Phone 7
I spent a few (no more than 2) hours developing for Android, and the better part of a day building for Windows Phone 7 / Silverlight, mostly because I wanted to learn enough to understand the development model of both platforms and compare said models to iOS's.
I went through the Android source code when it was released (spending most of the time on the 'Skia' 2D drawing component) and worked through the basic concepts. It's been a long time since then.
I was drawn to Windows Phone 7 because I really like the UI. It's clean, simple and elegant. It's also fresh and unlike the iPhone's UI (everyone is copying Apple, left and right, as has always been the case). When Silverlight was announced, I looked into it but was put off by the use of XAML files and some weird naming decisions in the class tree. Other than building a trivial 'let's see what this is all about' application, I didn't spend more time on it.
As expected, Android and WP7/Silverlight also adopt the familiar views/controls paradigm. On Android, you've got tasks (processes) and each holds a stack of Activities; an activity is more or less a page that holds a content view. That content view is usually a container view that contains other views. Activities do not need to come from the task that owns the activities stack. They are popped in a LIFO fashion, and it's all simple and nice. You also get services (really, tasks with no front-facing UI, which are cool), intents (effectively, messages with an action and payloads) and other niceties.
It all more or less makes sense - the one thing I don't like about Android is the UI of the controls. Everything is ugly. The emulator is also slow and, well, ugly, which makes things look even worse than they probably are. Google is no Apple, sure, but they should have done something about it. It all reminds me of those Java Swing components (or even worse, the AWT components used in Java applets when applets were cool - which was a long, long time ago). You get to use Java to build the applications. I am not fond of Java, but I don't really mind it (a bit too high level for my taste, among other things). You also get to use Eclipse (which makes it really easy to build stuff, with intellisense, on-the-fly compilation and all those nice things people expect from IDEs nowadays), or any other IDE, or just the tools on the terminal if you don't like IDEs or for whatever other reason. That's what I did. The tools are easy to use and it takes very little time to feel comfortable enough with the environment.
I spent an hour or so trying to find my way around Windows Phone 7 Silverlight concepts and paradigms. Those XAML files, which I hated in Silverlight back in the day, were still there and I just didn't want to deal with them. It turns out that, unlike what the documentation may make you believe, you don't need to use them. You can delete them and do everything programmatically (it's not straightforward, but it makes sense once you have done it once or twice).
You need to decide if you want to build a Silverlight application or an XNA application. If you want to build 'high performance' games, you need to build an XNA app. Otherwise you will want to build a Silverlight app. There are concepts and classes that are unique to each approach, and it just doesn't feel right having to restrict yourself to either of those, as opposed to building an application that can access all facilities offered by the device. You are going to use C# to build Windows Phone 7 applications. C# is nice. Java on steroids. Visual Studio 2010 Express for Windows Phone is a free download from Microsoft, providing everything you need to build your applications.
So, you've got your Application instance (every WP7 application needs a class that derives from System.Windows.Application). Every application instance has a RootVisual property. It's the main application UI (a System.Windows.UIElement derived class instance). The convention/requirement on Windows Phone is to have a special class instance (PhoneApplicationFrame) set as the RootVisual, and that should hold a PhoneApplicationPage derived class instance. That page in turn holds a Content - which is the content view, etc. It's a similar concept to Android's activities and their content views, and to iOS's windows, navigation controllers and their views. Again, simple stuff - just make sure you stay away from XAML documents.
By the way, there is no support for multitasking on Windows Phone 7, unlike Android and iOS. There is no support for sockets either, which is weird and rather sad (access over HTTP and 'web services' is not good enough). Hopefully, this will change soon.
I am going to build a 'real' application for Windows Phone 7 in whatever spare time I have this week and submit it to their Marketplace. It should be fun; if nothing else, I should learn enough to help our Mobile Unit folks at work with upcoming Windows Phone 7 projects.
Sunday, 17 October 2010 1:10 am