Henri's Blog - Haskey: User-defined Schemas, Monad Transformers and Future Work (Summer of Haskell 2017)

In this blog post we present the features we’ve added to Haskey during the last weeks of the Summer of Haskell project. These features include user-defined schemas, multi-table support and a monad transformer that supports Haskey transactions. They are vital for the usability of Haskey, which means Haskey can finally be experimentally incorporated in serious projects. We also believe that the inclusion of these features provide the necessary functionality for Haskey to be a successful project in the context of Summer of Haskell 2017.

Haskey is an ACID compliant embedded key-value store entirely written in Haskell. It was developed as part of the Summer of Haskell 2017. This blog post is a follow-up to the previous blog post on Haskey: Introducing Haskey. Along with the features presented in this blog post, the current version of Haskey fixes many bugs that severely limited Haskey’s usability. As a result, you can now start using Haskey for serious experimentation. We also very much welcome pull requests and all criticism to help ensure the future of Haskey as a community project (more on that in the final section).

EDIT (9-Apr-2018): Note that this document might use outdated APIs. See this up to date tutorial, which covers a lot of the content in this article, to get started with Haskey.

User-defined schemas and multi-table support

Haskey stores all data in a tree-like structure called a B+-tree, where the nodes are identified by page numbers pointing to pages in the database. Two specially designated fixed pages contain the metadata of the database. The ConcurrentMeta data type holds this metadata. In the previous version of Haskey, the metadata contained one pointer to a B+-tree root, allowing us to only store one tree (also called table) in the database. In the current version, however, the ConcurrentMeta data type is parameterized over a user-defined root type, as can be seen below.

-- Simplified data type definition...
data ConcurrentMeta root = ConcurrentMeta {
    concurrentMetaRevision :: TxId
    -- Some record fields omitted...
  , concurrentMetaRoot :: root
    -- Some record fields omitted...
  }

This user-defined data type will be stored in the metadata and is passed along to and returned by transactions, as can be seen from the type signature of the transact and transactReadOnly functions. As long as our root type is an instance of the Root type class, this essentially allows the user to store and manipulate multiple database trees.

transact :: (MonadIO m, MonadMask m, ConcurrentMetaStoreM m, Root root)
         => (forall n. (AllocM n, MonadMask n) => root -> n (Transaction root a))
         -> ConcurrentDb root
         -> m a
transactReadOnly :: (MonadIO m, MonadMask m, ConcurrentMetaStoreM m, Root root)
                 => (forall n. (AllocReaderM n, MonadMask n) => root -> n a)
                 -> ConcurrentDb root
                 -> m a

The usage of lenses simplifies querying and manipulating a database with a user-defined schema. As can be seen in the full code example in the section below: Full code example.

The HaskeyT monad transformer

Furthermore, we’d also like to introduce the new haskey-mtl library. This library contains the HaskeyT monad transformer, which is an instance of the MonadHaskey type class, which has the following definition:

class Monad m => MonadHaskey root m | m -> root where
    transact :: Root root
             => (forall n. (AllocM n, MonadMask n) => root -> n (Transaction root a))
             -> m a

    transact_ :: Root root
              => (forall n. (AllocM n, MonadMask n) => root -> n (Transaction root ()))
              -> m ()

    transactReadOnly :: Root root
                     => (forall n. (AllocReaderM n, MonadMask n) => root -> n a)
                     -> m a

The HaskeyT monad transformer essentially allows you to incorporate access to a Haskey database in your application’s monad transformer stack, by using the runFileStoreT function. Let’s move on to a full code example.

Full code example

In this section we will dissect the example application included with the haskey-mtl library, of which the code can be found on GitHub. We will skip the imports and immediately jump to the definition of our application’s monad transformer stack. Our App monad is built using a ReaderT and a HaskeyT and can simply be run using the runApp function.

newtype App a = AppT (ReaderT String (HaskeyT Schema IO) a)
              deriving (Functor, Applicative, Monad, MonadIO,
                        MonadHaskey Schema, MonadReader String)

runApp :: App a
       -> String
       -> ConcurrentDb Schema
       -> FileStoreConfig
       -> IO a
runApp (AppT m) r = runFileStoreT (runReaderT m r)

Let’s now define the Schema of our database. We will store a collection of tweets, identified by their unique identifier, and a collection of users who have sent out those tweets, in two separate trees. The Schema data type instantiates the Root type class, which allows us to use it as the type parameter to a ConcurrentDb. We also define two lenses to access the fields in this data type, called schemaTweets and schemaUsers.


data Tweet = Tweet {
    tweetUser :: !Text
  , tweetContent :: !Text
  } deriving (Generic, Show, Typeable)

instance Binary Tweet
instance Value Tweet

data User = User {
    userName :: !Text
  , userEmail :: !Text
  } deriving (Generic, Show, Typeable)

instance Binary User
instance Value User

data Schema = Schema {
    _schemaTweets :: Tree Int64 Tweet
  , _schemaUsers :: Tree Text User
  } deriving (Generic, Show, Typeable)

instance Binary Schema
instance Value Schema
instance Root Schema

emptySchema :: Schema
emptySchema = Schema Tree.empty Tree.empty

schemaTweets :: Lens' Schema (Tree Int64 Tweet)
schemaTweets = lens _schemaTweets $ \s x -> s { _schemaTweets = x }

schemaUsers :: Lens' Schema (Tree Text User)
schemaUsers = lens _schemaUsers $ \s x -> s { _schemaUsers = x }

Lenses allow us to very easily query and manipulate trees in our custom schema. We define some functions using lenses to query and manipulate the database. We use functions like insertTree and lookupTree from the haskey-btree package to query and modify the underlying B+-trees. These actions run inside an AllocM or AllocReaderM monad which provide read-write and read-only manipulations of the B+-trees. We can use the transact and transactReadOnly functions from the MonadHaskey type class to execute these functions.

-- | Insert or update a tweet.
insertTweet :: AllocM n => Int64 -> Tweet -> Schema -> n Schema
insertTweet k v = schemaTweets %%~ insertTree k v

-- | Query all tweets.
queryAllTweets :: AllocReaderM n => Schema -> n [(Int64, Tweet)]
queryAllTweets root = toList (root ^. schemaTweets)

-- | Query a tweet.
queryTweet :: AllocReaderM n => Int64 -> Schema -> n (Maybe Tweet)
queryTweet k root = lookupTree k (root ^. schemaTweets)

-- | Insert a new user.
insertUser :: AllocM n => Text -> User -> Schema -> n Schema
insertUser k v = schemaUsers %%~ insertTree k v

-- | Query a user.
queryUser :: AllocReaderM n => Text -> Schema -> n (Maybe User)
queryUser userId root = lookupTree userId (root ^. schemaUsers)

We now have all ingredients to write our application using our custom App monad. Our application simply stores some tweets and users in the database, and then prints them all out to the console.

main :: IO ()
main = do
    let db = "/tmp/mtl-example.haskey"
    putStrLn $ "Using " ++ db
    main' db

main' :: FilePath -> IO ()
main' fp = do
    db <- flip runFileStoreT defFileStoreConfig $
        openConcurrentDb hnds >>= \case
            Nothing -> createConcurrentDb hnds emptySchema
            Just db -> return db

    runApp app "Hello World!" db defFileStoreConfig
  where
    hnds = concurrentHandles fp

app :: App ()
app = insertDefaultTweets >> printTweetsWithUser

insertDefaultTweets :: App ()
insertDefaultTweets = do
    transact_ $ \schema ->
        foldlM (flip $ uncurry insertUser) schema users
        >>= commit_

    transact_ $ \schema ->
        foldlM (flip $ uncurry insertTweet) schema tweets
        >>= commit_
  where
    users = [("foo", User "Foo" "foo@example.org"),
             ("bar", User "Bar" "bar@example.org")]
    tweets = [(1, Tweet "foo" "Hey, I'm Foo!"),
              (2, Tweet "bar" "Hey, I'm Bar!"),
              (3, Tweet "foo" "I like you, Bar!")]

printTweetsWithUser :: App ()
printTweetsWithUser = do
    tweets <- map snd <$> transactReadOnly queryAllTweets
    users  <- mapM (\t -> transactReadOnly $ queryUser (tweetUser t)) tweets
    mapM_ print' $ zip users tweets
  where
    print' (Just user, tweet) = liftIO . putStrLn $ unpack (userName user) ++ ": " ++ unpack (tweetContent tweet)
    print' (Nothing  , tweet) = liftIO . putStrLn $ "?: " ++ unpack (tweetContent tweet)

We now have a fully working application that uses a Haskey database to store and query information. When we run the application, we get the following output:

Using /tmp/mtl-example.haskey
Foo: Hey, I'm Foo!
Bar: Hey, I'm Bar!
Foo: I like you, Bar!

Outlook and future work

This blog post signals the end of my contributions to Haskey as part of the Summer of Haskell 2017 project. However, I hope to keep working on Haskey in the future, albeit to a lesser extent, due to my ongoing studies and other projects I’m planning to take on. It is our hope that the Haskell community will start experimenting with Haskey, report a lot of issues, make feature requests, and guide the future of Haskey, along with Steven and me. Anyone that is interested in contributing to Haskey can directly contact me for more information.

To give you an idea where Haskey is going, we’d like to give you the current state of affairs and an outlook on future work. Currently, Haskey is ready for serious experimental use. It provides user-defined schemas with multi-table support, concurrent readers and serialized writers, efficient disk space usage, and a monad transformer-based integration into your applications. Haskey is thus ready to be used in your application as a key-value store. However, the focus during this Summer of Haskell was to develop features and stabilize the binary format rather than to work on performance which we hope to address next to make Haskey performant, scalable and competitive against other solutions (with ACID guarantees) like SQLite or LMDB. We want to exploit the fact that Haskey is written in Haskell to gain better integration with Haskell’s runtime system – in particular its I/O and concurrency sub-systems – than solutions written in other languages. We hope this gives our project a competitive advantage.

In the future we would like to build on Haskey to provide persistence functionality at a similarly high level as acid-state. The acid-state package is a popular high-level persistence solution which is strikingly simple because it uses a pure Haskell value as the storage and consequently is unbeaten in transparently hiding the persistence layer. However, acid-state is an in-memory solution, which only writes logs and checkpoints to disk for durability, and comes therefore with a set of disadvantages. Keeping the state completely in memory is resource intensive and thus limits the scope of acid-state to applications which either have a small state or require the state to be in-memory anyway, e.g. because of latency requirements. For applications that have a large historic and inactive data set, but a small active data set, this is a waste of resources. Furthermore, loading the state and saving checkpoints may take a long time, since the complete state needs to be read or written. By building on Haskey we intend to develop a solution for applications with medium to large data sets that do not fit in-memory. Related to this idea are projects like vcache and muesli which we hope to either integrate with or learn from.

Next to a ready-made easy-to-use solution we would also like to provide more customizable functionality to users by exposing internal components of Haskey such as haskey-btree. That way we give users the opportunity to exploit application specific knowledge for optimizations and allow them to make their own trade-offs in terms of consistency and durability in exchange for throughput and latency. In particular in combination with future planned features like write-ahead logs, in-memory and on-disk caches with user-controlled eviction, optimistic concurrency control with user customizable collision resolution, e.g. via CRDTs (without replication), etc. Having Haskey implemented in Haskell makes it easy and efficient to directly integrate user-defined behavior at the algorithmic level.