Friday, December 14, 2018

How To Design For Big Data

This month's meetup was on the topic of designing technology services for growing data.

We also had short talks on making home sensor data open for exploration, and on a proposal that AI could improve game narratives.


A video of the talks is on the group youtube channel: [link]

Slides for Designing For Big Data are here: [link].

Slides for the Smartline project are here: [link].

Slides for the AI Game Narrative talk are here: [link].


Short Talks

Ian Mason from Exeter University explained the Smartline project, for-social-good research exploring sensor data from homes in Cornwall, and extended an invitation to participate in data mining hackathons.

The project uses data from sensors installed in selected Cornwall homes, and is investigating how that data can be used to improve the lives of residents, focussing on health and happiness. Examples include predicting house infrastructure faults, and monitoring energy consumption and cost.

The project is looking to open up the data to the wider community of data scientists and businesses for open-ended research that could provide additional value. This work could lead to further funded research or the development of business products.


The next steps are a formal launch of the data [event link], and subsequent workshops organised by Data Science Cornwall.

John Moore's talk proposed that, although games have drastically improved visual detail, rich interactivity, audio effects and overall immersion, the core narratives and storylines have not improved significantly. He suggested developing AI techniques to automate the creation of such narratives and plot-lines. This is a unique idea, and worth exploring. A suggestion from the audience was to explore existing efforts at generating works of fiction as a similar challenge.


The Data Success Problem

A successful product will often see a growth in the data it stores and processes.

The downside of this success is that:
  • data storage can grow beyond the limits of a single traditional database, and 
  • data processing can outgrow the processing and memory limits of a single computer.

Rob Harrison is experienced in designing, and remedying, architectures which support larger and growing data.


Scaling Basics

In the past, a common approach to dealing with growing data and processing needs was to grow your computer - a faster processor, larger storage and bigger memory - known as vertical scaling.

This can work up to a point, but very quickly the cost of large computers escalates, and they remain a single point of failure. Ultimately you won't be able to buy a big enough machine to match your data growth.


A better approach is to use multiple but relatively small compute and storage nodes. Collectively the solution can have a larger storage, processing or memory capacity, enough to meet your needs. Benefits of this horizontal scaling approach are:
  • cost - each unit is relatively cheap, and collectively cheaper than an equivalent single large machine.
  • resilience - if designed correctly, failure of some nodes isn't disastrous, and can often appear to end users as if there was no problem.
  • parallelism - lots of compute nodes give the opportunity to process data in parallel, improving performance for amenable tasks.
  • growth - the ability to incrementally add more (and just enough) nodes to meet growing demand.

This fundamental shift from vertical scaling to horizontal scaling has driven a wide range of changes in software and infrastructure architecture over recent decades, from the emergence of multi-core computing to distributed databases.


Assessing Products & Solutions

Rob presented a framework for assessing products and solution architectures. The key points can also be considered design features and principles for designing your own solutions where proportionate to your needs.


He noted that a good design doesn't limit storage - that would not be useful if your data grows. Furthermore, it should scale not just data reads but also writes. The latter is a tougher requirement and some products don't do this well.

Good architectures should be resilient to failure of individual components, and this is typically achieved through redundancy. Related to this is the ability to failover automatically. Too often, products remain off-line for hours as they fail over, which doesn't match modern customer expectations.

Rob made an interesting point about horizontally scaled solutions that break down if there is any heterogeneity in technology versions. In some sectors, it is a requirement to run several versions of a technology so that a fault affecting one specific version can't cause total failure.

He also made a point about technology that is aware of its physical location, within a rack or a geographical zone, to better optimise data flows amongst nodes.

An important point made by Rob is that any solution shouldn't lock you into a single vendor's products, or indeed a single cloud. This requirement for sovereignty over your own data underlines the importance of open source technology and cloud vendor agnostic platforms, such as Docker.


State Then And Now

State is just a word that describes the information in, and configuration of, a system. Applied to technology products, the state often refers to user data. It is this data that grows as a service grows.

The complexity of state is the major driver of how complex a system is to manage and scale.

The standard architecture of many technology platforms is often described with a diagram like the one below.


It shows a user device accessing a service over the internet, with requests being distributed to one of several web servers by a load balancer. We can add more web servers to meet demand if needed. So far that's fairly resilient and scalable.

The diagram shows all those web servers sharing a common database and perhaps a common file server. These are the single points of failure and potential bottlenecks.

Rob then shared a more complex architecture that more truthfully reflects the reality of today's services.


Often the application logic is not located in web/application servers, but in an app on the user's device. That device will have its own local data store. The device will connect to API servers over the internet, but that connection is rarely permanent, and sometimes poor. The server side datastore is often not a traditional relational database but a less-rigid data store that can store data structures that better match the application, JSON documents for example. The term NoSQL has emerged to describe a broad range of such data stores.
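As a rough illustration (the field names here are invented), a document store can hold a user's profile as a single JSON document shaped like the objects the application already works with, rather than as rows spread across several relational tables:

    import json

    # A hypothetical user profile stored as one JSON document,
    # rather than rows spread across several relational tables.
    user_document = {
        "user_id": "u-1234",
        "name": "Alice",
        "devices": [
            {"device_id": "d-1", "type": "thermostat"},
            {"device_id": "d-2", "type": "humidity-sensor"},
        ],
        "preferences": {"units": "metric", "alerts": True},
    }

    # Serialised and stored, then retrieved, as a single unit.
    print(json.dumps(user_document, indent=2))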

Many of these modern NoSQL data stores offer a range of choices for how they scale, allowing developers to accept a reduction in consistency in return for much easier scaling and resilience. Many applications don't need immediate consistency, and can tolerate a write operation taking its time to replicate asynchronously to multiple nodes.

Rob's picture also refers to OS for object store. These store data in the form used by applications, and not broken into fields and forced into tabular form, only to be reconstituted when queried. He also refers to PNS for publish and subscribe - a pattern for publishing messages to be picked up by interested subscribers when they're ready. This loosely coupled approach works well for synchronising state across intermittent internet connections.
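To make the publish and subscribe idea concrete, here is a minimal in-process sketch (a toy, not a real broker): publishers and subscribers only share a topic name, and subscribers drain their queue whenever they're ready.

    from collections import defaultdict
    from queue import Queue

    class PubSub:
        """Toy publish/subscribe broker held in memory."""

        def __init__(self):
            self.queues = defaultdict(list)  # topic -> subscriber queues

        def subscribe(self, topic):
            q = Queue()
            self.queues[topic].append(q)
            return q

        def publish(self, topic, message):
            for q in self.queues[topic]:
                q.put(message)

    broker = PubSub()
    inbox = broker.subscribe("state-updates")

    broker.publish("state-updates", {"device": "d-1", "temp_c": 19.5})

    # The subscriber picks up messages later, at its own pace.
    while not inbox.empty():
        print(inbox.get())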

Another advantage of some modern NoSQL data stores is their flexibility with data schemas. That is, they don't firmly insist that stored data all conforms to the exact same schema. This makes iterative, agile development of applications much easier. An up-front one-shot data design is unlikely to meet future needs as a product evolves.
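For example, documents written by different versions of an application can sit side by side in the same collection, with no migration step (invented fields again):

    # Early releases stored only a temperature reading.
    reading_v1 = {"device": "d-1", "temp_c": 19.5}

    # A later release added a battery level; old and new documents
    # coexist in the same collection without a schema migration.
    reading_v2 = {"device": "d-2", "temp_c": 20.1, "battery_pct": 87}

    for doc in (reading_v1, reading_v2):
        # Application code simply tolerates the missing field.
        print(doc["device"], doc["temp_c"], doc.get("battery_pct", "n/a"))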

Rob offered some wisdom in his analysis of the emergence of modern NoSQL data stores:
  • Traditional relational databases emerged from a time when storage was expensive, and so huge effort was put into normalising data to reduce duplication. Today storage is cheap, and the cost of normalising and denormalising, and the risks of a fixed schema, are no longer tolerable. 
  • Today's data stores are better matched to the needs of application developers. They are able to store objects in formats closer to those in the application, such as key-value pairs and JSON objects, binary objects and even graphs of linked entities. Flexible schemas support the reality of applications that iterate and evolve.
  • Today's data stores aim to meet internet-scale demand and the availability expectations of empowered customers. 


What's So Wrong With Traditional Databases?

Rob discussed the challenges of traditional databases throughout his talk so it is worth summarising the key points:

  • Relational Database Management Systems (RDBMS) were, and still are, the most common kind of database in use. Today's popular databases like PostgreSQL and MySQL, though extended with modern capabilities, have their roots in decades old design assumptions.
  • One assumption is that data storage is expensive, and that effort put into normalising data, linked by keys, is worth the effort of decomposition and reconstitution. This assumption isn't particularly valid today.
  • Although the data model of fields in tables linked by keys can be useful, it can sometimes lead to catastrophic performance failures when the right indexing isn't anticipated or the joining of such fields becomes complex.
  • Many traditional relational databases aim to be ACID compliant, which can be simplified as being always correct at the expense of latency. This can lead to performance bottlenecks as an entire table or even database is locked during an update. Many of today's applications don't need this level of transactional consistency, and many only require eventual consistency.
  • Traditional relational databases weren't initially designed to be horizontally scaled across nodes. This means that attempts to make them work this way are more of a retro-fit than an engineered solution. Issues and challenges include locking across nodes and the inability to handle data inconsistencies caused by inevitable network failure, exacerbated when the links are across distant geographies. Aside from infrastructure, the challenges continue at the logical level: for example, keeping universally unique identifiers for database keys consistent across multiple instances of a traditional database is a challenge, particularly when rebuilding a failed database (see the key generation sketch below).
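On that last point, one common workaround is to drop auto-incrementing keys altogether and have any node generate collision-resistant identifiers without coordination, for example with UUIDs (a sketch of the idea, not a complete keying strategy):

    import uuid

    def new_record_key():
        # A random (version 4) UUID can be generated on any node with a
        # vanishingly small chance of collision, so no central "next id"
        # counter needs to be shared or locked across the cluster.
        return str(uuid.uuid4())

    print(new_record_key())  # e.g. '0f8fad5b-d9cb-469f-a165-70867728950e'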

A standard approach to scaling relational databases is sharding - splitting your data, and hence your queries, amongst your nodes. The following picture shows the simple splitting of database queries so that those related to user names beginning with A-D are processed by the first shard, and those with U-Z are processed by the last shard.


Although this seems like a fine idea, it fails when a shard can no longer meet demand and further scaling is needed. Unique database keys make migrating data to a new shard layout very difficult without rebuilding all the data again. Another problem with sharding is that the demand profile can change, leading to over-used and under-used shards. Again, rebalancing the queries requires a database rebuild as we can't simply shift data from one shard to another.
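A minimal sketch of that routing rule (invented shard names) shows why rebalancing hurts: the mapping from username to shard is baked into the application, so adding a shard or changing the ranges changes where existing data should live.

    # Hypothetical fixed mapping from first letter of username to shard.
    SHARDS = {
        "abcd": "shard-a-d",
        "efghijkl": "shard-e-l",
        "mnopqrst": "shard-m-t",
        "uvwxyz": "shard-u-z",
    }

    def shard_for(username):
        first = username[0].lower()
        for letters, shard in SHARDS.items():
            if first in letters:
                return shard
        return "shard-a-d"  # fallback for non-alphabetic names

    print(shard_for("alice"))   # shard-a-d
    print(shard_for("zafira"))  # shard-u-z

    # Adding a fifth shard means rewriting the mapping and moving every
    # record whose username now routes to a different shard.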

In summary, traditional relational databases were fine for the time they were initially designed. They are useful today, but we now have more choices for data storage and processing that better match modern internet-scale design goals and user expectations.


Modern Tech Stack

Rob then talked us through the technology stack from low-level memory up to file servers, comparing historic approaches and modern technologies.


Memory Is Faster Than Disk

An early method of accelerating traditional web applications was the use of memory-based caches. That is, temporary in-memory stores that very rapidly serve responses to queries that have been seen before, and for which the response has already been calculated once.

Although the original setting for these caches was web queries and responses, the idea easily generalises to other kinds of queries and responses, including data store queries and responses.

Today these memory-based caches can coordinate across multiple nodes, allowing the scaling of the cache beyond the limits of a single machine.

Memcached and Redis are two popular choices for a memory-based cache for data stores, with Memcached being a better choice for simpler and smaller data structures, and Redis being better for more sophisticated data structures.
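As a minimal sketch of the cache-aside pattern, assuming a local Redis instance and the redis-py client, with load_user_from_database standing in for whatever slow data store query the cache is protecting:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def load_user_from_database(user_id):
        # Placeholder for the expensive data store query.
        return {"user_id": user_id, "name": "Alice"}

    def get_user(user_id):
        key = "user:{}".format(user_id)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit: no database work
        user = load_user_from_database(user_id)
        r.set(key, json.dumps(user), ex=300)   # cache the answer for 5 minutes
        return user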

Such memory-based caches should be considered as accelerators reducing the load on data stores, and not as key methods for scaling the stores themselves.


NoSQL

Rob proceeded to discuss examples of so-called NoSQL databases, built to different design goals from traditional relational databases, and which aim to be much more easily scalable and flexible for application developers.

To explain the terminology, SQL is the query language most commonly used with relational databases. It was developed in the early 1970s! Modern data stores initially didn't use this language, so they became termed NoSQL, but today some do offer the ability to run SQL-like queries, and so NoSQL can also mean "not only SQL".

Rob explained how Google was an early leader with BigTable, which can be thought of as a two-dimensional key-value store, with one key being a row and the other being a column identifier. The contents of the value don't need to conform to a schema. Google describes their public BigTable service as suitable for petabyte scale data whilst still providing a performant service (sub-10ms latency). A number of Google's own services are thought to run over their own BigTable implementation.
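As a rough mental model (an illustration of the data model, not the Bigtable API), you can picture the store as a dictionary keyed first by row and then by column, with nothing imposed on the values:

    # Toy model of a wide-column store: row key -> column key -> value.
    table = {
        "user#1001": {"profile:name": "Alice", "profile:city": "Truro"},
        "user#1002": {"profile:name": "Bob", "metrics:last_login": "2018-12-14"},
    }

    # Reads address a (row, column) pair directly, and rows need not
    # share the same set of columns.
    print(table["user#1001"]["profile:name"])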

Today BigTable supports a wide range of businesses, processing a wide range of data from financial time series data to user profile data.

Facebook open-sourced its implementation of a similar wide-column database, Apache Cassandra. Because it is open source, we can see how it works. Key design features include:
  • Distributed across nodes, where every node has the same role, if not the same data. There is no single point of failure, and no concept of a master or slave hierarchy.
  • Aims to scale both reads and writes fairly linearly as nodes are added.
  • Fault-tolerant to node failure, and designed for nodes to be replaced with no service downtime.
  • Tuneable consistency, from "issue the write and don't worry when it completes" to "block everything until the data is written to at least 3 nodes, and they all confirm it" - see the sketch below.
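A sketch of how that tuning looks from application code, assuming the DataStax Python driver and an invented keyspace and table:

    from datetime import datetime

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    session = cluster.connect("sensors")  # hypothetical keyspace

    # ONE: return as soon as a single replica acknowledges the write.
    fast_write = SimpleStatement(
        "INSERT INTO readings (device, ts, temp_c) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)

    # ALL: block until every replica holding the row confirms the write.
    safe_write = SimpleStatement(
        "INSERT INTO readings (device, ts, temp_c) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ALL)

    session.execute(fast_write, ("d-1", datetime.utcnow(), 19.5))
    session.execute(safe_write, ("d-1", datetime.utcnow(), 19.6))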

These are typical objectives of several modern NoSQL data stores, and the contrast with traditional databases is stark!

Cassandra has been used successfully by organisations such as CERN, Apple, and Netflix.

Rob also mentioned Couchbase, which is often used to store JSON objects, ideal for application developers and supporting REST APIs. The following chart shows its write performance compared to other NoSQL databases. Couchbase remains performant at 20,000 writes per second compared to MongoDB which degrades at 6,000 on comparable systems.


You can read more about how Couchbase on Google Cloud infrastructure reached 1 million writes per second, of over 3 billion items, using just 50 nodes [link].


File / Object Storage

Sometimes our data just isn't particularly structured and would normally be stored on a file server.

Modern options for storing arbitrary files or objects include the popular Amazon S3 service. Although not particularly fast, such services are very cheap and very resilient, offering availability of 99.9%, which means it should not be down for more than 43 minutes per month.

After Amazon's lead, other providers have offered competing object storage. Almost all of them are designed to be accessed programmatically, offering easy to use REST APIs. Some even offer APIs compatible with Amazon's S3 to ease development, migration and multi-vendor or hybrid-cloud architectures.

The term object here just means an arbitrary bunch of data, like a file, rather than a structured object used by application code.

An illustrative use for such low-cost, if not very fast, object stores is for user generated photos.
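A minimal sketch of that photo use case using boto3, the Python SDK for AWS; the bucket name is invented and assumed to already exist, with credentials coming from the environment:

    import boto3

    s3 = boto3.client("s3")

    # Store a user-uploaded photo as an object; the key is just a name.
    with open("holiday.jpg", "rb") as photo:
        s3.put_object(Bucket="example-user-photos",
                      Key="users/1001/holiday.jpg",
                      Body=photo,
                      ContentType="image/jpeg")

    # Later, fetch it back to serve or process it.
    obj = s3.get_object(Bucket="example-user-photos",
                        Key="users/1001/holiday.jpg")
    image_bytes = obj["Body"].read()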


Distributed Computation

Distributing the storage and retrieval of data from one node to many nodes can improve performance and resilience. The same can also work with computation.

Instead of a single node performing calculations on data, the computation can be spread over many nodes. Many, but not all, tasks are amenable to being split into smaller parallel tasks. The performance benefit of performing these tasks in parallel is very attractive.

A good illustration of the idea is the task of counting the number of words in a book. One node can count the words from the first to last page. Alternatively, many nodes could be working on a chapter each, working in parallel. If those nodes weren't sufficiently performant, those chapters could be further divided into paragraphs distributed to further nodes to count.
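A sketch of that split in Python, with a process pool standing in for the worker nodes: the per-chapter counting is the "map" step and combining the partial counts is the "reduce" step.

    from collections import Counter
    from multiprocessing import Pool

    def count_words(chapter_text):
        # Map step: each worker counts the words in its own chapter.
        return Counter(chapter_text.split())

    def merge_counts(counters):
        # Reduce step: combine the partial counts into one total.
        total = Counter()
        for c in counters:
            total.update(c)
        return total

    if __name__ == "__main__":
        chapters = ["it was the best of times",
                    "it was the worst of times"]
        with Pool() as pool:
            partial_counts = pool.map(count_words, chapters)
        print(merge_counts(partial_counts))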


The most famous example of this approach is Google's MapReduce, an idea which has been implemented by others. Apache's Mahout is a notable example of numerical and machine learning algorithms implemented over a map-reduce framework.

Rob's discussion led to Hadoop, probably the most recognisable name amongst the new wave of data storage and compute technologies. Hadoop is not one technology but a collection, including a distributed filesystem (HDFS), a wide-column store (HBase), a MapReduce engine, and task managers.

Hadoop is often lauded as the solution to many data problems, but its main focus is distributed computation, and other technologies may be better choices if your task is storing and retrieving data but not distributed computation.


Functional Programming

Rob also shared his thoughts on application development. One theme was particularly interesting.

It is a reality that many programming languages don't protect against accidental changing of data beyond the intended scope. As application code and logic gets bigger and more complex, the risk of these unintended side-effects grows. Add parallelism to this mix and we have a new class of potential errors.

Functional programming languages impose a restriction on functions such that they can't have any side-effects, except those that are explicitly intended, and even these are tightly controlled. This is done not by complexity but simplicity. The benefit is that sophisticated and complex applications can be composed of these basic functions, and we can verify that the fuller code also doesn't have unintended side-effects.

As a bonus, the implementation of these functions requires them to be independent in operation, and that makes them very easily parallelisable. Functional programs naturally benefit from multi-core and multi-node hardware without additional effort on the part of the developer.
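A small illustration of the principle in Python (a sketch of the style rather than of any particular functional language): because square below reads nothing but its argument and changes nothing outside itself, the calls are independent and could run in any order, or in parallel, without locks.

    from functools import reduce

    # Pure functions: no shared state is read or modified.
    def square(x):
        return x * x

    def add(a, b):
        return a + b

    values = [1, 2, 3, 4]

    # Composing pure functions keeps the larger program free of
    # unintended side-effects, and each square(x) call is independent.
    sum_of_squares = reduce(add, map(square, values), 0)
    print(sum_of_squares)  # 30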

The last five years have seen a growth in demand for functional programmers in some sectors, because large complex distributed applications are easier and safer to build in functional languages.


Conclusion

Rob's talk was well anticipated and well received. Many members commented that his talk had forced them to rethink their assumptions and designs, or broadened their options for future projects.

My own summary of Rob's message is that today's data technologies:
  • aim to support internet-scale services, and the high expectations of modern users.
  • are designed from the ground-up to scale horizontally.
  • allow application developers to choose their own balance between performance, consistency and availability. 




More Reading