Data Buckets

For me, software development is just a nice way of saying ‘bit moving’. A good friend of mine used to describe himself as a bit reorganizer. We rearrange invisible magnets, he would say, setting their tiny arrows of residual magnetism to point this way or that. We are a bunch of “bitniks”, and we are all about data.

Application design and development has been my main source of income for the last decade or so, and it struck me as odd that there are so few terms to describe so many kinds of data.

It occurred to me that the Eskimos have their fourteen words for snow, and they say that the Bedouins have nine words to describe sand. I felt so alone. I felt a need to discover my own flavors of data. It took me a while, but then, in a single perfect moment of clarity, I realized what lay before me.

The orchestration of the moment was this: in the middle of a design meeting, yelling and shouting all around, we were discussing optimization and performance, and spirits were high. My thoughts went back to when I first learned about application design. The fact is that when developing any business application, the first step is to determine the set of business flows that describe the scope and functionality of that application within the organizations it’s meant to serve.

Listing these business flows by rank and cardinality is no bother at all. The simplest evaluation I could think of is by frequency of use and the sheer number of users that would eventually use each flow. I thought that categorizing the body of data by the same yardstick could give me the flavors of data I was looking for.

Java is a wonderful language, my favorite actually, versatile and strong. In the context of this discussion (!), Java has one drawback: the Java Virtual Machine sits far, so far away from the data acted upon. Unlike the hieroglyphic COBOL, Java needs special machinery to access its data. In this case, the number of solutions testifies to the complexity of the problem. It’s safe to say that there are absolutely no free meals: every solution ever invented to accommodate the data access issue carries its own cost and complications. Careful mapping of the data orientation by category and flavor might reduce the friction in complex systems that depend on the availability of massive bulks of data.

And here I am getting to my point: mapping the data reduces the friction in complex systems, and mapping the data needs more flavors.

The topology of data within an application

I have managed to put together five distinct flavors of data, but first I will describe my case-study application and define the yardstick I am using to categorize data. My example is an application that sells insurance policies. The simplified outline of such an application would have a customer base and a product list. It would also include a process for selling insurance policies, implemented by stapling products to customers. To make it interesting, I will also refer to the use of external services and fixed configuration. The yardstick for data categorization is determined by measuring the cardinality of each of the application’s work flows. It is easy to see that in the hierarchy of business flows, the main work flow is the selling of an insurance policy (stitching the customer to the product), followed by the work flow for managing the customer base. Far behind would be the work flows for creating, versioning and maintaining the product list.
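As a rough sketch of that outline (the class and field names here are my own invention, not taken from any real system), the stitching between customers and products might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain sketch: a policy "stitches" a customer to products.
class Customer {
    final long id;
    final String name;
    Customer(long id, String name) { this.id = id; this.name = name; }
}

class Product {
    final long id;
    final String description;
    Product(long id, String description) { this.id = id; this.description = description; }
}

// The stitching table: the products sold to one customer under one contract.
class Policy {
    final Customer customer;
    final List<Product> products = new ArrayList<>();
    Policy(Customer customer) { this.customer = customer; }
    void addProduct(Product p) { products.add(p); }
}
```

The interesting observation is that each class above lands in a different bucket: the Policy rows churn constantly, the Customer rows change often, and the Product rows hardly ever change.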

Applicative data bucket

The applicative data bucket is the body of data that is manipulated by the main business flow and has the highest rate of change. I’m strictly speaking of altered (modified) data only. In my example application, this would be the data handled in the policy-selling work flow. The data consists of the stitching tables between the products and the customer. The stitching tables may also describe a single shopping cart or a single contract with the customer. In many cases, the data that is added or modified stays within the boundaries of a single session, so there may be no reason for cache optimization. In cases where concurrent changes to the same data are permitted, the synchronization between sessions should be handled with great care and an understanding of the business implications. It’s important to remember that deadlock issues are ten times easier to handle from the business standpoint than technologically.

There are several solutions for second-level caching; just to name a couple, there is the Ehcache project, which I’m using in the product I’m working on, and I have also heard about the Terracotta project. Implementing the second-level cache independently is also an option; it is an easy implementation as long as the cache stays on a single virtual machine. But scaling a cache solution to a clustered environment is a different ball game, and in that particular case my policy is to make use of the effort of others, and not waste my own resources replicating an existing product.
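To show just how easy the single-JVM case is, here is a minimal thread-safe sketch (the names are mine, and it deliberately ignores eviction and clustering, which is exactly where products like Ehcache or Terracotta earn their keep):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal thread-safe read-through cache for a single JVM.
// No eviction, no replication -- the point where a real cache product takes over.
class SimpleCache<K, V> {
    private final ConcurrentHashMap<K, V> store = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SimpleCache(Function<K, V> loader) { this.loader = loader; }

    // Loads through on a miss; computeIfAbsent is atomic per key.
    V get(K key) { return store.computeIfAbsent(key, loader); }

    // Called when the underlying data changes in another session.
    void invalidate(K key) { store.remove(key); }
}
```

Scaling this beyond one virtual machine means replicating or invalidating entries across the cluster, which is the hard part best left to an existing product.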

Reference data bucket, first degree

The first degree of reference data is data that is used as read-only by the main business flow. Yet this referenced data is a moving target, since close-by flows (flows that are ranked closely to the main flow) change it constantly. In the example application, the customer base management work flows rank second. The customer base is modified intensively whenever additional customers are introduced to the database or existing customers change status and details. Having two possible concurrent sessions (the main business session and the customer database update session) both accessing the customer data requires special attention and awareness of the business implications of concurrent modification. In the calculation of insurance premium rates, the payment is determined according to personal parameters and the record of each individual customer, so changes need to be communicated instantly. Therefore, synchronization of the concurrent sessions is a must. A second-level cache, or any other innovative solution that allows live cache updates between the competing sessions, is advised. However, since the messages are sent one way, some application friction may be reduced.
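One hedged way to picture that one-way channel (the method names below are my own sketch, not any particular messaging API) is a local customer cache that the maintenance flow pushes changes into, while the selling flow only ever reads:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the selling session holds a read-only copy of customer records;
// the customer-maintenance flow pushes changes to it one way.
class CustomerCache {
    private final Map<Long, String> customers = new ConcurrentHashMap<>();

    // Read side: called by the main business flow when rating a premium.
    String get(long customerId) { return customers.get(customerId); }

    // Write side: called when an update message arrives from the
    // maintenance flow. The selling session never writes customer data back.
    void onCustomerChanged(long customerId, String updatedRecord) {
        customers.put(customerId, updatedRecord);
    }
}
```

Because updates flow in one direction only, there is no write conflict to resolve on the selling side, which is where the friction reduction comes from.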

Reference data bucket, second degree

The second degree of reference data is data that is referenced by the main business work flow as read-only, yet is changed by other business flows at a very low frequency. In my example application, these are the business flows that manage and maintain the product lists. Product list maintenance is usually handled by a few individuals in the organization, and the rate of change is very low, since the products in the list go through meticulous testing and examination before “going public”. On top of that, the products relevant to the main business flow for selling insurance policies are the subset of products that are complete and ready to be sold. This implies that in relation to the selling work flow, the product list is static. The caching implementation for the product list could then be very simple: a cache pocket for the product list could be refreshed by messaging or on a time basis, while remaining static as far as the main business flow is concerned.
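A time-based cache pocket really can stay that simple. In this sketch (my own names, with a supplier standing in for the real product query), the list is reloaded only when the cached copy is older than a fixed interval:

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch: a cache pocket that treats the product list as static,
// refreshing it only when the copy is older than maxAgeMillis.
class ProductListCache {
    private final Supplier<List<String>> loader; // stands in for the real query
    private final long maxAgeMillis;
    private List<String> products;
    private long loadedAt;

    ProductListCache(Supplier<List<String>> loader, long maxAgeMillis) {
        this.loader = loader;
        this.maxAgeMillis = maxAgeMillis;
    }

    synchronized List<String> getProducts() {
        long now = System.currentTimeMillis();
        if (products == null || now - loadedAt > maxAgeMillis) {
            products = loader.get();  // refresh from the product tables
            loadedAt = now;
        }
        return products;
    }
}
```

As far as the selling flow is concerned, the list never changes mid-session; staleness is bounded by the refresh interval, which the business can pick.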

System configuration

Data that is retrieved from system configuration, property files, XML data structures or tables belongs to the data bucket that reflects changes only when the system is rebooted. In the example application, this bucket may contain anything from connection pool sizes to i18n (internationalization) bundles. In my view, the special attention here goes to deciding what not to cache.
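Read once at boot, then treat as frozen; a minimal sketch with java.util.Properties (the keys and the in-memory source are made up for illustration) looks like:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch: configuration is loaded once at startup and treated as
// immutable until the next reboot -- no invalidation to design.
class SystemConfig {
    private final Properties props = new Properties();

    SystemConfig(String source) {
        // In a real application this would read a file or a config table.
        try {
            props.load(new StringReader(source));
        } catch (IOException e) {
            throw new IllegalStateException("bad configuration source", e);
        }
    }

    int getInt(String key, int fallback) {
        String v = props.getProperty(key);
        return v == null ? fallback : Integer.parseInt(v.trim());
    }
}
```

The "what not to cache" question shows up here as the fallback: values that must always be fresh simply should not be in this bucket at all.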

External data services

External services include a wide range of functions that have only one aspect in common: the implementation of these data sources is out of scope for the application being developed, and sometimes outside the company. My way of approaching cache optimization for external data services is to use the same ranking method as before, but to decide for which of the services caching is irrelevant and for which it would be beneficial. For the insurance example, I would never cache services that have a narrow scope of relevance; a service that validates bank accounts is too volatile to cache. On the other hand, I might consider caching data for age-group premium rates. My view of best practice in optimizing for external data services is to exclude the task of updating the cached data from the thread that services the business flow. Instead, I would consider maintaining an independent thread that checks for data modifications every once in a while and updates the cached data independently.
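That independent refresh thread might be sketched with a ScheduledExecutorService (the rate supplier here is a stand-in for the real external call; the names are my own):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch: the business-flow thread only ever reads the cached value;
// a background task refreshes it on its own schedule.
class ExternalRateCache {
    private final AtomicReference<Double> rate = new AtomicReference<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    ExternalRateCache(Supplier<Double> externalService, long periodSeconds) {
        rate.set(externalService.get()); // prime the cache once, up front
        scheduler.scheduleAtFixedRate(
                () -> rate.set(externalService.get()),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    // Never blocks on the external service.
    double currentRate() { return rate.get(); }

    void shutdown() { scheduler.shutdown(); }
}
```

The business flow pays the latency of the external call exactly once, at startup; after that it only ever reads a local value, and a slow or flaky external service degrades freshness rather than throughput.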

My Eskimo vocabulary

The basic motivation for refining the resolution of data terminology is the optimization of cache implementations. There is an Eskimo saying that premature optimization is the root of all evil (no it’s not, I made that up). In most cases this is absolutely true, but I would like to argue that cache optimization is so fundamental that it has to be addressed in the early stages of application design. In any case, I have found my little Eskimo vocabulary for data in an application, and I am happy.

my blog: HANDS ON
