
Tuesday, March 31, 2009

The importance of semantics mediation




See here how to mediate semantics.


Wednesday, March 26, 2008

Transforming Canonical Messages: answer to a reader's comment

A reader commented on my posting: Canonical Data Model is the incarnation of Loose Coupling. Let me walk through the comment:

I hope I understand you: A data provider sends its data in its own format.

Yes, that is correct.

A data consumer receives this message, converts it to a canonical data model, possibly based on the message type, and then transforms it to its own format.

No, that is not correct. The message is converted to a canonical format by a generic transformation service. This service queries the canonical data model to get the transformation rules. The canonical message is published for consumption by any interested endpoint. Before consumption by an endpoint, another generic service converts the message from the canonical format to the endpoint's format. So the endpoint consumes the message in its own format.
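
To make this concrete, here is a minimal sketch (in Python) of the two generic transformation services, with the canonical data model reduced to a lookup of transformation rules; all names are hypothetical:

    from typing import Callable, Dict, Tuple

    # A rule maps a message (as a dict) to or from its canonical form.
    TransformRule = Callable[[dict], dict]

    class CanonicalDataModel:
        """Metadata only: transformation rules, no application data."""
        def __init__(self) -> None:
            self.to_canonical: Dict[Tuple[str, str], TransformRule] = {}
            self.from_canonical: Dict[Tuple[str, str], TransformRule] = {}

    def transform_in(cdm: CanonicalDataModel, source_format: str,
                     msg_type: str, msg: dict) -> dict:
        """Generic service near the provider: local format -> canonical message."""
        return cdm.to_canonical[(source_format, msg_type)](msg)

    def transform_out(cdm: CanonicalDataModel, target_format: str,
                      msg_type: str, canonical: dict) -> dict:
        """Generic service near the consumer: canonical message -> local format."""
        return cdm.from_canonical[(target_format, msg_type)](canonical)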

All of this is happening within the "global data space" layer.

Yes.
(You probably would merge the transformation rules, instead of performing two transformations)

No. The messages are converted near the endpoints; there will always be an intermediate canonical instance of the message traveling across the global data space. This simplifies the mechanism. If there are multiple data providers and/or multiple data consumers, merged transformation rules would lead to a combinatorially increasing number of transformations (one for every source-target combination), and multiple instances (different formats) of the message would travel across the global data space. See picture below.



The picture shows one message type that is provided by two different sources and consumed by four targets. The left-hand side shows direct transformations, whereas the right-hand side shows an intermediate canonical message instance.
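
The arithmetic behind this is easy to sketch: direct transformations require one rule per source-target pair, whereas the canonical approach requires one rule per endpoint.

    sources, targets = 2, 4

    direct = sources * targets     # one transformation per source-target pair: 8
    canonical = sources + targets  # one per endpoint, to or from the canonical format: 6

    # The gap widens quickly: with 10 sources and 10 targets it is 100 versus 20.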

If I am correct so far - please interrupt at any time ;-) - then both endpoints are completely decoupled.

Yes.

Let's assume that a new data consumer needs an additional piece of information, a piece of data which can be provided by the data provider. Wouldn't that mean that I have to change the transformation rules for both end points, because the canonical data model gets an additional field?

Yes, if the new data was not foreseen at design time of the canonical message, you will have to extend the transformation rules in the canonical data model AND have the provider deliver the new data. But if the data had been available, it would have been wise to model it into the canonical message, even if it was not required at that moment.

If the data is not available you might add a new service that enriches the original message. This pattern is known as the VETO pattern (Validate, Enrich, Transform, Operate).
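
A minimal sketch of such an enrichment step; the lookup source and field names are hypothetical:

    def enrich(canonical_msg: dict, reference_data: dict) -> dict:
        """Add data the provider did not deliver, from a separate lookup source.
        reference_data stands in for a database or service call; the field
        names are made up for illustration."""
        enriched = dict(canonical_msg)
        enriched["creditRating"] = reference_data[enriched["customerId"]]
        return enriched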

By modeling the canonical messages from an event-driven perspective - messages representing relevant business events - and not from a "currently required data" perspective you might decrease the need for change.

From a deployment view the whole "global data space" layer would become an atomic unit: a piece that can only be deployed in one piece. Is that a good idea when talking about a major backbone in the corporate IT environment?

No, not quite. You should think of federated infrastructures for the global data space as well as for the canonical data model.

Domains need only know their own formats and semantics, plus the canonical formats and semantics - not those of other domains. Relevant canonical formats and semantic definitions could be pushed to the domains in a federated model.

If you don't have a federated bus infrastructure, messages can still be propagated across multiple bus implementations, as depicted below.



A service subscribes to a published message in Bus 1 and calls (synchronously) a service in Bus 2 to pass the message reliably. The called service republishes the message in Bus 2. This is a simple method to pass published messages across multiple independent service bus infrastructures that are unaware of each other and yet are part of one global data space.
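
A minimal sketch of this bridging construction, with a toy publish/subscribe class standing in for the real bus infrastructure (all interfaces are hypothetical):

    class Bus:
        """Toy stand-in for a service bus with publish/subscribe."""
        def __init__(self) -> None:
            self.subscribers = {}

        def subscribe(self, topic, handler):
            self.subscribers.setdefault(topic, []).append(handler)

        def publish(self, topic, msg):
            for handler in self.subscribers.get(topic, []):
                handler(msg)

    bus1, bus2 = Bus(), Bus()

    # The bridging service: subscribed on Bus 1, republishes on Bus 2.
    # In a real setup the hand-over would be a reliable synchronous service
    # call into Bus 2's infrastructure, not a local function call.
    bus1.subscribe("order.created", lambda msg: bus2.publish("order.created", msg))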

See also a nice article I referred to in this blog about a distributed implementation of the global data space.

Saturday, March 22, 2008

Canonical Data Model is the incarnation of Loose Coupling

Quote from a reader of my blog with regard to the Canonical Data Model:

The main issue I have is that someone has to come up with a data model that includes the information required by everyone - a superset - rather than the subset that a point-to-point connection requires. It seems to me that this is very difficult to achieve, from a design point of view - capture everything - to a governance point of view - who is going to own this and define what an object is - to a technical point of view - very complex objects, different versions etc.

The "superset" he is talking about is merely a metamodel of the data that point-to-point connections would require. The canonical data model is a federated collection of local metamodels including the definition of the common semantics and the format transformation rules. It need not be "more" than you need and it does not contain any stored application data.

To enable loose coupling a layer of indirection is defined in terms of a global data space, a canonical data model and canonical messages. This enables the mapping of semantics and transformation of formats between mutually unknown (decoupled) endpoints.

A good way to understand the mechanism is to view the canonical messages as the formally defined carriers of specific information throughout the enterprise. Data providers (sending endpoints) fill the appropriate canonical message using the metadata defined in the canonical data model. Data consumers (receiving endpoints) consume the data from this canonical message, also using the metadata defined in the canonical data model. In this way the endpoints don't need to have any knowledge of each other.

The endpoints don't even need to know the canonical data model. Services delivered by the infrastructure (global data space), which has knowledge of the canonical data model, will take care of loading the data delivered by an endpoint into the appropriate canonical message (carrier) and unload the data from the canonical message to be consumed by the receiving endpoint. The endpoints only use their own formats and are totally decoupled.
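
A minimal sketch of this loading and unloading, with hypothetical formats and field names:

    def load_carrier(local_msg: dict) -> dict:
        """Infrastructure service: provider's own format -> canonical carrier."""
        return {"customerName": local_msg["Custnm"], "newAddress": local_msg["Addr"]}

    def unload_carrier(carrier: dict) -> dict:
        """Infrastructure service: canonical carrier -> consumer's own format."""
        return {"CustomerName": carrier["customerName"], "Address": carrier["newAddress"]}

    provider_msg = {"Custnm": "Jansen", "Addr": "Stationsweg 1"}  # provider's format
    consumer_msg = unload_carrier(load_carrier(provider_msg))     # consumer's format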

You might recognize that in fact - from a software architecture perspective - the canonical data model is the incarnation of loose coupling.


Indeed it is true that this addresses a governance aspect that nowadays is not represented very strongly in most IT organizations. If you want to reach the next level of IT maturity based on the ideas of SOA and EDA, it is a prerequisite to extend your governance with regard to formal semantics and format definitions as well.

Conclusion


The idea of the canonical data model is to define the semantics and formats from the local endpoint perspectives. To be able to map the endpoint interfaces in a loose coupling context (endpoints do not know each other), an intermediate mediation layer needs to be in place. The canonical data model is the underpinning facility that allows for the mapping of the distinct local semantics and the transformation of the distinct local formats between decoupled and independent endpoints.

So yes, it is right that maturing your software architectures requires maturing the required governance: loose coupling comes at the price of tighter governance. On the other hand, evolving SOA governance tools are coming to help.

Sunday, March 09, 2008

Canonical Data Model visualized

This animation perfectly shows the principles and benefits of a Canonical Data Model.





Thursday, April 19, 2007

How to mediate semantics in an EDA



Sharing semantics

Systems that pass data to each other share commonly understood semantics. Explicit data semantics is the key to success in an EDA (and any other messaging system). In striving for loose coupling, data semantics is the ultimate level; when systems are decoupled at the semantic level - i.e. they don't share semantics - the coupling becomes useless, because in this case the systems will not be able to communicate at a logical level. Shared semantics is a prerequisite for connecting distinct systems, no matter whether it concerns EDA, SOA or any other form of EAI (Enterprise Application Integration). It should be obvious to anyone that analysis of data semantics will always be the first activity of any integration project.

While distinct systems may share semantics, they do not share the formats that express those semantics. Think of different date formats, or amounts (semantics: the balance of a bank account) expressed in different currencies. Or think of different identifiers: CustomerName versus Custnm. The same semantics is expressed in different formats.

Mechanisms must be in place to harmonize these different formats, which carry semantically identical data from one environment to another. This pattern describes such a mechanism, based on intermediate canonical formats for semantics representation.

Canonical Data Model (CDM)

In an EDA a business event is represented in a canonical format (representation) with unambiguous semantics. This format and these semantics are defined as canonical message types in the enterprise's Canonical Data Model. These messages are the core of the event-driven architecture and are valuable business assets that must be treated as such with regard to protection.

A message type may be published by several source systems in several environments, and several target systems in several environments may consume the same message. The environments that send and receive these messages don't need to know the canonical format. Every environment communicates in its own local format with the messaging system (typically the Global Data Space, implemented by an Enterprise Service Bus). A prerequisite is that every participating environment has its local data formats and semantics defined in the CDM. The messaging system provides services to transform the local format to the canonical format and vice versa. These services depend on the CDM.

There will always be a transformation from the local format at the sending side to the canonical format, and there will always be a transformation from the canonical format to the local format at the receiving side. Even if the local format is identical to the canonical format, a transformation will still be implemented. Such a null-transformation makes the mechanism generic and more agile when changes occur.
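
In code, a null-transformation is simply the identity function, registered like any other transformation rule; a sketch:

    def null_transform(msg: dict) -> dict:
        """Local format equals the canonical format, so the rule is the identity.
        Registering it anyway keeps every endpoint behind the same generic
        mechanism; if the formats later diverge, only this one rule changes."""
        return dict(msg)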

At design time, the definition of the format transformations is not the only thing that must be accomplished. First of all, correctly mapping the corresponding semantics from the local formats to the canonical formats is of utmost importance; format transformation is the second step. Semantics mapping is vital to the success of the system, so defining semantics and recording these descriptions in the CDM is not an option but a must if you want to succeed with EDA, SOA or EAI.

CDM: not a common data model

The CDM is not a storage component, but a metadata component. The CDM holds the definitions of the local formats and semantics of the participating systems, as well as the definitions of the canonical formats and semantics. No persistent processing data, in either local or canonical format, is stored in the CDM. Nor does the CDM provide a common data model that everyone has to adhere to. Such a common model is no longer appropriate, since we buy systems from the marketplace with their own data models, and since we connect many systems from a variety of environments, old and new, sometimes in a B2B context or inherited from mergers with other companies, each with their own data models, formats and semantics.

The pattern described here doesn't burden the system owners with constraints on data models, formats and semantics. Everybody can use their own models, formats and semantics. Transformation services support the transformations of the shared semantics to and from canonical formats, using the definitions in the CDM. The sending and receiving systems are completely unaware of this; they talk in their own language.

Enrichment and translation algorithms may be part of the transformation services. This applies to different data representations with the same semantics, but it also applies to conversions of different but deducible semantics.

Example of data representation transformation

Two systems share the semantics for "railway station"; they both interpret the meaning of this entity type in the same way: a railway station involves a location and platforms and is owned by Dutch Railways; the rails are not part of it. However, one system uses alphabetic characters to identify a railway station, while the other system uses numeric characters to identify the same set of railway stations. So one system identifies railway station Oudenbosch by "A" and the other by "01". The canonical format uses yet another set of identifying characters: alphanumeric. The transformation services must have knowledge of all railway station identifier sets and how they correlate. A persistent data set (e.g. a database) underlies the resolving algorithm of the transformation service.
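
A minimal sketch of such a resolving algorithm, with an in-memory table standing in for the persistent data set (the identifiers are made up):

    # Correlation table per system: local id -> canonical (alphanumeric) id.
    # In reality this would be a persistent data set, e.g. a database table.
    TO_CANONICAL = {
        "system_a": {"A": "OB1"},   # alphabetic ids
        "system_b": {"01": "OB1"},  # numeric ids
    }
    FROM_CANONICAL = {
        system: {canonical: local for local, canonical in mapping.items()}
        for system, mapping in TO_CANONICAL.items()
    }

    def station_to_canonical(system: str, local_id: str) -> str:
        return TO_CANONICAL[system][local_id]        # ("system_a", "A") -> "OB1"

    def station_from_canonical(system: str, canonical_id: str) -> str:
        return FROM_CANONICAL[system][canonical_id]  # ("system_b", "OB1") -> "01"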

In practice, a case like this example may be rather complicated; think of how to keep the intermediate data set up-to-date if the connected systems may autonomously add new railway stations (or worse: change the railway station IDs).

On the other hand, there are also very easy translations, like translations between date formats or between miles and kilometers.

Note that all of these data representation translations can be bi-directional.
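
These easy translations need no supporting data set; each direction is simply the inverse of the other. A sketch:

    from datetime import datetime

    def miles_to_km(miles: float) -> float:
        return miles * 1.609344

    def km_to_miles(km: float) -> float:
        return km / 1.609344

    # Date formats work the same way: parse in one format, render in the other.
    def us_to_iso_date(date_str: str) -> str:
        return datetime.strptime(date_str, "%m/%d/%Y").strftime("%Y-%m-%d")

    def iso_to_us_date(date_str: str) -> str:
        return datetime.strptime(date_str, "%Y-%m-%d").strftime("%m/%d/%Y")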

Example of correlating semantics conversion

In some cases it is possible to convert one semantics to another. Of course this is only possible if one semantics embodies the other in some deducible way. Let's look at a strongly simplified example of a purchase order.
The canonical format of a purchase order consists of an order number with a set of order lines, each with a part number, a quantity and the price of the concerning part on that line. The consuming system understands a purchase order as an order number and a total order amount. The transformation service multiplies the quantities by the prices and sums the resulting amounts.
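
A minimal sketch of this transformation; the field names are hypothetical:

    def order_total(canonical_order: dict) -> dict:
        """Canonical order (lines with quantities and prices) -> the consumer's
        view (order number and total amount). Note the inverse is impossible:
        the order lines cannot be recovered from the total."""
        total = sum(line["quantity"] * line["price"] for line in canonical_order["lines"])
        return {"orderNumber": canonical_order["orderNumber"], "totalAmount": total}

    order = {"orderNumber": "PO-7", "lines": [
        {"part": "bolt", "quantity": 100, "price": 0.25},
        {"part": "nut",  "quantity": 100, "price": 0.50},
    ]}
    # order_total(order) -> {"orderNumber": "PO-7", "totalAmount": 75.0}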

You might argue that this example doesn't mention two semantics, but a different representation of only one semantics. You are right; it is ambiguous. On the other hand, the canonical format holds more data about the order than the consuming system does. So how can the semantics be the same?

This is a simple example. In practice you may come across very complicated situations, where multiple complex data structures and complex algorithms are involved.

Note that the conversion can only take place in one direction.

Why?
  • Using canonical message types decouples systems at the level of message formats. Systems don't have to make assumptions about, or rely on, other systems' data formats. This is an important aspect in striving for loose coupling.
  • Defining canonical message formats creates the opportunity to supply the company with an unambiguous catalog of available messages about business events, representing valuable business assets. The business events in this catalog are independent of the sources that generate these messages. Based on this catalog, policies can be implemented with regard to ownership and the degree of free availability of data that is exchanged between domains. New business models may pop up with regard to data exchange: the catalog may contain rates associated with messages about business events, and publishing data about business events may be marketed, with suppliers getting paid by consumers for the published data. The IT department delivers the marketplace (infrastructure) and may play a role as a business events broker.
  • In a technical sense this pattern has the benefit that only one transformation service per message type has to be configured at each endpoint. A subscriber needs to subscribe to only one message type, regardless of whether there are multiple sources.
  • If transformations took place directly between local formats (skipping the intermediate canonical format), transformation services would have to be created for every source-target combination. This would lead to a higher management and maintenance burden. Consumers would have to subscribe separately to every source of a particular message type and would consequently need to know of the existence of these distinct sources.

  • Without an intermediate canonical format, a format change at the publisher's side must be followed by changing all the transformations to the subscribers. Using an intermediate canonical format makes the transformations to the subscribers independent of changes at the publisher's side.
  • Without canonical formats for semantics representation, semantics would be represented in multiple equivalent formats. This obstructs the ability to supply the company with an unambiguous catalog of business events independent of their sources. The lack of canonical formats will also cause system designs and the resulting systems to be more complex and harder to change.