Document management system for Liferay Portal
Objective:
We want to scale Liferay so that it can manage, in a hierarchical manner, the
documents uploaded through the Documents and Media Library, with Liferay running
in cluster mode and the data kept as easy to maintain as possible.
Different Ways to achieve DMS in Liferay:
Liferay introduces a new Documents and Media Library that can mount several
repositories at a time and present a unified interface to the user. By default,
users work with the Liferay repository, which is already mounted. This
repository is built into Liferay Portal and can use any one of several store
implementations as its back end. In addition, many different kinds of
third-party repositories can be mounted. If you have mounted a separate
repository, all nodes of the cluster point to that repository, and your avenue
for improving performance is to cluster the third-party repository itself,
following the documentation of the repository you have chosen. If you don’t
have a third-party repository, there are ways to configure the Liferay
repository to perform well in a clustered configuration.
The main thing to keep in mind is that every node of the cluster needs the same
access to the file store as every other node. For this reason, you’ll need to
take a look at your store configuration. Below are the stores that Liferay
supports by default; the pros and cons of each are discussed in the sections
that follow.
·
File System Store
·
Advanced File System Store
·
CMIS Store
·
S3 Store (Amazon Simple Storage Service)
·
Documentum Store
·
JCR Store
Now we will look at each store in detail, along with the mechanism used to
configure it with Liferay.
·
File System Store:
This is the default storage mechanism in Liferay. It’s a simple file storage
implementation that uses a local folder to store files. You can use the File
System Store in a clustered environment, but you have to make sure that the
folder the store points to can handle things like concurrent requests and file
locking. For this reason, you need a Storage Area Network (SAN) or a clustered
file system.
Features:
·
The file system store was the first store
created by Liferay and is heavily bound to the Liferay Database.
·
This Store creates a folder structure based on
primary keys in the Liferay database.
Pros:
·
As you can see, this binds your documents very closely to Liferay, which may
not be exactly what you want. However, if you have been using the default
settings for a while and need to migrate your documents, Liferay provides a
migration utility in the Control Panel under Server Administration -> Data
Migration. Using this utility, you can easily move your documents from one
store implementation to another.
Cons:
·
The File System Store is limited by the capacity of the local operating
system’s file storage. The number of files in a particular folder can grow too
large, and the local file system may not have enough space to store them.
·
It’s not a good match for Liferay in cluster mode, because this approach
requires a SAN or a clustered file system, which is expensive to manage.
·
It’s not a good fit when concurrent users write to the same file. The store
does not lock files or synchronize their contents itself; for that it depends
on the underlying storage or another third-party mechanism.
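To make the locking requirement concrete, here is a minimal Python sketch of the kind of exclusive lock a shared store folder has to honor. It is not Liferay code: the function names `acquire_lock` and `release_lock` and the `.lock` suffix are hypothetical, and the example runs against a temporary local directory.

```python
import os
import tempfile

def acquire_lock(target_path):
    """Try to take an exclusive lock for target_path; return True on success."""
    lock_path = target_path + ".lock"  # hypothetical naming convention
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists,
        # so only one writer can create it.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True

def release_lock(target_path):
    """Remove the lock file so that another writer can proceed."""
    os.remove(target_path + ".lock")

# Two writers competing for the same document.
workdir = tempfile.mkdtemp()
doc = os.path.join(workdir, "report.pdf")
first = acquire_lock(doc)    # the first writer gets the lock
second = acquire_lock(doc)   # the second writer is refused
release_lock(doc)
third = acquire_lock(doc)    # the lock is free again
release_lock(doc)
```

On a local disk the exclusive create is atomic. On some network file systems (older NFS versions, for example) that guarantee has historically been weaker, which is why the folder has to sit on storage that handles locking correctly across all nodes.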
·
Advanced File System Store:
The Advanced File System Store is similar to the default File System Store.
Like that store, it saves files to the local file system, which could of
course be a remote file system mount. It uses a slightly different folder
structure to store the files.
Pros:
·
Several operating systems have limitations on the number of files that can be
stored in a particular folder. The Advanced File System Store overcomes this
limitation by programmatically creating a structure that can expand to
millions of files, alphabetically nesting the files in folders. This not only
allows more files to be stored, but also improves performance because fewer
files are stored per folder.
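The nesting idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function `nested_path`, the two-character chunk width, and the `document_library` root are hypothetical, and Liferay’s actual folder layout may differ.

```python
import posixpath

def nested_path(root, file_id, width=2):
    """Build a nested path by splitting file_id into fixed-width chunks.

    Hypothetical layout for illustration only; not Liferay's real scheme.
    """
    name = str(file_id)
    chunks = [name[i:i + width] for i in range(0, len(name), width)]
    # Every chunk except the last becomes one directory level.
    return posixpath.join(root, *chunks[:-1], name)

print(nested_path("document_library", 1234567))
# document_library/12/34/56/1234567
```

With numeric IDs and a width of 2, each directory level holds at most 100 subdirectories, so no single folder grows without bound as the number of files increases.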
Cons:
·
The same rule applies to the Advanced File System Store as to the default File
System Store. To cluster it, you’ll need to point the store to a
network-mounted file system that all the nodes can access, and that networked
file system needs to support concurrent requests and file locking. Otherwise
you may experience data corruption if two users on two different nodes attempt
to write to the same file at the same time.
·
If the Advanced File System Store doesn’t serve your needs, you have to look
at the other options.
·
CMIS Store:
CMIS (Content Management Interoperability Services) is an open standard that
allows different content management systems to interoperate over the Internet.
Specifically, CMIS defines an abstraction layer for controlling diverse
document management systems and repositories using web protocols.
Features:
·
CMIS defines a domain model plus Web Services and RESTful AtomPub bindings
that can be used by applications.
·
CMIS provides a common data model covering typed files and folders with
generic properties that can be read.
·
There is a set of services for adding and retrieving documents.
·
There may be an access control system, a checkout and version control
facility, and the ability to define generic relations.
·
You can communicate with a CMIS repository through two bindings: SOAP (Web
Services described by WSDL) and REST (using the AtomPub conventions).
·
The model is based on the common architectures of document management
systems.
Pros:
·
The CMIS specification provides a web services interface that is
programming-language agnostic (REST and SOAP are implemented in many
languages).
·
The CMIS specification provides a web services interface that decouples the
web service from the content, so CMIS can be used to access an existing
(historic) document repository.
·
The Liferay administrator can mount a clustered CMIS repository through the
UI.
·
It fits a clustered environment well: as long as all Liferay nodes point to
your CMIS repository, everything in your Liferay cluster should be fine,
because the CMIS protocol prevents multiple simultaneous file accesses from
causing data corruption.
·
S3 Store (Amazon Simple Storage Service):
Amazon S3 is a cloud-based storage solution that you can use with Liferay.
Amazon S3 is storage for the Internet, designed to make web-scale computing
easier for developers.
Amazon S3 provides a simple web
services interface that can be used to store and retrieve any amount of data,
at any time, from anywhere on the web. It gives any developer access to the
same highly scalable, reliable, secure, fast, inexpensive infrastructure that
Amazon uses to run its own global network of web sites. The service aims to
maximize benefits of scale and to pass those benefits on to developers.
Features:
·
Write, read, and delete objects containing from
1 byte to 5 terabytes of data each. The number of objects you can store is
unlimited.
·
Each object is stored in a bucket and retrieved
via a unique, developer-assigned key.
·
A bucket can be stored in one of several
Regions. You can choose a Region to optimize for latency, minimize costs, or
address regulatory requirements. Objects stored in a Region never leave the
Region unless you transfer them out.
·
Authentication mechanisms are provided to ensure
that data is kept secure from unauthorized access. Objects can be made private
or public, and rights can be granted to specific users.
·
Options for secure data upload/download and
encryption of data at rest are provided for additional data protection.
·
Uses standards-based REST and SOAP interfaces
designed to work with any Internet-development toolkit.
·
Built to be flexible so that protocol or functional layers can easily be
added. The default download protocol is HTTP. A BitTorrent protocol interface
is provided to lower costs for high-scale distribution.
·
Provides functionality to simplify manageability
of data through its lifetime. Includes options for segregating data by buckets,
monitoring and controlling spend and automatically archiving data to even lower
cost storage options.
Pros:
·
The main advantage is that it is easy to set up with Liferay. When you sign up
for the service, Amazon assigns you unique keys that link you to your account.
In Amazon’s interface, you can create “buckets” of data optimized by region.
Once you’ve created these to your specifications, all you need to do is
declare them in portal-ext.properties:
dl.store.s3.access.key=
dl.store.s3.secret.key=
dl.store.s3.bucket.name=
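Because the three values above are left empty until you fill them in, a quick check before deployment can save a failed startup. The following Python sketch is my own helper, not a Liferay tool: the function names and the sample values are hypothetical, and the parser only handles plain `key=value` lines (no line continuations or `:` separators).

```python
REQUIRED_S3_KEYS = (
    "dl.store.s3.access.key",
    "dl.store.s3.secret.key",
    "dl.store.s3.bucket.name",
)

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def missing_s3_settings(text):
    """Return the required S3 keys that are absent or empty."""
    props = parse_properties(text)
    return [key for key in REQUIRED_S3_KEYS if not props.get(key)]

# Placeholder values for illustration, not real credentials.
sample = """
dl.store.s3.access.key=EXAMPLEKEY
dl.store.s3.secret.key=
dl.store.s3.bucket.name=my-liferay-docs
"""
missing = missing_s3_settings(sample)
print(missing)
# ['dl.store.s3.secret.key']
```

Running the same check against the portal-ext.properties of every node is a cheap way to confirm that all nodes in the cluster point to the same bucket with complete credentials.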
Cons:
·
It’s a proprietary, paid service, so the ongoing cost of maintaining a DMS on
it can be high.
·
Documentum Store:
If you have a Liferay Portal EE license, you have access to the Documentum
hook, which adds support for Documentum to Liferay’s Documents and Media
Library. You have to install it from Liferay Marketplace.
This hook doesn’t add an option to turn the Liferay repository into a
Documentum repository, as the other store implementations do. Instead, it
gives you the ability to mount Documentum repositories through the Documents
and Media Library UI.
Mounting is straightforward. Click Add → Repository and, in the form that
appears, choose Documentum as the repository type. Then give it a name and
specify the Documentum repository and cabinet, and Liferay mounts the
repository for you.
If all your nodes point to a Documentum repository, you can cluster Documentum
to achieve higher performance.
More information is available here:
http://www.liferay.com/marketplace/-/mp/application/15098914
·
JCR Store:
Liferay includes a Content Management System (CMS) that is rich in features,
flexible, and easy to learn. Content management systems, however, have
traditionally each exposed their own proprietary API. The Java Community
Process developed a solution to this trend: JSR-170 and JSR-283, also known as
the Java Content Repository (JCR) API.
The JCR specification provides a unified interface that different vendors can
implement to meet the needs of content management systems. Application
developers, in turn, are saved from learning different proprietary APIs, which
reduces time to market. They only need to learn one API that is compatible
with any JSR 170/283 compliant repository. The framework is not only vendor
neutral; it is also not tied to any particular underlying architecture. The
back-end data storage could be a file system, a WebDAV repository, an
XML-backed system, or an SQL-based database.
In addition to being flexible, a Java Content Repository works like a fusion
of a database and a file system. Among the valuable features of this
combination are:
§
Support for both structured and unstructured
content.
§
Hierarchical design
§
SQL and/or XPath Query
§
Access control
§
Locking
§
Versioning
§
Full-text search
Many JCR-compliant repositories are already available on the market.
Liferay supports the JCR standard as a store. With the default settings, the
JCR Store is not very different from the file system stores, except that you
can use a JCR client to access the files. Any JCR implementation can be used
for this; here we consider the two choices below.
·
Jackrabbit
·
JBoss ModeShape
Let’s look at these two options in detail.
Jackrabbit:
·
Jackrabbit is a complete implementation of the JCR API: Apache Jackrabbit is a
fully conforming implementation of the Content Repository for Java Technology
API (JCR, specified in JSR 170 and JSR 283).
·
A content repository is a hierarchical content store with support for
structured and unstructured content, full-text search, versioning,
transactions, observation, and more.
·
In Liferay, Jackrabbit is the default JCR Store implementation.
·
JSR 170 explicitly allows for numerous deployment models; it is entirely up to
the repository implementation to suggest certain models.
·
Jackrabbit is built to support a variety of deployment models. Two of the
possible ways to deploy Jackrabbit are:
§
Embedded Mode
§
Standalone Mode.
·
In Liferay, Jackrabbit is available in embedded mode in the JCR Store by
default. To enable it, we only have to set some properties in the
portal-ext.properties file and configure Jackrabbit’s repository.xml file. The
basic difference between embedded mode and standalone mode is as follows.
·
In embedded mode, Jackrabbit is initialized automatically when Liferay starts
and destroyed when Liferay stops. In standalone mode, we have to start the
Jackrabbit server and set everything up manually.
·
Because Jackrabbit runs in embedded mode, it works fine with a single Liferay
instance. When Liferay is in cluster mode, however, Jackrabbit also has to be
kept as a shared repository. For this we have to change the following property
in the portal-ext.properties file:
jcr.jackrabbit.repository.root=${liferay.home}/data/jackrabbit
Change this property to point to a shared folder that all the Liferay nodes
can see. If Jackrabbit stays local to each node, keeping the data synchronized
becomes a serious problem, so a single shared repository is kept for all the
Liferay nodes. To do this, create a new configuration file (repository.xml) at
the shared location and point the repository to the new location on every
Liferay node.
·
This configuration has a major drawback, however: because of file locking
issues, it isn’t the best way to share Jackrabbit resources unless you are
using a networked file system that can handle concurrency and file locking. If
two users are logged in at the same time and both try to upload content, you
could encounter data corruption. For this reason we don’t recommend this
configuration.
·
If your Liferay is in cluster mode and you want to use JCR in the cluster, I
recommend redirecting Jackrabbit to a database. You can use Liferay’s database
or any other database of your choice. To do this, change the configuration
file (repository.xml) so that it points to the database.
·
In cluster mode, every node carries its own Jackrabbit configuration with its
own repository.xml file, so you have to make the same change on every node and
point each one to the database of your choice.
·
Once you have configured Jackrabbit to store its repository in a database, the
necessary database tables are created automatically the next time you bring up
Liferay. Jackrabbit does not create indexes on these tables by itself, so you
have to index the primary key columns manually or write your own logic to
create the indexes automatically.
·
Storing the documents in a database is a better approach than a file system
store such as the Advanced File System Store, because you also get the benefit
of clustering.
·
One major advantage is that when you upgrade Liferay from a lower version to a
higher one, Liferay supports upgrading your Jackrabbit configuration as well,
because the Jackrabbit integration is provided by Liferay itself.
JBoss ModeShape:
·
ModeShape is a distributed, hierarchical, transactional, and consistent data
store with support for queries, full-text search, events, versioning,
references, and flexible and dynamic schemas. It is very fast, highly
available, extremely scalable, 100% open source, and written in Java.
·
ModeShape is perfect for data that is organized
in a tree-like hierarchical structure where related data is stored close
together, where navigation to related content is just as common and important as
fast key based lookups or queries. The hierarchical organization is similar to
a file system, making ModeShape a natural choice for storing files annotated with
metadata. ModeShape can even automatically extract the structured information
within the files so that clients can navigate or use typed queries to find
files satisfying complex, structurally-oriented criteria. ModeShape is an
excellent store for data with a complex schema, since the schema can vary over
the database and evolve over time. ModeShape is the perfect distributed data
store for all kinds of applications, including repositories, content management
systems, historical data services, provisioning and governance systems, and
metadata management systems.
·
ModeShape supports all of the required JCR 2.0 features:
o
Repository acquisition
o
Authentication
o
Reading/navigating
o
Query
o
Export
o
Node type discovery
o
Permissions and capability checking.
·
And most of the JCR 2.0 optional features:
o
Writing
o
Import
o
Observation
o
Workspace management
o
Versioning
o
Locking
o
Node type management
o
Same-name siblings
o
Orderable child nodes
o
Shareable nodes
·
ModeShape is an open source implementation of the JCR 2.0 API and thus behaves
like a regular JCR repository. Applications can search, query, navigate,
change, version, listen for changes, etc. ModeShape can store the content in a
variety of back-end stores (including relational databases, Infinispan data
grids, JBoss Cache, etc.), or it can access and update existing content from
“other” kinds of systems (including file systems, SVN repositories, JDBC
database metadata, and other JCR repositories).
·
In ModeShape, the data is most often organized in the following way.
·
Each JCR node contains the following elements:
o
Name, path, and identifier
o
Properties (names and values)
o
Child nodes
o
One or more node types
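The node elements listed above can be modeled with a small Python class. This is a toy stand-in to show how name, path, identifier, properties, children, and node types fit together; it is not the javax.jcr.Node API, and the class `Node`, the helper `resolve`, the identifiers, and the property values are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy stand-in for a JCR node; not the javax.jcr.Node API."""
    name: str
    identifier: str
    node_types: list
    properties: dict = field(default_factory=dict)
    # Children keyed by name, so same-name siblings are not modeled here.
    children: dict = field(default_factory=dict)
    parent: "Node" = field(default=None, repr=False, compare=False)

    def add_child(self, child):
        """Attach child under this node and return it."""
        child.parent = self
        self.children[child.name] = child
        return child

    def path(self):
        """Absolute path, built by walking up to the root."""
        if self.parent is None:
            return "/"
        prefix = self.parent.path()
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name

def resolve(root, path):
    """Walk down from root, following each path segment."""
    node = root
    for segment in filter(None, path.split("/")):
        node = node.children[segment]
    return node

root = Node("", "id-root", ["nt:unstructured"])
docs = root.add_child(Node("documents", "id-1", ["nt:folder"]))
report = docs.add_child(
    Node("report.pdf", "id-2", ["nt:file"], {"jcr:created": "2013-01-01"})
)
found = resolve(root, "/documents/report.pdf")
print(report.path())
# /documents/report.pdf
```

The hierarchy is what makes a JCR repository feel like a file system, while the typed properties on each node are what make it feel like a database.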
·
Features of ModeShape:
o
All data is organized in a hierarchical
tree-like structure of nodes, single- and multi-valued properties, and
children.
o
All data is cached and stored in Infinispan, which can persist data
on the file system, in databases, in the cloud, and even distributed in-memory
across a data grid
o
Cluster to distribute/replicate data across
multiple machines, and even keep most/all of it in-memory to form a data-grid
with extremely fast access (faster than from local disk)
o
Implements the JSR-283 standard Java API for
content repositories (aka, JCR 2.0)
o
Define a schema with node types and mixins that
(optionally) limit the properties and children for various kinds of nodes, and
evolve the schema over time without having to migrate the data.
o
Use multiple query languages, including
SQL-like, XPath, and full-text search languages to find data.
o
Use sessions to create and validate large
amounts of content transiently, and then save all changes with one call.
o
ModeShape can be configured to use and
participate in JTA transactions.
o
Register to be notified with events when data is
changed anywhere in the cluster, optionally filtered by custom criteria.
o
Segregate data into multiple repositories and
workspaces.
o
Embed ModeShape into your Java SE, EE, or web
applications.
o
Install into JBoss AS7 to centrally configure, manage, and monitor the
repositories used by your applications.
·
ModeShape can be embedded into your standalone and web applications, or
installed and run as a service in JBoss AS 5.x or 6.x.
·
ModeShape in a Java EE application:
o
So far we have covered the functionality and features of ModeShape. Now we
turn to how it can be used in a Java EE application and in Liferay Portal.
o
ModeShape makes it easy to use JCR repositories within web and Java EE
applications deployed to virtually any web or application server.
o
ModeShape is small and lightweight enough to embed easily into your own Java
SE applications. The only thing you have to decide is how much control and
management your application needs over the ModeShape repositories. If your
application only needs to look up and use one or more JCR Repository
instances, it can use the JCR API. If it needs more control over dynamically
deploying, monitoring, reconfiguring, and undeploying individual repositories,
it can use the ModeShape-specific API.
o
For the most part, the best way to use ModeShape within a web application
deployed to Tomcat, GlassFish, or another container or application server is
simply to embed it into the web application. At that point, usage is very
similar to ModeShape in a Java SE application.
o
If you have several web apps that share the same ModeShape repositories,
embedding ModeShape into each and using the same configuration files should
work, or you could create a single web app to manage the ModeShape
repositories.
·
ModeShape in a Liferay application:
o
In Liferay we mainly need to integrate ModeShape to manage the DMS (Document
Management System), that is, the documents uploaded through the Documents and
Media Library.
o
There are two ways to integrate ModeShape with Liferay. One is to embed
ModeShape by overriding the basic JCR implementation code (JCR API) available
in Liferay. The other is to run ModeShape in standalone mode and communicate
with it over protocols such as CMIS, the REST API, or WebDAV.
o
If you are running a single Liferay instance without a cluster, it is
advisable to use ModeShape in embedded mode. It is possible to integrate
ModeShape with Tomcat; see the ModeShape documentation for more information.
o
When Liferay is in cluster mode, the following points have to be considered:
§
We can embed ModeShape inside the Liferay portal, but data synchronization
between the nodes is then a big problem.
§
Even when embedded, ModeShape is still clustered in the same way (via
JGroups), and the data is stored in Infinispan (which also needs to be
clustered).
§
ModeShape uses transactions, so concurrent writes are possible without global
write locks (which are a limitation of Jackrabbit). This relies on
Infinispan’s support for and use of transactions; the ModeShape documentation
has more information.
§
ModeShape also serializes conflicting updates: if multiple JCR sessions (in
the same process or spread across your cluster) regularly write to the same
node and save at the same time, those changes are serialized and one of them
blocks the others. Most of the time, however, the sessions update different
nodes, in which case there is no blocking.
§
There are several ways to use ModeShape:
·
If you want to use the JCR API in your application, you need to run ModeShape
within the same process(es). If you’re running regular Java SE applications,
your application instantiates and starts the ModeShapeEngine. If your
applications are deployed to a web server (e.g., Tomcat), you can either embed
ModeShape inside your single web app or have the web server run ModeShape
(e.g., in Tomcat via the "server.xml" file) and have your application(s) look
up the repositories in JNDI. The ModeShape documentation describes how
ModeShape can be clustered.
·
If you want to use the REST API or WebDAV, your application simply accesses
the server using ModeShape’s REST API or the WebDAV protocol. If you’re going
to use CMIS, your application uses the CMIS REST API or a CMIS client
framework. All of these remote protocols are less efficient than using the JCR
API from within the same process, simply because they require network
communication. In addition, ModeShape’s implementations of these network
protocols were built for basic use only, so they do not handle complex
scenarios such as heavy write loads or multiple users writing at the same
time.
§
To configure ModeShape in our application, we have to use the JSON format; the
XML format is not supported yet.
§
ModeShape internally uses Infinispan to store its back-end data.
§
To cluster Liferay with ModeShape as the DMS, we have to provide the ModeShape
configuration in JSON format on every Liferay node, and we can also cluster
ModeShape itself according to our requirements. Because ModeShape uses
Infinispan, and Infinispan supports clustering at different levels, ModeShape
can be clustered through Infinispan.
o
ModeShape can also access external and internal data in exactly the same way,
as if it were all stored in one place. This is what is called federation.
o
Currently ModeShape depends on Infinispan and JGroups.
o
ModeShape also provides a number of connectors out of the box. These are ready
to be used by simply including them in the classpath and configuring them as a
repository source.
·
Summary:
This document is mainly intended as a guide to maintaining Liferay’s DMS
(Document Management System). Several options are available by default, and we
can choose among them based on our requirements. The File System Store, for
example, is the default and is what we normally use.
From a scalability perspective, several other factors have to be considered,
such as clustering, synchronization, versioning, concurrent writes,
performance, authentication and permissions, and the export mechanism.
To improve performance we cluster our Liferay nodes, and then we also have to
think about the configuration of the DMS: because multiple Liferay nodes
communicate with the DMS in different ways, we have to decide which DMS to
use.
·
When Liferay is in cluster mode, the File System Store and the Advanced File
System Store are not recommended, because a SAN is costly.
·
If your application is already integrated with an ECM, I would suggest using
the CMIS Store to communicate with it. Liferay provides easy steps for using
CMIS to connect to the Alfresco ECM.
·
If you have no objection to a proprietary product whose cost for managing the
data can be high, I would suggest the S3 Store, which keeps your data in the
cloud.
·
Documentum support is provided only through a Liferay EE plugin, so it is an
option to consider, but it is also a proprietary product and has limitations
in terms of scalability.
·
If you want to keep costs down and integrate the DMS easily with your
application in cluster mode, I would suggest using Jackrabbit (a JCR
implementation) to store the documents in a database instead of the file
system, because it performs well in a cluster. There are only two pitfalls: we
have to index the database tables manually, and it does not support concurrent
writes in cluster mode. One advantage is that when Liferay is upgraded to a
newer version, Liferay supports upgrading Jackrabbit as well.
·
We can also use ModeShape as the DMS. It is another JCR implementation and
compares well with Jackrabbit, but Liferay itself does not provide it, so we
cannot expect support from Liferay during future upgrades. When our
application is in cluster mode we can run ModeShape in cluster mode as well,
because ModeShape provides the required JCR 2.0 features and most of the
optional ones and, in addition, supports concurrent writes, which Jackrabbit
does not. To use the JCR API, which is more powerful than the other APIs,
ModeShape has to be embedded in your own web server cluster. If you run
ModeShape on a separate cluster of servers, you have to use the REST API,
WebDAV, or CMIS, and none of these is really usable because ModeShape does not
expose all of its functionality through them. If ModeShape still has to run on
a separate cluster, we have to implement our own REST service on top of
ModeShape and deploy it with the ModeShape cluster. The benefit is that you
can size the ModeShape cluster for throughput/load separately from your
application; the disadvantage is the additional network overhead.