3.0 Activities Design

From AlfrescoWiki

DRAFT/WIP

Activities High-level Design Approach

The following provides a high-level design approach to support the 3.0 Activities Requirements.

Split & Parallelize

  • keep processing as light as possible, but allow for the pre-calculation of user feeds to be split across 'n' CPUs
    • maintain a running list of activity posts (continuously added to)
  • background "feed" job activates on a regular cycle
    • evenly allocates posts to 'n' "feed" tasks (one per CPU)
    • each "feed" task processes its allocation of posts and generates activities for relevant users/sites - posts are marked as processed
  • background cleaner jobs activate on regular cycles (or during the night)
    • "feed cleaner" job removes feeds that are out of date and/or possibly user/site feeds that are greater than a system max size
    • "post cleaner" job removes posts that are processed - could be kept for period of time to aid debug and/or troubleshooting
  • background "post lookup" job activates on a regular cycle
    • to provide secondary lookup of activity data for a well-defined entity (eg. node ref)

Data Schema

Activity Post

sequence id (pk)
posting userid
site network
app tool
post date
activity type
activity data
job task node
status
last modified
  • "sequence id" is an incrementing sequence used for limiting the posts processed by a task (while new posts are continuously added)
  • "posting userid" originating user who posted this activity
  • "site network" site id context
  • "app tool" app id context
    • NOTE: if not the site name then may need the site name in the activity data to generate certain feed views
  • "post date" date+time when activity raised (posted)
  • "activity type" named type
  • "activity data" JSON format, so that it can be converted to Freemarker model - in order to apply activity templates
  • "job task node" node hash - is used to partition and allocate posts to a job task node
    • calculated as mod("posting userid".hash(), no. of task CPUs)
    • storage may also be partitioned by job task (e.g. sql table partition)
      • NOTE: this is pre-calculated on post to simplify the query and improve its performance. However, the actual number of CPUs may vary if task nodes are added, removed or die during the post period. This is acceptable: it may mean that some CPUs are not used or, at worst, that some CPUs have more work than others. The number of available CPUs can be retrieved at the end of each job cycle.
  • "status" a post is either posted immediately or held pending an additional lookup; once posted and processed it is eligible for cleanup - transitions are (PENDING ->) POSTED -> PROCESSED
  • "last modified" for debug/troubleshooting only, set to post date when inserted, then updated when status changes
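
The "job task node" calculation above could be sketched as follows. This is a minimal illustration, not the implementation: the function name is hypothetical, and zlib.crc32 is only a stand-in for the unspecified hash() (Python's built-in hash() for strings differs between processes, so it would not give a stable partition).

```python
import zlib


def job_task_node(posting_userid: str, task_count: int) -> int:
    """Return the partition (job task node) that a post is allocated to."""
    if task_count < 1:
        raise ValueError("task_count must be at least 1")
    # crc32 is an assumed stand-in for hash(); it is stable across processes,
    # so the same user always maps to the same task node for a given count.
    return zlib.crc32(posting_userid.encode("utf-8")) % task_count
```

Because the value is stored with the post, a later change in the number of task CPUs only skews the allocation, as the NOTE above describes; it does not invalidate existing rows.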

Activity Feed

feed userid
posting userid
site network
app tool
post date
activity type
activity summary
activity format
id (pk)
feed date
post id
  • "feed userid" may be used to partition to support parallel user feed queries, userid can also be a site id (for site activities feed)
  • "posting userid" originating user who posted this activity
  • "site network" site id context
  • "app tool" app id context
  • "post date" date+time of activity
  • "activity type" named type
  • "activity summary" generated activity summary, can also be pass-through of JSON activity data
  • "activity format" format of activity summary, eg. atom, html, json ...
  • "id" DB-generated PK, for debug/troubleshooting only
  • "feed date" for debug/troubleshooting only - date+time when feed generated, as opposed to post date
  • "post id" for debug/troubleshooting only - not a FK, can dangle when posts are cleaned, might be used to implement re-generate

Activity Feed Control

feed userid
site network
app tool
last modified
  • "feed userid" feed user can have zero or more opt-out feed controls
    • NOTE: userid can also be a site id (feed controls for site activities feed, set by a site admin)
  • "site network" site id - if set, opt out for this site
  • "app tool" app id - if set, opt out for this app tool
    • NOTE: can combine with site - ie. opt-out of app tool for given site
  • "last modified" for debug/troubleshooting only, set when inserted

NOTE: in future release, could add "activity type", "posting userid" etc
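
As a sketch of how the opt-out controls might be evaluated: each control is treated as a (site network, app tool) pair in which an unset field matches anything. The function name and the treatment of a control with both fields unset (opt out of everything) are assumptions, not part of the design.

```python
from typing import Iterable, Optional, Tuple

# (site network, app tool); None means "not set" for that field.
FeedControl = Tuple[Optional[str], Optional[str]]


def is_accepted(site_network: str, app_tool: str,
                controls: Iterable[FeedControl]) -> bool:
    """Return False if any opt-out control matches the activity."""
    for ctl_site, ctl_app in controls:
        site_matches = ctl_site is None or ctl_site == site_network
        app_matches = ctl_app is None or ctl_app == app_tool
        if site_matches and app_matches:
            return False  # opted out of this site and/or app tool
    return True
```

For example, a control of ("projx", None) rejects every activity from site "projx", while ("projx", "blog") rejects only blog activities in that site.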

Posting an Activity

  • creation of an activity post - fast (eg. insert row), possibly asynchronous?? - handle tx error/rollback
  • thread pool?
  • only post in accordance with posting user privacy controls
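
A minimal sketch of the "insert row" step, using an in-process SQLite table as a stand-in for the real store. The table and column names (underscored versions of the Activity Post fields) and the function name are assumptions; privacy checks and asynchronous posting are left out.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Assumed physical layout of the Activity Post schema described earlier.
SCHEMA = """
CREATE TABLE activity_post (
    sequence_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    posting_userid TEXT NOT NULL,
    site_network   TEXT,
    app_tool       TEXT,
    post_date      TEXT NOT NULL,
    activity_type  TEXT NOT NULL,
    activity_data  TEXT NOT NULL,
    job_task_node  INTEGER NOT NULL,
    status         TEXT NOT NULL,
    last_modified  TEXT NOT NULL
)
"""


def post_activity(conn, userid, site, app_tool, activity_type, data,
                  job_task_node, pending=False):
    """Insert one activity post and return its sequence id."""
    now = datetime.now(timezone.utc).isoformat()
    status = "PENDING" if pending else "POSTED"
    # "with conn" commits on success and rolls back if the insert raises.
    with conn:
        cur = conn.execute(
            "INSERT INTO activity_post (posting_userid, site_network, app_tool,"
            " post_date, activity_type, activity_data, job_task_node, status,"
            " last_modified) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (userid, site, app_tool, now, activity_type, json.dumps(data),
             job_task_node, status, now))
    return cur.lastrowid
```

"last modified" is set to the post date on insert, as described for the schema; a post that still needs a secondary lookup would be inserted with pending=True.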

Feed Generator

Feed Job

  • simple task scheduler - job initiator
  • scheduled job - eg. run every X minutes (if not already busy) - should probably be less than 10 minutes, to keep feeds reasonably up-to-date
select max(sequence id), job task node from activity post group by job task node
for each job task node
  start feed task(job task node, max(sequence id))
end for
  • tuning parameters
    • frequency of job cycle
    • number of posts processed by each task (NOTE: this throttles the processing, but may result in a growing list of posts - bad - need more CPUs in that case)
  • cluster-aware, to avoid contention
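
One possible shape for a single job cycle, sketched with a local thread pool in place of a processing grid. The table and column names are assumed, feed_task is a hypothetical callable taking (job task node, max sequence id), and scheduling, throttling and cluster-awareness are not shown.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def run_feed_job(conn, feed_task, max_workers=4):
    """Start one feed task per job task node and collect the results."""
    # Snapshot the highest sequence id per partition, so that posts added
    # while the tasks run are left for the next cycle.
    rows = conn.execute(
        "SELECT job_task_node, MAX(sequence_id) FROM activity_post "
        "GROUP BY job_task_node ORDER BY job_task_node").fetchall()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(feed_task, node, max_seq)
                   for node, max_seq in rows]
        return [future.result() for future in futures]
```

In this sketch the database connection is only used on the calling thread; each task would be expected to open its own connection.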

Feed Task

  • simple activity generator
get activities
- select posting userid, site network, app tool, activity type, activity data, post date
- from activity post
- where job task node = [job task id]
- and sequence id <= [job max sequence id]
- and status = 'POSTED'
- order by posting userid

for each activity post
  
  get posting user connections for posting user and site network (repository callback)
  get activity type summary templates (repository callback)
  
  add activity site network to the set of user connections (for site activities feed)
  for each user connection
    
    get user connection feed controls
    - select site network, app tool
    - from activity feed control
    - where feed userid = [user connection]
    
    if accepting activity (not in opt-out list)
      for activity type summary template
        render activity summary and create activity feed entry
      end for
    end if
  end for
end for
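
The loop above might be sketched like this. The three callbacks are hypothetical stand-ins for the repository callbacks and the feed control query, the dictionary keys are assumed names, and string.Template is used only as a placeholder for the FreeMarker engine.

```python
import json
from string import Template


def run_feed_task(posts, get_connections, get_templates, get_feed_controls):
    """Generate feed entries for an allocation of activity posts."""
    feed_entries = []
    for post in posts:
        model = json.loads(post["activity_data"])
        recipients = set(get_connections(post["posting_userid"],
                                         post["site_network"]))
        recipients.add(post["site_network"])  # site activities feed
        templates = get_templates(post["activity_type"])  # {format: text}
        for feed_userid in sorted(recipients):
            # A control matches when each of its set fields equals the post's.
            opted_out = any(
                (site is None or site == post["site_network"])
                and (app is None or app == post["app_tool"])
                for site, app in get_feed_controls(feed_userid))
            if opted_out:
                continue
            for fmt, text in templates.items():
                feed_entries.append({
                    "feed_userid": feed_userid,
                    "posting_userid": post["posting_userid"],
                    "site_network": post["site_network"],
                    "app_tool": post["app_tool"],
                    "post_date": post["post_date"],
                    "activity_type": post["activity_type"],
                    "activity_summary": Template(text).substitute(model),
                    "activity_format": fmt,
                })
    return feed_entries
```

The sketch returns the entries rather than inserting them, so the write to the activity feed table (and any marking of posts as processed) stays with the caller.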

Feed Retriever

  • simple feed retriever to get personal activities feed
  • can also be optionally parameterised - eg. by site network, posting userid, activity type ...
  • retrieve user activities feed
select *
from activity feed
where feed userid = [userid]
and posting userid != [userid]
order by post date desc

NOTE: userid can also be a siteid, for a site activities feed (TODO - confirm permitted max len for userid and/or siteid)
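
A sketch of the retrieval query with one optional parameter, run against SQLite for illustration. The table and column names are assumed, and the limit argument is an addition of this sketch rather than part of the design.

```python
import sqlite3


def get_user_feed(conn, userid, site_network=None, limit=100):
    """Return feed rows for a user (or site), newest first."""
    sql = ("SELECT feed_userid, posting_userid, site_network, post_date,"
           " activity_summary FROM activity_feed"
           " WHERE feed_userid = ? AND posting_userid != ?")
    params = [userid, userid]
    if site_network is not None:
        # Optional filter; other parameters could be added the same way.
        sql += " AND site_network = ?"
        params.append(site_network)
    sql += " ORDER BY post_date DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()
```

Bound parameters are used throughout, so user-supplied ids never end up in the SQL text.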

Feed Cleaner

  • simple feed cleaner to clean (transient) activity feed entries
  • scheduled job - eg. run every X hours or at a particular time during the night
  • maximum age depends on business requirements, usage, available storage - eg. Y weeks
delete from activity feed
where post date < [keep date]
  • keep date = time now - maximum age
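
The keep date calculation and delete could look like the following sketch. It assumes an activity_feed table whose post_date column holds ISO 8601 text in a single consistent format, so that string comparison matches date order; the function name is hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta


def clean_feed(conn, max_age: timedelta, now: datetime) -> int:
    """Delete feed entries older than max_age; return the number removed."""
    keep_date = now - max_age
    # Commit on success, roll back if the delete fails.
    with conn:
        cur = conn.execute(
            "DELETE FROM activity_feed WHERE post_date < ?",
            (keep_date.isoformat(),))
    return cur.rowcount
```

Passing "now" in explicitly keeps the cleaner easy to test; a scheduled job would supply the current time and a configured maximum age such as timedelta(weeks=2).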

Post Cleaner

  • simple post cleaner to clean (transient) activity posts that have been processed (ie. feed has been generated)
  • scheduled job - eg. every M hours or at a particular time during the night
  • maximum age depends on debug requirements, usage, available storage - eg. N hours
delete from activity post
where post date < [keep date]
and status = 'PROCESSED'
  • keep date = time now - maximum age

Post Lookup

  • simple secondary lookup for posts that are pending additional activity data
  • scheduled job - eg. every X minutes (interval should be shorter than that of the Feed Job)
select sequence id, activity data
from activity post
where status = 'PENDING'
for each activity post
  get node ref from pending activity
  get additional activity data for a pending node ref (repository callback)
  update activity post set activity data = [updated activity data], status = 'POSTED'
  where sequence id = [pending sequence id]
end for
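
A sketch of the lookup pass, under several assumptions: the table and column names, a "nodeRef" key in the pending activity data, and a hypothetical lookup_node_data callable standing in for the repository callback. The status written after the lookup is a parameter; it defaults to POSTED, following the (PENDING ->) POSTED -> PROCESSED transitions given for the Activity Post status.

```python
import json
import sqlite3


def run_post_lookup(conn, lookup_node_data, next_status="POSTED"):
    """Fill in activity data for pending posts; return how many were updated."""
    pending = conn.execute(
        "SELECT sequence_id, activity_data FROM activity_post "
        "WHERE status = 'PENDING'").fetchall()
    for sequence_id, raw in pending:
        data = json.loads(raw)
        # Merge the additional data returned for this node ref.
        data.update(lookup_node_data(data["nodeRef"]))
        with conn:  # one transaction per post
            conn.execute(
                "UPDATE activity_post SET activity_data = ?, status = ? "
                "WHERE sequence_id = ?",
                (json.dumps(data), next_status, sequence_id))
    return len(pending)
```

Updating one post per transaction means a failed lookup leaves the remaining posts pending for the next cycle rather than rolling back the whole batch.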

Alfresco Repository Callbacks

  • repository callbacks are specifically designed for feed task interaction (not public APIs)
  • require heavy caching (otherwise could be bottleneck for parallelized tasks)
  • HTTP API (perhaps with web cache on repository or feed task node)
  • get additional activity data for a pending activity post(node ref)
    • return additional activity data given a unique reference (eg. nodeRef)
  • get posting user connections(userid, network context)
    • returns distinct (direct friends + network members)
  • get activity type summary templates(activity type)
    • returns template text for each registered template

Activity Type Templates

Activity Type Summary Templates:

  • template engine
    • FreeMarker
    • will be distributed to the grid nodes, if running on a grid
    • can access data model; repository root objects are not available
  • data model
    • activity data converted from JSON to Freemarker Data Model, plus:
    • userId (of posting user, eg. jsmith123)
    • firstName (of posting user, eg. John)
    • lastName (of posting user, eg. Smith)
    • name (name of item/node, eg. my spreadsheet.xls)
    • siteNetwork (site short name. eg. Project X)
    • xmldate(date) (posting date in ISO8601 format, eg. )
    • displayPath (TBC - folder path of node, if applicable, eg.)
    • nodeRef (if applicable, eg. workspace://SpacesStore/303e8e2c-1b81-11dd-8d14-97d3290659e2)
    • activityType (activity type id, eg. org.alfresco.member.joined)
    • repoEndPoint (repository endpoint, eg. http://123.456.8.9:8080/alfresco)
  • template name
    • <base activity type>.<format>.ftl, eg. create.atomentry.ftl
  • stored in data dictionary
    • Company Home -> Data Dictionary -> Activity Templates
    • stored in namespace hierarchy, eg. reserved namespace is org.alfresco which maps to Company Home -> Data Dictionary -> Activity Templates -> org -> alfresco
  • simple fallback mechanism
    • if 'org.alfresco.folder.create' does not exist then will fall back to 'org.alfresco.create', or if this does not exist then to 'org.alfresco.generic'
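
The fallback order in the example above can be expressed as a short sketch. The rule used here (try the full type, then the last segment directly under the namespace, then "generic") is inferred from that single example, and the function names and the exists predicate are hypothetical.

```python
def fallback_chain(activity_type, namespace="org.alfresco"):
    """Return the activity type ids to try, most specific first."""
    base = activity_type.rsplit(".", 1)[-1]
    chain = []
    for candidate in (activity_type,
                      f"{namespace}.{base}",
                      f"{namespace}.generic"):
        if candidate not in chain:  # skip duplicates for short type ids
            chain.append(candidate)
    return chain


def resolve_template(activity_type, exists, namespace="org.alfresco"):
    """Return the first candidate for which a template exists, else None."""
    for candidate in fallback_chain(activity_type, namespace):
        if exists(candidate):
            return candidate
    return None
```

Here exists would check the template store (for example the Activity Templates folder) for the given type and format.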

Implementation Choices

Audit Trail

Discarded since the amount of re-use is likely to be minimal at this level. Also the detailed requirements/use-cases for fine-grained repository-driven audit events are different to those for coarse-grained application-driven activity events. For example, audit trails are typically required to be more long-lived compared to the more transitory nature of activity feeds.

The current audit mechanism is used by the repository to provide a low-level audit trail, using audit interceptors for public API methods. The audit mechanism also provides a simple interface for applications to set arbitrary custom audit events. In theory, one might consider using this custom audit trail in lieu of the activity posts. However, this would also require enhancements to provide additional features to enable grid-based processing, including a non-Hibernate data access layer, an option to delete audit entries etc.

Hadoop/HBase

Discarded due to where the project is in its lifetime. There are also concerns over the single master node, HBase reliability, and black-box algorithms and data stores. Much more research is required to understand it fully before relying on it in production systems.

DB, Processing Grid

Currently prototyping with MySQL & GridGain and/or JPPF. This gives us much more control over the implementation.

Tasks are distributed, hence, in addition to the actual task context, the associated class dependencies would need to be either installed at each grid node or ideally distributable via a distributed/p2p classloader:

  • third-party libs
    • JDBC driver
    • SQL mapping layer (eg. iBatis)
    • Freemarker engine

A local implementation can optionally be plugged in by default and then re-configured to a grid implementation, as needed.