Thursday, March 18, 2010

appengine features

Currently an appengine entity is a bigtable/datastore row containing the serialized entity, encoded with google's protocol buffers. Every property is indexed whether or not it will ever be queried. An entity reference points from one row to another, so anything that is referenced must be a full entity with its own row, which makes entities heavyweight.

Existing options for representing a Contact:
1a) Contact entity w/ ListProperty's: addresses, emails, etc.
1b) Contact entity w/ address1, address2, address3, email1, email2, email3, etc. properties.
2a) Contact entity w/ addresses property listing keys of Address entities. This is a workaround because lists of ReferenceProperty are unsupported.

I think it would be useful to have lightweight entities which can be embedded in the classic entities. This would enable more options for Contact:
2b) Contact entity w/ addresses property listing NestedReferenceProperty referencing Address nested entities.
2c) Contact entity containing Address nested entities--query by ancestor.
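
To make the difference between 2a and 2c concrete, here is a minimal plain-Python sketch. It is not appengine code: the class names are hypothetical, and a dict stands in for the datastore (one dict item per row). It only illustrates how many rows each option writes and where the addresses live.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Address:
    street: str
    city: str


@dataclass
class ContactWithKeys:
    """Option 2a: the contact stores only keys; each address is its own row."""
    name: str
    address_keys: List[str] = field(default_factory=list)


@dataclass
class ContactWithNested:
    """Option 2c: the addresses live inside the contact's own row."""
    name: str
    addresses: List[Address] = field(default_factory=list)


# Hypothetical stand-in for the datastore: key -> row.
datastore: Dict[str, object] = {}

# 2a: three rows are written, and reading the addresses needs extra lookups.
datastore["addr1"] = Address("1 Main St", "Springfield")
datastore["addr2"] = Address("2 Oak Ave", "Shelbyville")
datastore["contact_a"] = ContactWithKeys("Ann", ["addr1", "addr2"])
rows_2a = len(datastore)
cities_2a = [datastore[k].city for k in datastore["contact_a"].address_keys]

# 2c: one row is written, and the addresses arrive together with the contact.
datastore.clear()
datastore["contact_c"] = ContactWithNested(
    "Ann",
    [Address("1 Main St", "Springfield"), Address("2 Oak Ave", "Shelbyville")],
)
rows_2c = len(datastore)
cities_2c = [a.city for a in datastore["contact_c"].addresses]
```

Both variants recover the same cities, but 2a touches three rows where 2c touches one.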

Protocol buffers support nesting & repeating but not arbitrary graphs. Therefore, an implementation based on protocol buffers would either have to restrict nested entities to tree shapes or find some other way to represent arbitrary graphs.
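
The tree-versus-graph limitation can be illustrated with a small sketch. It uses Python's json module purely as a stand-in for any nested encoding; it is not protocol buffers, and the contact and address dicts are made up for illustration. An address shared by two contacts is written once per parent, so the sharing is lost on the round trip.

```python
import json

# One address object shared by two contacts: in memory this is a graph.
shared = {"street": "1 Main St", "city": "Springfield"}
ann = {"name": "Ann", "addresses": [shared]}
bob = {"name": "Bob", "addresses": [shared]}
shared_before = ann["addresses"][0] is bob["addresses"][0]

# A nested encoding copies the address into each parent: two separate trees.
ann2 = json.loads(json.dumps(ann))
bob2 = json.loads(json.dumps(bob))
shared_after = ann2["addresses"][0] is bob2["addresses"][0]

# After the round trip, editing Ann's copy no longer affects Bob's.
ann2["addresses"][0]["city"] = "Shelbyville"
bob_city = bob2["addresses"][0]["city"]
```

An implementation that wants shared nested entities would need something extra on top of plain nesting, such as the NestedReferenceProperty idea in 2b.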

Possible API additions:
  • constructor: new Model(nest=parent)
  • property type: NestedReferenceProperty; supported in lists.
  • custom index support.
It would speed entity update & reduce space if properties could be excluded from indexing.

entity extraction & NLP APIs

A wikipedia article on entity extraction. A useful survey of actual use of APIs.

opencalais is from Reuters.
$2,000 per month / (100K queries * 30 days) ≈ .067 cents per query
One must prepay $24K for the year. The daily cap is 100K queries & 20/sec.
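
The arithmetic can be checked with a few lines of Python. This is just a sketch using the numbers quoted above; it assumes a 30-day month and that the full daily cap is used every day, so the real per-query cost would be higher at lower volume.

```python
MONTHLY_FEE_USD = 2_000      # $24K per year, prepaid
QUERIES_PER_DAY = 100_000    # daily cap
DAYS_PER_MONTH = 30          # assumption: 30-day month

queries_per_month = QUERIES_PER_DAY * DAYS_PER_MONTH
cents_per_query = MONTHLY_FEE_USD * 100 / queries_per_month
annual_prepay_usd = MONTHLY_FEE_USD * 12

print(f"{cents_per_query:.4f} cents per query")   # 0.0667 cents per query
print(f"${annual_prepay_usd:,} per year")         # $24,000 per year
```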
Tried the firefox addon. For the ag2.0 agenda, nba.yahoo.com, & IGN.com, the results are poor.

gate is from U of Sheffield. It is more focused on NLP or language engineering.