$Id$ -*- Text -*-

Python RPKI production tools.

Requires Python 2.5.

External Python packages required:

- lxml, which in turn requires the libxml2 C libraries.

  http://codespeak.net/lxml/

  FreeBSD: /usr/ports/devel/py-lxml

- MySQLdb, which in turn requires MySQL client and server.  I'm
  testing with MySQL 5.1.

  http://sourceforge.net/projects/mysql-python/

  FreeBSD: /usr/ports/databases/py-MySQLdb

- TLSLite, which pulls in other crypto packages.

  http://trevp.net/tlslite/

  FreeBSD: /usr/ports/security/py-tlslite

- Cryptlib, at the moment just to support TLSlite but may end up using
  it for other things later.

  http://www.cs.auckland.ac.nz/~pgut001/cryptlib/

  FreeBSD: /usr/ports/security/cryptlib

  ...but the FreeBSD port doesn't (yet?) install the Python bindings,
  sigh, so at the moment you have to do that by hand:

  # cd /usr/ports/security/cryptlib
  # make install
  # cd work/bindings
  # python setup.py install
  # cd ../..
  # make clean

- Eventually I expect that this will require an event-handling package
  like Twisted, but I'm not there yet.

- The testpoke tool (up-down protocol command line test client) and
  testbed tools also use PyYAML.

  http://pyyaml.org/

  FreeBSD: /usr/ports/devel/py-yaml

We also use a hacked copy of the Python OpenSSL Wrappers (POW)
package, but our copy has enough modifications that it's expanded in
the Subversion tree.  Depending on how this all works out, I may end
up splitting the POW.pkix module out of the POW package and using it
with Cryptlib, as the POW.pkix package is 98% about doing ASN.1 in
pure Python and only 2% about any kind of crypto.



$Revision$

TO DO:

- Scripted tests to grow and shrink and revoke and ....  See
  testbed.*.yaml, but more systematic testing needed.

  PRIORITY: Required

  TIME REQUIRED: open-ended

  STATUS: Ongoing

- Randy's "user validation tool" (fetch and validate certs and
  probably the ROA for a prefix I want to accept in a route filter I
  am building in Python/Perl).  This probably uses rcynic's output as
  one of its inputs.

  This is a basic tool for a sysadmin who wants to -use- all this crud
  we're working so hard to generate.  It's not required for the
  generation tools to work, but without it the entire toolset does
  nothing obviously useful, which will make it a very hard sell during
  the limited public test stage.

  PRIORITY: Required

  DEPENDS ON: ROA generation

  TIME REQUIRED: three days

  STATUS: Not started

- Common protocol dump format with APNIC and other implementors so we
  can read each other's dumps.  "Obvious" format would be an
  OpenSSL-style PEM of the CMS, with a "text" portion (the place where
  "openssl x509 -text" would put a text dump of a cert) showing the
  wrapped XML.

  PRIORITY: Desirable

  TIME REQUIRED: one day

  STATUS: Not started
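
A minimal sketch of what such a dump could look like, assuming the
text portion goes ahead of the PEM block the way "openssl x509 -text"
does it, and assuming the "CMS" PEM label that OpenSSL uses.  The
function names are made up for illustration; nothing here has been
agreed with APNIC or anybody else.

```python
import base64
import textwrap

PEM_BEGIN = "-----BEGIN CMS-----"
PEM_END = "-----END CMS-----"


def format_cms_dump(cms_der, xml_text):
    """Return a dump: the wrapped XML as text, then the PEM-armored CMS."""
    b64 = base64.b64encode(cms_der).decode("ascii")
    # PEM bodies are conventionally wrapped at 64 characters per line.
    body = "\n".join(textwrap.wrap(b64, 64))
    return "%s\n%s\n%s\n%s\n" % (xml_text.rstrip("\n"), PEM_BEGIN, body, PEM_END)


def extract_cms_der(dump_text):
    """Recover the DER bytes from a dump, ignoring the text portion."""
    start = dump_text.index(PEM_BEGIN) + len(PEM_BEGIN)
    stop = dump_text.index(PEM_END, start)
    # Drop the line breaks inside the armor before decoding.
    return base64.b64decode("".join(dump_text[start:stop].split()))
```

A reader that only cares about the signed object can call
extract_cms_der() and skip the text portion entirely, so the text is
purely a convenience for humans.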

- Clean unused cruft out of left-right protocol, or at least have
  control booleans we don't intend to implement at present signal an
  error if used.

  Bottleneck here has been deciding what to punt and what to
  implement.  Removing unused booleans or raising errors when they're
  used is trivial.

  PRIORITY: Required

  TIME REQUIRED: Less than one day

  STATUS: Error signalling done

- Publication protocol and implementation thereof.   Protocol design
  started, Randy had comments that sent me back to the drawing board
  (he was right).  Next step is to integrate Randy's advice, which
  probably means picking up more of the left-right protocol framework.

  Desirable although not strictly required that protocol be agreed upon
  among the RIRs.  Might not be practical given how long it takes
  group to decide anything.

  Tricky bit is making sure that repository receives enough
  information to know whether parent has authorized child to use
  parent's namespace in nesting case.   In theory this is
  straightforward but requires careful checking.

  ARIN can't host output of non-hosted RPKI engines without this, and
  that's critical to the security model as discussed with ARIN
  staff in late 2006, so I believe we need this capability even as
  part of the initial limited test.

  PRIORITY: Required

  TIME REQUIRED: 1-2 weeks for implementation once protocol settled,
  depending on how much of the protocol and implementation I can steal
  from the existing left-right protocol.

  STATUS: Started

- Subsetting (req_* attributes in up-down protocol)

  Minimal implementation would be to recognize this as correct
  protocol and signal an internal server error if it's ever used.

  More serious implementation would require expanding SQL child_cert
  table to hold subset masks and tweaking almost every bit of code
  that touches that table.

  PRIORITY: Required

  TIME REQUIRED (minimal version): One day

  TIME REQUIRED (real version): 1-2 weeks

  STATUS: Not started
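
The minimal version could be sketched roughly as below.  This assumes
the request arrives as a parsed XML element and that any attribute
whose name starts with "req_" counts as subsetting, even with an empty
value.  The exception name is hypothetical; the caller would be
expected to turn it into an internal server error response.

```python
import xml.etree.ElementTree as ET


class SubsettingNotImplemented(Exception):
    """Hypothetical exception, to be mapped to an internal server error."""


def reject_subsetting(request_elt):
    """Raise if the request element carries any req_* attribute at all."""
    used = sorted(name for name in request_elt.attrib
                  if name.startswith("req_"))
    if used:
        raise SubsettingNotImplemented(
            "unsupported attributes: " + ", ".join(used))
```

Requests without any req_* attribute pass through untouched, so this
check can sit in front of the existing issue handling.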

- Error handling: make sure that exceptions map correctly to up-down
  error codes, flesh out left-right error codes.  Note that the same
  exception may produce different error codes depending on which
  up-down PDU we're processing (sigh).

  Will require code audit for coherency.

  PRIORITY: Required

  TIME REQUIRED: four days

  DEPENDS ON: almost everything else, as almost any code change can
  raise new exceptions that we'd need to handle.

  STATUS: Not started
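
One way to handle the "same exception, different code" problem is a
table keyed on both the PDU type and the exception class.  A minimal
sketch follows.  The exception classes, PDU type strings, and function
name are made up for illustration, and the numeric codes are my
reading of the up-down specification; check them against the draft in
use before relying on them.

```python
class NoSuchClass(Exception):
    """Hypothetical: the request named an unknown resource class."""


class BadCertRequest(Exception):
    """Hypothetical: the certificate request could not be parsed."""


# Fallback for anything the table does not cover (assumed value).
INTERNAL_SERVER_ERROR = 2001

# (PDU type, exception class) -> up-down error code (assumed values).
ERROR_CODES = {
    ("issue", NoSuchClass): 1201,
    ("issue", BadCertRequest): 1203,
    ("revoke", NoSuchClass): 1301,
}


def error_code(pdu_type, exc):
    """Pick the error code for exc raised while handling pdu_type."""
    # Walk the class hierarchy so subclasses inherit their parent's code.
    for klass in type(exc).__mro__:
        code = ERROR_CODES.get((pdu_type, klass))
        if code is not None:
            return code
    return INTERNAL_SERVER_ERROR
```

Keeping the mapping in one table also gives the code audit a single
place to check for coverage.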

- db.commit(), db.rollback(), code audit for data integrity issues,
  fix any data integrity issues that turn up.

  Among other issues, we need to handle loss of connection to
  database server and other MySQL errors.  MySQLdb throws an
  exception, which we can catch, and retrying is easy enough, but need
  to be careful about recovery action depending on whether we had
  uncommitted changes.

  PRIORITY: Required

  TIME REQUIRED (commit and rollback): Two weeks

  TIME REQUIRED (data integrity audit): Three days

  TIME REQUIRED (fix data integrity): Unknown, depends on code audit
  and results of runtime testing.

  DEPENDS ON: async tasking model, sort of -- could do it first, but
  tasking change will affect the exception handling that triggers
  rollback.

  STATUS: Not started
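
A rough sketch of the retry logic, assuming each unit of work can be
rerun from scratch on a fresh connection.  ConnectionLost is a
stand-in for whatever the driver raises (with MySQLdb presumably some
flavor of MySQLdb.OperationalError), and run_transaction(), connect,
and work are names invented for this example.

```python
class ConnectionLost(Exception):
    """Stand-in for the driver's lost-connection error (hypothetical)."""


def run_transaction(connect, work, retries=3):
    """Run work(db) as one transaction, retrying on ConnectionLost.

    connect() returns a DB-API style connection; work(db) does the SQL.
    Assumes retries >= 1.
    """
    last_error = None
    for _ in range(retries):
        db = None
        try:
            db = connect()
            result = work(db)
            db.commit()
            return result
        except ConnectionLost as e:
            # Uncommitted changes went away with the connection, so the
            # whole unit of work is rerun.  If the loss happened inside
            # commit() itself we cannot tell whether the commit landed;
            # this sketch ignores that case.
            last_error = e
        except Exception:
            # Any other failure: undo partial changes and let it propagate.
            if db is not None:
                db.rollback()
            raise
        finally:
            if db is not None:
                try:
                    db.close()
                except Exception:
                    pass
    raise last_error
```

The point of funnelling everything through one function is that the
commit, rollback, and retry decisions live in a single place, which
should make the data integrity audit easier.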

- Test with larger data set -- Tim gave me plenty of data, I have the
  low-level tools and the glue logic to create child objects for all
  the entities in the IRDB, but I don't yet have logic to poll on
  behalf of each of them and check result for sanity.

  Maybe it'd be easier to write something that dumps Tim's database in
  YAML format for testbed.py to chew on?

  PRIORITY: Highly desirable

  TIME REQUIRED (setup): One day to convert Tim's data to YAML

  TIME REQUIRED (testing): Unknown, depends on what we turn up

  STATUS: Not started

- Clean up rootd.py to be usable in a production system.   Most urgent
  issue is handling of private keys.   May not need much else, as this
  is not a high-traffic server.

  PRIORITY: Highly desirable (not strictly needed for limited testing)

  TIME REQUIRED: Two days

  STATUS: Not started

- Test framework, multiple self-instances per engine-instance (single
  self-instance per engine-instance is already done).

  PRIORITY: Required

  DEPENDS ON: async tasking model.

  TIME REQUIRED: One week

  STATUS: Not started

- tlslite code seems flaky under heavy use, and doesn't support all
  the cert checks we want.  Best bet for getting this right is
  probably to hack on the POW Ssl class until it supports everything
  shown in the OpenSSL book; aside from speed, the main advantage here
  is that there -is- a list of all the things one needs to do to use
  TLS properly if one follows this recipe, whereas with TLSlite it's
  all a mystery.

  Useful side effect of doing this via POW: it brings us back to only
  needing one crypto library (in particular it lets us punt M2Crypto,
  which appears to be coded as an accident waiting to happen).

  PRIORITY: Required (cert checking is a security issue).

  TIME REQUIRED: Two weeks.

  DEPENDS ON: Async tasking model.

  STATUS: Not started

- ROA generation.  We have a bunch of the primitives for this but we
  aren't yet generating the ROAs themselves.

  For reasons that presumably made sense at the time, the left-right
  protocol for route_origin objects allows ranges as well as prefixes,
  and the SQL stores everything as ranges, which is nice and
  general...except that ROAs can only hold prefixes.  So left-right
  should only allow prefixes, and SQL should only store prefixes.

  PRIORITY: Required

  TIME REQUIRED: Three days

  STATUS: Not started
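
To illustrate the range versus prefix mismatch: an arbitrary range may
need several prefixes to cover it exactly, and only a range that
collapses to a single prefix maps directly onto a ROA entry.  The
sketch below uses the ipaddress module from later Python versions (not
available in Python 2.5), and the function names are invented for the
example; this is not how the existing code does it.

```python
import ipaddress


def range_to_prefixes(first, last):
    """Return the minimal list of prefixes exactly covering first..last."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))


def is_single_prefix(first, last):
    """True if the range first..last is exactly one prefix."""
    return len(range_to_prefixes(first, last)) == 1
```

For example, 10.0.0.0 through 10.0.0.255 is the single prefix
10.0.0.0/24, while 10.0.0.0 through 10.0.2.255 needs both 10.0.0.0/23
and 10.0.2.0/24.  Something like is_single_prefix() could be used to
find existing SQL rows that would not survive a prefix-only schema.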

- Make rpkid fully event-driven (async tasking model), except for SQL
  queries.  This probably involves the "twisted" framework.

  PRIORITY: Required (to implement hosting model)

  TIME REQUIRED: one week.

  STATUS: Not started

- Update biz trust anchor model to what we came up with in Amsterdam.
  This was a direct result of security review by Kent and Housley.

  This has been waiting for work we hope RobK is doing.  This is
  probably not a lot of coding, probably a few extra cert fields in
  the self object which we then need to toss into the
  rpki.x509.X509_chain objects before verifying CMS or TLS, and
  perhaps the existing TA fields in various objects become pairs of
  certs instead of a single TA, but this is mostly just generalization
  and reuse of existing code, no bold new adventures.

  PRIORITY: Required (security issue)

  TIME REQUIRED: One week.

  STATUS: Not started

- Performance testing

  STATUS: Not started

- rcynic handling of RPKI trust anchors probably needs updating.
  Discussions over last N months of how RPKI trust anchors work, how
  we package them, and how we roll them over.  The last (TA rollover)
  is the driver for this.

  Last I recall (need to check email archives) APNIC had proposed a
  relatively simple format (CMS signed PEM-encoded X.509 object set,
  or something like that).  Need to do analysis to make sure this is
  adequate for our needs, if so just use it.  This would involve minor
  changes to rcynic.

  Alternatively, this could be a separate program to keep this grot
  out of rcynic itself, but that's probably a usability nightmare.

  PRIORITY: Required (usability issue for relying parties)

  TIME REQUIRED: Three days.

  STATUS: Not started

- rcynic does not yet handle manifests.  This is both a real problem
  (manifests were added to plug a security hole) and a user acceptance
  problem (without manifest support rcynic checks old certs that are
  supposed to fail because they've been revoked, resulting in what
  appear to be spurious errors, which just annoy the user).

  PRIORITY: Required

  TIME REQUIRED: One week.

  STATUS: Not started

- Update internals docs (Doxygen).   Mostly this means updating
  function comments in the Python code, as the rest is automatic.  May
  require a bit of overview text to explain the workings of the code,
  this overview text may well turn out to be just the current flat
  text documents marked up for inclusion by Doxygen.

  PRIORITY: Desirable

  TIME REQUIRED: Two days

  STATUS: Ongoing

- Reorganize code (directory names, module names, which objects are in
  which modules, add gctx pointers to objects so we can stop passing
  all these flipping explicit gctx pointers in almost every function
  call) to make it easier to understand and maintain.  Portions of the
  existing code were done in extreme haste to meet testing deadlines,
  and it shows.

  STATUS: File renaming mostly done, other stuff not started

  TIME REQUIRED: two days

  PRIORITY: Highly desirable (to preserve programmers' and
  maintainers' sanity, if nothing else)

- Add HSM support.  Architecture includes it, current code does not.
  First step here would be talking to somebody who understands PKCS#11
  better than I do, ie, Richard Lamb or Francis Dupont.

  STATUS: Not started

  TIME REQUIRED: Unknown

  PRIORITY: Desirable.  Am guessing ARIN does not require this for
  initial test

- Installation packaging, so that rpkid can be installed like a normal
  package.

  STATUS: Not started

  TIME REQUIRED: Three days

  PRIORITY: Desirable

- Tighten up syntax checking in left-right schema.

  STATUS: Not started

  TIME REQUIRED: Less than a day

  PRIORITY: Desirable

- Rethink exposing SQL primary indices in protocols.  Right now, we
  use autoincremented SQL indices in many places in the left-right
  protocol, and they're even exposed in a few places in our
  implementation of the up-down protocol.  This is nice and unique but
  may be operationally fragile, since up-down usage means that URLs
  contain mechanically assigned identifiers rather than an identifier
  negotiated between the two parties during contract setup.

  RobK suggested that we should instead use something like a hash of
  the client's name, which would be probabilistically unique, would
  not expose information, but would be stable even if we had to
  rebuild the database.

  STATUS: Not started

  TIME REQUIRED: Two or three days to evaluate.  Implementation time
  if we decide to make a change unknown, but probably on the order of
  a few days.

  PRIORITY: Rethinking desirable; reworking unknown
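
A minimal sketch of the hash idea, assuming SHA-256 over the UTF-8
encoded client name, truncated to a fixed number of hex digits.  The
choice of hash, the truncation length, and the function name are all
assumptions for illustration, not anything we have settled on.

```python
import hashlib


def client_handle(name, length=16):
    """Derive a stable identifier from a client name.

    The same name always yields the same handle, so handles survive a
    database rebuild.  Truncation trades URL length against collision
    probability.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return digest[:length]
```

One caveat to weigh during evaluation: an unsalted hash of a guessable
name can still be confirmed by anybody who guesses the name, so how
much information this really hides depends on how the names are
chosen.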



Things implemented but not yet tested:

- Client side of expiration now assumes that parent will reissue
  when its IRDB changes.

- Parent side of revocation (child_cert objects) and CRL generation
  implemented.

- Parent side of expiration implemented.

- Child batch processing loop: regeneration or removal of expired
  certs based on what's in the IRDB.

- Batch regeneration of CRLs and manifests for all CAs.

- Protection against up-down operations specifying a class_name that
  belongs to some other self context.

- Rewrote code that handles revoke on shrink to revoke -all- old certs
  for that key, not just most recent.  Not certain, but this may have
  been the cause of a cert dropping not showing up in the CRL during
  testing with APNIC in Vancouver.

- Kludgy local publication hack seems to work now, including
  withdrawal.  rcynic still whines occasionally, but I think that's
  just because, without manifest support, rcynic has no way of telling
  the difference between certs we withdrew on purpose and certs that
  were removed by an attacker, so the first rcynic run after a cert
  has been revoked pulls the old cert from the previous rcynic pass,
  finds that it's listed in the CRL, and whines about it.



Other random notes:

Being able to specify interaction with other servers (not running
under testbed) in a testbed.yaml might be useful for interop tests.
Kind of breaks testbed's fundamental model, though.  Replacing what
testbed thinks is a leaf with somebody else would be easy, so maybe we
could specify some way to hang a bunch of rpkids under an external
parent?  Hmm, data needed would look a lot like testpoke.yaml, maybe
we can reuse some of that language?

There's a three-way tradeoff lurking in the publication protocol,
manifest generation, and CRL generation:

1) Consistency issues for relying parties (eg, don't want to withdraw
   something that's still listed in the manifest);

2) Efficiency issues for the RPKI engine (eg, generating a new
   manifest for each individual change during a batch run could be
   expensive, would prefer to batch up the changes into a single
   manifest run); and

3) Coherency issues for the RPKI engine (don't want to defer things
   that could result in loss of state if something bad happens).

Considerations (1) and (3) have to dominate, which may mean we take a
hit on (2).

Most of the explicit calls to sql_fetch*() are now encapsulated in
one-line methods.  The remaining ones are probably hints at minor bits
of abstraction still to be done.

Biz certs currently used by test scripts don't include SKI or AKI.  I
think this is because the test scripts use "openssl x509" rather than
"openssl ca" when generating these certs.  Not critical, and will
probably become completely irrelevant with all-singing all-dancing
post-Amsterdam biz cert scripts, but should not be a big problem to
fix either if it gets in the way again.