$Id$ -*- Text -*-

Python RPKI production tools.

Requires Python 2.5.

External Python packages required:

- lxml, which in turn requires the libxml2 C libraries.
  http://codespeak.net/lxml/
  FreeBSD: /usr/ports/devel/py-lxml

- MySQLdb, which in turn requires MySQL client and server. I'm
  testing with MySQL 5.1.
  http://sourceforge.net/projects/mysql-python/
  FreeBSD: /usr/ports/databases/py-MySQLdb

- TLSLite, which pulls in other crypto packages.
  http://trevp.net/tlslite/
  FreeBSD: /usr/ports/security/py-tlslite

- Cryptlib, at the moment just to support TLSlite but may end up
  using it for other things later.
  http://www.cs.auckland.ac.nz/~pgut001/cryptlib/
  FreeBSD: /usr/ports/security/cryptlib

  ...but the FreeBSD port doesn't (yet?) install the Python bindings,
  sigh, so at the moment you have to do that by hand:

  # cd /usr/ports/security/cryptlib
  # make install
  # cd work/bindings
  # python setup.py install
  # cd ../..
  # make clean

- Eventually I expect that this will require an event-handling
  package like Twisted, but I'm not there yet.

- The testpoke tool (up-down protocol command line test client) and
  testbed tools also use PyYAML.
  http://pyyaml.org/
  FreeBSD: /usr/ports/devel/py-yaml

We also use a hacked copy of the Python OpenSSL Wrappers (POW)
package, but our copy has enough modifications that it's expanded in
the Subversion tree. Depending on how this all works out, I may end
up splitting the POW.pkix module out of the POW package and using it
with Cryptlib, as the POW.pkix package is 98% about doing ASN.1 in
pure Python and only 2% about any kind of crypto.

$Revision$

TO DO:

- Scripted tests to grow and shrink and revoke and .... See
  testbed.*.yaml, but more systematic testing needed.

  PRIORITY: Required
  TIME REQUIRED: open-ended
  STATUS: Ongoing

- Randy's "user validation tool" (fetch and validate certs and
  probably the ROA for a prefix I want to accept in a route filter I
  am building in Python/Perl). This probably uses rcynic's output as
  one of its inputs.
  This is a basic tool for a sysadmin who wants to -use- all this
  crud we're working so hard to generate. It's not required for the
  generation tools to work, but without it the entire toolset does
  nothing obviously useful, which will make it a very hard sell
  during the limited public test stage.

  PRIORITY: Required
  DEPENDS ON: ROA generation
  TIME REQUIRED: three days
  STATUS: Not started

- Common protocol dump format with APNIC and other implementors so
  we can read each other's dumps. "Obvious" format would be an
  OpenSSL-style PEM of the CMS, with a "text" portion (the place
  where "openssl x509 -text" would put a text dump of a cert) showing
  the wrapped XML.

  PRIORITY: Desirable
  TIME REQUIRED: one day
  STATUS: Not started

- Clean unused cruft out of left-right protocol, or at least have
  control booleans we don't intend to implement at present signal an
  error if used. Bottleneck here has been deciding what to punt and
  what to implement. Removing unused booleans or raising errors when
  they're used is trivial.

  PRIORITY: Required
  TIME REQUIRED: Less than one day
  STATUS: Error signalling done

- Publication protocol and implementation thereof. Protocol design
  started, Randy had comments that sent me back to the drawing board
  (he was right). Next step is to integrate Randy's advice, which
  probably means picking up more of the left-right protocol
  framework.

  Desirable although not strictly required that the protocol be
  agreed upon among the RIRs. Might not be practical given how long
  it takes the group to decide anything.

  Tricky bit is making sure that repository receives enough
  information to know whether parent has authorized child to use
  parent's namespace in nesting case. In theory this is
  straightforward but requires careful checking. ARIN can't host
  output of non-hosted RPKI engines without this, and that's critical
  to the security model as discussed with ARIN staff in late 2006, so
  I believe we need this capability even as part of the initial
  limited test.

  PRIORITY: Required
  TIME REQUIRED: 1-2 weeks for implementation once protocol settled,
                 depending on how much of the protocol and
                 implementation I can steal from the existing
                 left-right protocol.
  STATUS: Started

- Subsetting (req_* attributes in up-down protocol)

  Minimal implementation would be to recognize this as correct
  protocol and signal an internal server error if it's ever used.

  More serious implementation would require expanding SQL child_cert
  table to hold subset masks and tweaking almost every bit of code
  that touches that table.

  PRIORITY: Required
  TIME REQUIRED (minimal version): One day
  TIME REQUIRED (real version): 1-2 weeks
  STATUS: Not started

- Error handling: make sure that exceptions map correctly to up-down
  error codes, flesh out left-right error codes. Note that the same
  exception may produce different error codes depending on which
  up-down PDU we're processing (sigh). Will require code audit for
  coherency.

  PRIORITY: Required
  TIME REQUIRED: four days
  DEPENDS ON: almost everything else, as almost any code change can
              raise new exceptions that we'd need to handle.
  STATUS: Not started

- db.commit(), db.rollback(), code audit for data integrity issues,
  fix any data integrity issues that turn up. Among other issues, we
  need to handle loss of connection to database server and other
  MySQL errors. MySQLdb throws an exception, which we can catch, and
  retrying is easy enough, but need to be careful about recovery
  action depending on whether we had uncommitted changes.

  PRIORITY: Required
  TIME REQUIRED (commit and rollback): Two weeks
  TIME REQUIRED (data integrity audit): Three days
  TIME REQUIRED (fix data integrity): Unknown, depends on code audit
                                      and results of runtime testing.
  DEPENDS ON: async tasking model, sort of -- could do it first, but
              tasking change will affect the exception handling that
              triggers rollback.
  STATUS: Not started

- Test with larger data set -- Tim gave me plenty of data, I have
  the low-level tools and the glue logic to create child objects for
  all the entities in the IRDB, but I don't yet have logic to poll on
  behalf of each of them and check result for sanity. Maybe it'd be
  easier to write something that dumps Tim's database in YAML format
  for testbed.py to chew on?

  PRIORITY: Highly desirable
  TIME REQUIRED (setup): One day to convert Tim's data to YAML
  TIME REQUIRED (testing): Unknown, depends on what we turn up
  STATUS: Not started

- Clean up rootd.py to be usable in a production system. Most urgent
  issue is handling of private keys. May not need much else, as this
  is not a high-traffic server.

  PRIORITY: Highly desirable (not strictly needed for limited testing)
  TIME REQUIRED: Two days
  STATUS: Not started

- Test framework, multiple self-instances per engine-instance
  (single self-instance per engine-instance is already done).

  PRIORITY: Required
  DEPENDS ON: async tasking model.
  TIME REQUIRED: One week
  STATUS: Not started

- tlslite code seems flakey under heavy use, and doesn't support all
  the cert checks we want. Best bet for getting this right is
  probably to hack on the POW Ssl class until it supports everything
  shown in the OpenSSL book; aside from speed, the main advantage
  here is that there -is- a list of all the things one needs to do to
  use TLS properly if one follows this recipe, whereas with TLSlite
  it's all a mystery. Useful side effect of doing this via POW: it
  brings us back to only needing one crypto library (in particular it
  lets us punt M2Crypto, which appears to be coded as an accident
  waiting to happen).

  PRIORITY: Required (cert checking is a security issue).
  TIME REQUIRED: Two weeks.
  DEPENDS ON: Async tasking model.
  STATUS: Not started

- ROA generation. We have a bunch of the primitives for this but we
  aren't yet generating the ROAs themselves.
  For reasons that presumably made sense at the time, the left-right
  protocol for route_origin objects allows ranges as well as
  prefixes, and the SQL stores everything as ranges, which is nice
  and general...except that ROAs can only hold prefixes. So
  left-right should only allow prefixes, and SQL should only store
  prefixes.

  PRIORITY: Required
  TIME REQUIRED: Three days
  STATUS: Not started

- Make rpkid fully event-driven (async tasking model), except for
  SQL queries. This probably involves the "twisted" framework.

  PRIORITY: Required (to implement hosting model)
  TIME REQUIRED: one week.
  STATUS: Not started

- Update biz trust anchor model to what we came up with in
  Amsterdam. This was a direct result of security review by Kent and
  Housley. This has been waiting for work we hope RobK is doing.

  This is probably not a lot of coding, probably a few extra cert
  fields in the self object which we then need to toss into the
  rpki.x509.X509_chain objects before verifying CMS or TLS, and
  perhaps the existing TA fields in various objects become pairs of
  certs instead of a single TA, but this is mostly just
  generalization and reuse of existing code, no bold new adventures.

  PRIORITY: Required (security issue)
  TIME REQUIRED: One week.
  STATUS: Not started

- Performance testing

  STATUS: Not started

- rcynic handling of RPKI trust anchors probably needs updating.
  Discussions over last N months of how RPKI trust anchors work, how
  we package them, and how we roll them over. The last (TA rollover)
  is the driver for this. Last I recall (need to check email
  archives) APNIC had proposed a relatively simple format (CMS signed
  PEM-encoded X.509 object set, or something like that). Need to do
  analysis to make sure this is adequate for our needs, if so just
  use it. This would involve minor changes to rcynic. Alternatively,
  this could be a separate program to keep this grot out of rcynic
  itself, but that's probably a usability nightmare.

  PRIORITY: Required (usability issue for relying parties)
  TIME REQUIRED: Three days.
  STATUS: Not started

- rcynic does not yet handle manifests. This is both a real problem
  (manifests were added to plug a security hole) and a user
  acceptance problem (without manifest support rcynic checks old
  certs that are supposed to fail because they've been revoked,
  resulting in what appear to be spurious errors, which just annoy
  the user).

  PRIORITY: Required
  TIME REQUIRED: One week.
  STATUS: Not started

- Update internals docs (Doxygen). Mostly this means updating
  function comments in the Python code, as the rest is automatic.
  May require a bit of overview text to explain the workings of the
  code, this overview text may well turn out to be just the current
  flat text documents marked up for inclusion by Doxygen.

  PRIORITY: Desirable
  TIME REQUIRED: Two days
  STATUS: Ongoing

- Reorganize code (directory names, module names, which objects are
  in which modules, add gctx pointers to objects so we can stop
  passing all these flipping explicit gctx pointers in almost every
  function call) to make it easier to understand and maintain.
  Portions of the existing code were done in extreme haste to meet
  testing deadlines, and it shows.

  STATUS: File renaming mostly done, other stuff not started
  TIME REQUIRED: two days
  PRIORITY: Highly desirable (to preserve programmers' and
            maintainers' sanity, if nothing else)

- Add HSM support. Architecture includes it, current code does not.
  First step here would be talking to somebody who understands
  PKCS#11 better than I do, ie, Richard Lamb or Francis Dupont.

  STATUS: Not started
  TIME REQUIRED: Unknown
  PRIORITY: Desirable. Am guessing ARIN does not require this for
            initial test

- Installation packaging, so that rpkid can be installed like a
  normal package.

  STATUS: Not started
  TIME REQUIRED: Three days
  PRIORITY: Desirable

- Tighten up syntax checking in left-right schema.

  STATUS: Not started
  TIME REQUIRED: Less than a day
  PRIORITY: Desirable

- Rethink exposing SQL primary indices in protocols. Right now, we
  use autoincremented SQL indices in many places in the left-right
  protocol, and they're even exposed in a few places in our
  implementation of the up-down protocol. This is nice and unique but
  may be operationally fragile, since up-down usage means that URLs
  contain mechanically assigned identifiers rather than an identifier
  negotiated between the two parties during contract setup. RobK
  suggested that we should instead use something like a hash of the
  client's name, which would be probabilistically unique, would not
  expose information, but would be stable even if we had to rebuild
  the database.

  STATUS: Not started
  TIME REQUIRED: Two or three days to evaluate. Implementation time
                 if we decide to make a change unknown, but probably
                 on the order of a few days.
  PRIORITY: Rethinking desirable; reworking unknown

Things implemented but not yet tested.

- Client side of expiration now assumes that parent will reissue
  when its IRDB changes.

- Parent side of revocation (child_cert objects) and CRL generation
  implemented.

- Parent side of expiration implemented.

- Child batch processing loop: regeneration or removal of expired
  certs based on what's in the IRDB.

- Batch regeneration of CRLs and manifests for all CAs.

- Protection against up-down operations specifying a class_name that
  belongs to some other self context.

- Rewrote code that handles revoke on shrink to revoke -all- old
  certs for that key, not just most recent. Not certain, but this may
  have been the cause of a cert dropping not showing up in the CRL
  during testing with APNIC in Vancouver.

- Kludgy local publication hack seems to work now, including
  withdrawal.
  rcynic still whines occasionally, but I think that's just because,
  without manifest support, rcynic has no way of telling the
  difference between certs we withdrew on purpose and certs that were
  removed by an attacker, so the first rcynic run after a cert has
  been revoked pulls the old cert from the previous rcynic pass,
  finds that it's listed in the CRL, and whines about it.

Other random notes:

Being able to specify interaction with other servers (not running
under testbed) in a testbed.yaml might be useful for interop tests.
Kind of breaks testbed's fundamental model, though. Replacing what
testbed thinks is a leaf with somebody else would be easy, so maybe
we could specify some way to hang a bunch of rpkids under an external
parent? Hmm, data needed would look a lot like testpoke.yaml, maybe
we can reuse some of that language?

There's a three-way tradeoff lurking in the publication protocol,
manifest generation, and CRL generation:

1) Consistency issues for relying parties (eg, don't want to
   withdraw something that's still listed in the manifest);

2) Efficiency issues for the RPKI engine (eg, generating a new
   manifest for each individual change during a batch run could be
   expensive, would prefer to batch up the changes into a single
   manifest run); and

3) Coherency issues for the RPKI engine (don't want to defer things
   that could result in loss of state if something bad happens).

Considerations (1) and (3) have to dominate, which may mean we take a
hit on (2).

Most of the explicit calls to sql_fetch*() are now encapsulated in
one-line methods. The remaining ones are probably hints at minor bits
of abstraction still to be done.

Biz certs currently used by test scripts don't include SKI or AKI. I
think this is because the test scripts use "openssl x509" rather than
"openssl ca" when generating these certs.
Not critical, and will probably become completely irrelevant with
all-singing all-dancing post-Amsterdam biz cert scripts, but should
not be a big problem to fix either if it gets in the way again.
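
On the ROA generation item in the TO DO list: since ROAs can only hold
prefixes, any ranges already stored in SQL would have to be either
expressed as prefixes or rejected. The following is a minimal sketch
of that conversion, not code from rpkid. It assumes a newer Python
whose standard library includes the ipaddress module (not available
in Python 2.5), and the function names range_to_prefixes and
is_single_prefix are hypothetical.

```python
import ipaddress


def range_to_prefixes(first, last):
    """Return the CIDR prefixes that exactly cover first..last inclusive.

    Hypothetical helper: 'first' and 'last' are address strings of the
    same IP version, with first <= last.
    """
    lo = ipaddress.ip_address(first)
    hi = ipaddress.ip_address(last)
    # summarize_address_range() yields the minimal set of networks
    # covering the inclusive range lo..hi.
    return [str(net) for net in ipaddress.summarize_address_range(lo, hi)]


def is_single_prefix(first, last):
    """True if the range can be stored as exactly one prefix."""
    return len(range_to_prefixes(first, last)) == 1
```

A range such as 10.0.0.0 to 10.0.0.255 maps to a single prefix, while
10.0.0.0 to 10.0.2.255 needs two. A left-right handler restricted to
prefixes could use a check like is_single_prefix() to reject anything
that is not already a prefix.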