Uses for an HTML parser that harvests the content of Web pages for NewEco:
- For FilJob: Harvesting job-available (help-wanted) ads from "job boards" such as Monster.Com.
- For VTAorg: Harvesting public-transit schedules (timetables).
- Harvesting poorly classified images from Google in order to classify them better under specific categories: people; places on Earth; events and ongoing situations; types of objects; biological classifications/clades; astronomical sights/locations.
Progress towards developing the parse/harvest software:
- Update 2009.Jun.25 (already posted to Twitter): Finished integrating the 2009.Mar handling of all common character encodings (US-ASCII, Latin-1, UTF-8, etc.) with the 2007.Apr tokenizer and collector for SGML. The tokenizer is now upgraded to handle XML as well, so a single algorithm can parse SGML, HTML, XHTML, and XML.
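As a rough illustration of the two stages named in the update (decoding bytes, then tokenizing markup with one algorithm), here is a minimal Python sketch. It is not the NewEco code: the names `decode_bytes`, `TOKEN_RE`, and `tokenize` are hypothetical, the "try UTF-8, fall back to Latin-1" rule is an assumed simplification, and the single regular expression is only a stand-in for a real tokenizer.

```python
import re

def decode_bytes(raw: bytes) -> str:
    """Decode page bytes (assumed rule: UTF-8 first, else Latin-1).

    US-ASCII is a subset of UTF-8, so it is covered by the first attempt.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")

# One pattern applied uniformly to SGML, HTML, XHTML, and XML input.
# Simplification: a '>' inside an attribute value ends the tag early.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"      # <!-- ... -->
    r"|(?P<decl><[!?][^>]*>)"       # <!DOCTYPE ...> or <?xml ...?>
    r"|(?P<tag></?[A-Za-z][^>]*>)"  # start, end, or self-closing tag
    r"|(?P<text>[^<]+|<)",          # character data, or a stray '<'
    re.DOTALL,
)

def tokenize(markup: str):
    """Yield (kind, source_text) pairs in document order."""
    for match in TOKEN_RE.finditer(markup):
        yield match.lastgroup, match.group()

if __name__ == "__main__":
    page = b'<p class="x">caf\xe9<br/></p>'  # Latin-1 bytes, XHTML-style <br/>
    for kind, text in tokenize(decode_bytes(page)):
        print(kind, repr(text))
```

The sketch treats HTML's `<br>` and XHTML's `<br/>` identically at the token level, which is one way a single pass can serve all four markup families; a "collector" stage would then assemble the tokens into elements.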