What kind of algorithm extracts the contents of a webpage, as Instapaper does?

There are 2 steps that Instapaper does:

1. Find the main content block on the page (excluding headers, footers, menus etc.)
2. From the content block, extract and format the text.

To find the content block (typically an HTML block element, such as a div containing the key text content of the page), Instapaper uses an algorithm like the one used by Readability. You can look at the source of readability.js to see what's going on. At its core it tries to find the area on the page with the highest text/link ratio, although it has other simple scoring metrics (e.g. off the top of my head, things like the ratio of text to commas, the number of para elements etc.) that go into the heuristics.

Once you have identified the root node element with the relevant content, you'll need to format it. If you want, you can just pull the node element containing the text out of the source document and insert it into yours, but in reality you'll want to remove the existing styles and apply your own, for a standard look and feel. If you want to output nice text-only content, you can use something like Jericho's Renderer.

Update 1: I should mention something else Instapaper does: it will follow 'pagination' links (...
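
To make the "highest text/link ratio" idea from the answer concrete, here is a minimal sketch in Python using only the standard library. It is not Readability's or Instapaper's actual code: the names `BlockScorer`, `score` and `find_content_block`, the set of candidate tags and the scoring weights are all assumptions made up for illustration. It also assumes reasonably well-nested HTML and credits text only to the innermost open block, which is simpler than what readability.js does.

```python
from html.parser import HTMLParser

# Tags treated as candidate content blocks (an assumption for this sketch).
BLOCK_TAGS = {"div", "article", "section", "main", "td"}


class BlockScorer(HTMLParser):
    """Collects simple text statistics for each candidate block element."""

    def __init__(self):
        super().__init__()
        self.stack = []       # candidate blocks that are currently open
        self.blocks = []      # candidate blocks that have been closed
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag == "p":
            if self.stack:
                self.stack[-1]["paras"] += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "id": dict(attrs).get("id"),
                               "text": 0, "link": 0, "commas": 0, "paras": 0})

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            # Assumes well-nested markup: the end tag closes the innermost block.
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth or not text or not self.stack:
            return
        block = self.stack[-1]  # credit only the innermost open block
        block["text"] += len(text)
        block["commas"] += text.count(",")
        if self.link_depth:
            block["link"] += len(text)


def score(block):
    """Higher for long, comma-rich, paragraph-rich text with few links."""
    if block["text"] == 0:
        return 0.0
    link_density = block["link"] / block["text"]
    # Arbitrary illustrative weights, not taken from readability.js.
    base = block["text"] / 100 + block["commas"] + block["paras"]
    return base * (1 - link_density)


def find_content_block(html):
    """Return the stats dict of the best-scoring block, or None if there is none."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    candidates = parser.blocks + parser.stack  # include blocks never closed
    return max(candidates, key=score, default=None)


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About us</a></div>
<div id="content">
<p>Instapaper finds the main block, strips the chrome, and reformats the text.</p>
<p>The heuristic favours long runs of text, with commas, and few links.</p>
</div>
<div id="footer"><a href="/terms">Terms</a> Copyright</div>
</body></html>
"""

best = find_content_block(SAMPLE)
print(best["id"], round(score(best), 2))
```

On the sample page the navigation block is made up entirely of link text, so its score drops to zero, while the block holding the two paragraphs wins. A real implementation would also need to cope with malformed markup and with content spread over nested containers, which is where the extra heuristics in readability.js come in.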