html - Instapaper-like algorithm -


Does anyone know of an algorithm that extracts the contents of a webpage, like Instapaper does?

There are two steps that Instapaper performs:

  1. Find the main content block on the page (excluding headers, footers, menus, etc.)
  2. From the content block, extract and format the text

To find the content block (typically an HTML block element, such as a div, containing the key page text content), Instapaper uses an algorithm similar to the one used by Readability. You can look at the source of readability.js to see what's going on, but at its core it tries to find the area on the page with the highest text-to-link ratio, although it has a number of other simple scoring metrics (e.g., off the top of my head, things like the ratio of text to commas, the number of paragraph elements, etc.) that go into its heuristics.
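As a rough illustration of the text-to-link idea, here is a minimal sketch in Python using only the standard library. It is not Readability's actual code: the `BlockScorer` class, the `score` function, the set of block tags, and the weights for commas and paragraphs are all assumptions made up for this example, and text is credited only to the innermost enclosing block.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch, not taken from readability.js.
BLOCK_TAGS = {"div", "article", "section", "td"}
SKIP_TAGS = {"script", "style"}


class BlockScorer(HTMLParser):
    """Collects simple text statistics for each block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open blocks, innermost last
        self.blocks = []      # blocks that have been closed
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "attrs": dict(attrs),
                               "text": 0, "link": 0, "commas": 0, "paras": 0})
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.stack:
            self.stack[-1]["paras"] += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            if self.stack and self.stack[-1]["tag"] == tag:
                self.blocks.append(self.stack.pop())
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.stack:
            return
        block = self.stack[-1]  # credit text to the innermost block only
        block["text"] += len(text)
        block["commas"] += text.count(",")
        if self.link_depth:
            block["link"] += len(text)


def score(block):
    """Higher for long, comma-rich, paragraph-rich text with few links."""
    if block["text"] == 0:
        return 0.0
    link_density = block["link"] / block["text"]
    # The weights 10 and 20 are arbitrary values chosen for illustration.
    raw = block["text"] + 10 * block["commas"] + 20 * block["paras"]
    return raw * (1.0 - link_density)


def find_main_block(html):
    """Return the statistics of the best-scoring block, or None."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    candidates = parser.blocks + parser.stack  # include unclosed blocks
    return max(candidates, key=score, default=None)
```

Calling `find_main_block(page_html)` returns a dictionary whose `attrs` entry (for example an `id`) can be used to locate the winning element. A real implementation would also propagate scores to parent elements and penalise class names such as "sidebar" or "comment".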

Once you have identified the root node element that contains the relevant content, you'll need to format it. If you want, you can just pull the node element containing the text out of the source document and insert it into yours, but in reality you'll probably want to remove the existing styles and apply your own, for a standard look and feel. If you want to output nice text-only content, you can use Jericho's Renderer.
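Removing the existing styles could look something like the sketch below, again in standard-library Python. The `StyleStripper` class, the `strip_styles` function, and the list of dropped attributes are assumptions for this example; it drops script and style elements, presentational attributes, inline event handlers, and comments, and re-emits everything else.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical lists for this sketch; extend them to suit your pages.
DROP_ELEMENTS = {"script", "style"}
DROP_ATTRS = {"style", "class", "id", "align", "bgcolor", "width", "height"}


class StyleStripper(HTMLParser):
    """Re-serialises HTML without styling attributes or script/style blocks."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.drop_depth = 0  # > 0 while inside a dropped element

    def _render(self, tag, attrs, closer=">"):
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name not in DROP_ATTRS and not name.startswith("on")
        )
        return f"<{tag}{kept}{closer}"

    def handle_starttag(self, tag, attrs):
        if tag in DROP_ELEMENTS:
            self.drop_depth += 1
        elif not self.drop_depth:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, not as a pair.
        if tag not in DROP_ELEMENTS and not self.drop_depth:
            self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag in DROP_ELEMENTS:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif not self.drop_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.drop_depth:
            # The parser has already decoded entities, so re-escape the text.
            self.out.append(escape(data, quote=False))


def strip_styles(fragment):
    """Return the fragment with styling and scripts removed."""
    parser = StyleStripper()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

The cleaned fragment can then be wrapped in your own template and stylesheet. External stylesheets linked from the original page are left behind automatically, since only the content node is carried over.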

Update 1: I should mention something else Instapaper does: it will follow 'pagination' links (the "next" or "1", "2", "3" links) of an article to its conclusion, so that a piece that may span many pages in the original can be rendered as a single document.

Update 2: I just came across a comparison of text extraction algorithms.
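How Instapaper detects these links is not documented here, so the following is only a guess at one simple approach: collect every link whose visible text looks like a pagination label and resolve it against the page URL. The `LinkCollector` class, the `find_pagination_links` function, and the regular expression are assumptions for this sketch.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches "next", "Next page", "Next »", or a one- or two-digit page number.
PAGE_LABEL = re.compile(r"^(next( page)?|\d{1,2})\W*$", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.href = None   # href of the <a> currently open, if any
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.text).strip()))
            self.href = None


def find_pagination_links(html, base_url):
    """Return absolute URLs of likely pagination links, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    seen, result = set(), []
    for href, text in parser.links:
        if PAGE_LABEL.match(text):
            url = urljoin(base_url, href)
            if url not in seen:  # "2" and "Next" often point to the same page
                seen.add(url)
                result.append(url)
    return result
```

Each returned URL would then be fetched and run through the same content-extraction step, with the extracted blocks concatenated in order. A crawler built on this should also remember which URLs it has already visited so that "previous" style links cannot cause a loop.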

