Screen Scraping
n.
The act of capturing data from a system or program by snooping
the contents of some display that is not actually intended for
data transport or inspection by programs. Around 1980 this term
referred to tricks like reading the display memory of a smart
terminal through its auxiliary port. Nowadays it often refers
to parsing the HTML in generated web pages with programs designed
to mine out particular patterns of content. In either guise, screen scraping
is an ugly, ad-hoc, last-resort technique that is very likely
to break on even minor changes to the format of the data being
snooped.
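
As an illustration of the HTML-parsing sense of the term, here is a minimal sketch using Python's standard html.parser module. The page fragment, the "price" class name, and the scrape_prices helper are all hypothetical and exist only for this example. The second call shows the fragility described above: a minor change to the markup makes the scraper silently return nothing.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real generated pages are usually messier.
PAGE = """
<html><body>
<table id="prices">
  <tr><td class="item">Widget</td><td class="price">$4.50</td></tr>
  <tr><td class="item">Gadget</td><td class="price">$12.00</td></tr>
</table>
</body></html>
"""


class PriceScraper(HTMLParser):
    """Collects the text of every <td class="price"> cell."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "td" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def scrape_prices(html):
    """Return the price strings found in the given HTML text."""
    scraper = PriceScraper()
    scraper.feed(html)
    return scraper.prices


print(scrape_prices(PAGE))

# A cosmetic change by the site (renaming one CSS class) breaks the scraper.
print(scrape_prices(PAGE.replace('class="price"', 'class="cost"')))
```

The scraper depends on presentation details (a tag name and a class attribute) that the page's author never promised to keep stable, which is why the technique is a last resort.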
Deep Web/Hidden Web
n.
The Deep Web (or Hidden Web) comprises all information that resides
in autonomous databases behind portals and information providers'
web front-ends. Web pages in the Deep Web are dynamically generated
in response to a query through a web site's search form and often
contain rich content. A recent study has estimated the size of
the Deep Web to be more than 500 billion pages, whereas the size
of the "crawlable" web is only 1% of the Deep Web (i.e.,
less than 5 billion pages). Even those web sites with some static
links that are "crawlable" by a search engine often
have much more information available only through a query interface.
Unlocking this vast Deep Web content presents a major research
challenge.
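
To make "generated in response to a query through a search form" concrete, the sketch below builds the URL that a GET search form might request. The endpoint https://example.com/search and the field names q and city are assumptions made for illustration, not any real site's interface. No static link points at such a URL, so a crawler that only follows links never reaches the page behind it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_url(form_action, fields):
    """Return the URL a GET search form would request for the given fields."""
    return form_action + "?" + urlencode(fields)


# Hypothetical form action and field names.
url = build_query_url(
    "https://example.com/search",
    {"q": "used cars", "city": "Austin"},
)
print(url)

# The server side would decode the same fields to run its database query.
print(parse_qs(urlsplit(url).query))
```

A Deep Web crawler has to discover such forms and choose plausible field values itself, which is a large part of why the problem is hard.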
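Vertical Search
Vertical search is, in essence, a simplifying integration of the ways in which vertical portals deliver information.
An ordinary horizontal (general-purpose) search engine searches at the level of web pages, whereas vertical search works at the level of individual data items, so its granularity is finer and its precision higher. A vertical search serves one specific function; for example, a user looking for rental or home-purchase listings is performing a vertical search. Reprocessing the collected information is critical, whether the data is structured or unstructured.
Content sources for vertical search: (A) the portal site's own resources; (B) resources that industry users supply through open interfaces; (C) resources posted by ordinary users; (D) resources crawled from industry users.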
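
The difference between page-level and data-item-level search can be sketched with the housing example. Everything below (the Listing fields, the sample records, and the search helper) is hypothetical; it assumes the reprocessing step has already turned raw listings into structured records.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """One data item extracted from a listing source (hypothetical schema)."""
    city: str
    kind: str       # "rent" or "sale"
    bedrooms: int
    price: int      # monthly rent or sale price, in illustrative units


# Sample records standing in for reprocessed portal, user, and crawled data.
LISTINGS = [
    Listing("Beijing", "rent", 2, 6500),
    Listing("Beijing", "sale", 3, 4200000),
    Listing("Shanghai", "rent", 1, 5200),
    Listing("Beijing", "rent", 1, 4800),
]


def search(listings, *, city, kind, max_price):
    """Return the listings that match every structured criterion."""
    return [
        item for item in listings
        if item.city == city and item.kind == kind and item.price <= max_price
    ]


results = search(LISTINGS, city="Beijing", kind="rent", max_price=6000)
for item in results:
    print(item)
```

A page-level keyword search for "Beijing rent" could at best return whole pages that mention those words; filtering on a price ceiling is only possible because each listing is a separate item with typed fields.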