NLP-NLU

By Maurizio Farina | Posted on October 2017 | DRAFT

This tutorial is an overview of NLP and NLU.

ListFeeds.com is a feed aggregator. Our goal is to build an application able to query feeds by location.

For this reason the ListFeeds engine is built on crawler and NLP libraries that grab feeds from different data sources and extract text in order to geolocate each feed.

This post describes the libraries and toolkits analyzed to achieve this goal.

DBpedia

"DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data." from DBpedia web site

Ontology: currently there are 685 classes described by 2,795 different properties. For a complete list refer to the DBpedia Ontology Classes web page; here is just an example:

Class Instances
Place 735,000
Person 1,450,000
Work 411,000
Species 251,000
Organisation 241,000

It is important to highlight that DBpedia relies on infoboxes to enhance the information extracted from Wikipedia. Thanks to these infobox mappings it is possible to attach properties to Wikipedia items.

Starting from here it is possible to find all mappings for the Italian language. Selecting Museo shows all its properties and their ontology; see the Museo template page for a complete explanation. The important point for us is that we can search Wikipedia for the Museo mapping, retrieve all "Museo" records and access all the properties described in the mapping template.

The DBpedia Ontology can be queried via the DBpedia SPARQL endpoint or from here.
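
As an illustration, here is a minimal sketch of querying the endpoint from Python with the SPARQLWrapper library. The Italian endpoint URL, the dbo:Museum class and the example query are assumptions based on the standard DBpedia setup, not part of the ListFeeds code; adjust them to the mapping you actually need.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Assumed endpoint: the Italian DBpedia SPARQL service.
    sparql = SPARQLWrapper("http://it.dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)

    # Fetch a few museums together with their Italian label.
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?museum ?label WHERE {
            ?museum a dbo:Museum ;
                    rdfs:label ?label .
            FILTER (lang(?label) = "it")
        } LIMIT 10
    """)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["museum"]["value"], "-", row["label"]["value"])

The same pattern works against the main DBpedia endpoint; only the endpoint URL changes.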

A bit of Theory

NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.

Natural language understanding (NLU) is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension.

Apache OpenNLP

Stanford CoreNLP

Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, etc.
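
As a quick sketch (not the official client library), the snippet below posts a sentence to a CoreNLP server that is assumed to be already running on localhost:9000, then prints part-of-speech and named-entity tags; the sample text and annotator list are purely illustrative.

    import json
    import requests

    # Assumes a CoreNLP server was started locally, e.g. with:
    #   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
    props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
    text = "Stanford University is located in California."

    resp = requests.post(
        "http://localhost:9000/",
        params={"properties": json.dumps(props)},
        data=text.encode("utf-8"),
    )
    doc = resp.json()

    # Each sentence carries its tokens with word, POS tag and NER label.
    for sentence in doc["sentences"]:
        for token in sentence["tokens"]:
            print(token["word"], token["pos"], token["ner"])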

Unfortunately Stanford CoreNLP doesn't support Italian, but the Tint project provides models for the Italian language.

SyntaxNet

SyntaxNet, from Google, is an open-source neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems.

The following link explains why NLP and NLU are so complicated for software.

References

Open source: OpeNER, Stanford NLP

Commercial: SpaCy.io

Documentation

Resource: Description
NER resources: A curated list of resources dedicated to Natural Language Processing.
Italian DBpedia Group: 1.5M entities, of which 500,000 are classified using the ontology: 263,000 persons, 144,000 locations, 29,000 movies and so on.
Italian DBpedia download site: RDF dumps in Turtle serialization format, ready to download.