Identifying Entity Attribute Relations


This patent, granted March 1, 2022, is about identifying entity-attribute relationships in bodies of text.

Search applications, like search engines and knowledge bases, try to meet a searcher’s informational needs and show the most useful resources to the searcher.

Structured Data May Help With Identifying Attribute Relationships Better

 
Identifying entity-attribute relationships gets done to build structured search results.
 
Structured search results present a list of attributes with answers for an entity specified in a user request, such as a query.
 
So, the structured search results for “Kevin Durant” may include attributes such as salary, team, birth year, family, etc., along with answers that provide information about these attributes.
 
Constructing such structured search results can require identifying entity-attribute relations.
 
An entity-attribute relation is a particular case of a text relation between a pair of terms.
 
The first term in the pair of terms is an entity: a person, place, organization, or concept.
 
The second term is an attribute or a string that describes an aspect of the entity.
 
Examples include:
  • “Date of birth” of a person
  • “Population” of a country
  • “Salary” of an athlete
  • “CEO” of an organization
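
As a minimal illustration (mine, not the patent's), candidate pairs like the ones above can be represented as simple (entity, attribute) tuples that a classifier would then accept or reject:

```python
# Hypothetical, minimal representation of entity-attribute candidate pairs.
# The pairs below are illustrative, not taken from the patent.
candidate_pairs = [
    ("Kevin Durant", "salary"),
    ("France", "population"),
    ("Google", "CEO"),
]

for entity, attribute in candidate_pairs:
    print(f"Is '{attribute}' an actual attribute of '{entity}'?")
```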

Providing more information about entities in content and schema (and structured data) gives a search engine more to explore about those specific entities: it can test and collect data, disambiguate what it knows, and gain more and better confidence about the entities that it is aware of.

Entity-Attribute Candidate Pairs

This patent obtains an entity-attribute candidate pair that defines an entity and an attribute, where the attribute is a candidate attribute of the entity. In addition to learning from facts about entities in structured data, Google can use the context of that information and learn from vectors and the co-occurrence of other words and facts about those entities too.
Take a look at the word vectors patent to get a sense of how a search engine may now get a better sense of the meanings and context of words and information about entities. (This is a chance to learn from patent exploration about how Google is doing some of the things it is doing.) Google collects facts and data about the things it indexes and may learn about the entities in its index and the attributes it knows about them.
It does this by:
  • Determining, with sentences that include the entity and attribute, whether the attribute is an actual attribute of the entity in the entity-attribute candidate pair
  • Generating embeddings for words in the set of sentences that include the entity and the attribute
  • Creating, with known entity-attribute pairs, a distributional attribute embedding for the entity, where the distributional attribute embedding for the entity specifies an embedding for the entity based on other attributes associated with the entity from the known entity-attribute pairs
  • Deciding, based on the embeddings for words in the sentences, the distributional attribute embedding for the entity, and the distributional attribute embedding for the attribute, whether the attribute in the entity-attribute candidate pair is an actual attribute of the entity in the entity-attribute candidate pair

Embeddings For Words Get Made From Sentences With The Entity And The Attribute

This involves the following steps (a sketch of the final feedforward step appears after this list):
  • Building a first vector representation specifying a first embedding of the words between the entity and the attribute in the set of sentences
  • Making a second vector representation specifying a second embedding for the entity based on the set of sentences
  • Constructing a third vector representation specifying a third embedding for the attribute based on the set of sentences
  • Making a fourth vector representation, using known entity-attribute pairs, specifying the distributional attribute embedding for the entity
  • Building a fifth vector representation, using the known entity-attribute pairs, specifying the distributional attribute embedding for the attribute
  • Determining, based on the first, second, third, fourth, and fifth vector representations, whether the attribute in the entity-attribute candidate pair is an actual attribute of the entity in the entity-attribute candidate pair
  • That determination gets performed using a feedforward network: generating a single vector representation by concatenating the first, second, third, fourth, and fifth vector representations, and inputting the single vector representation into the feedforward network
  • Determining, by the feedforward network and using the single vector representation, whether the attribute in the entity-attribute candidate pair is an actual attribute of the entity in the entity-attribute candidate pair
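
Here is a minimal sketch of that final step, assuming the five vector representations have already been computed; all dimensions, layer sizes, and tensor values are illustrative assumptions, not details from the patent:

```python
import torch
import torch.nn as nn

DIM = 64  # illustrative embedding size, not from the patent

path_vec = torch.randn(DIM)         # 1st: words between entity and attribute
entity_vec = torch.randn(DIM)       # 2nd: embedding for the entity
attr_vec = torch.randn(DIM)         # 3rd: embedding for the attribute
entity_dist_vec = torch.randn(DIM)  # 4th: distributional attribute embedding (entity)
attr_dist_vec = torch.randn(DIM)    # 5th: distributional attribute embedding (attribute)

# Concatenate the five representations into a single vector.
single_vec = torch.cat([path_vec, entity_vec, attr_vec, entity_dist_vec, attr_dist_vec])

# A small feedforward network decides whether the candidate attribute
# is an actual attribute of the entity.
feedforward = nn.Sequential(
    nn.Linear(5 * DIM, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
    nn.Sigmoid(),
)

probability = feedforward(single_vec)
print(f"P(actual attribute) = {probability.item():.3f}")
```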
 
Making a fourth vector representation, with known entity-attribute pairs, specifying the distributional attribute embedding for the entity comprises:
  • Identifying a set of attributes associated with the entity in the known entity-attribute pairs, wherein the set of attributes omits the candidate attribute
  • Generating a distributional attribute embedding for the entity by computing a weighted sum of the attributes in the set of attributes
 
Making a fifth vector representation, with known entity-attribute pairs, specifying the distributional attribute embedding for the attribute comprises:
  • Identifying, using the attribute, a set of entities from among the known entity-attribute pairs; and, for each entity in the set of entities
  • Determining a set of attributes associated with the entity, where the set of attributes does not include the candidate attribute
  • Generating a distributional attribute embedding for the attribute by computing a weighted sum of the attributes in the set of attributes

 The Advantage Of More Accurate Entity-Attribute Relations Over Prior Art Model-Based Entity-Attribute Identification

Prior art entity-attribute identification techniques used model-based approaches, such as natural language processing (NLP) features, distant supervision, and traditional machine learning models, which identify entity-attribute relations by representing entities and attributes based on the sentences in the data within which these terms appear.
 
In contrast, the innovations described in this specification identify entity-attribute relations in datasets by using information about how entities and attributes are expressed in the data within which these terms appear and by representing entities and attributes using other features known to be associated with these terms. This enables representing entities and attributes with details shared by similar entities, improving the accuracy of identifying entity-attribute relations that otherwise cannot be discerned by considering only the sentences within which these terms appear.
 
For example, consider a scenario in which the dataset includes sentences where two entities, “Ronaldo” and “Messi,” get described using a “record” attribute, and a sentence where the entity “Messi” gets described using a “goals” attribute. In such a scenario, the prior art techniques may identify the following entity-attribute pairs: (Ronaldo, record), (Messi, record), and (Messi, goals). The innovations described in this specification go beyond these prior art approaches by identifying entity-attribute relations that might not be discerned from how these terms get used in the dataset.
 
Using the above example, the innovation described in this specification determines that “Ronaldo” and “Messi” are similar entities because they share the “record” attribute, and it can then represent “Ronaldo” using the “goals” attribute as well. In this way, the innovations described in this specification can enable identifying entity-attribute relations, e.g., (Cristiano, Goals), even though such a relationship may not be discernible from the dataset alone.
 

 The Identifying Attribute Relationships Patent

[Flowchart: identifying entity-attribute relations]

 
Inventors: Dan Iter, Xiao Yu, and Fangtao Li
Assignee: Google LLC
US Patent: 11,263,400
Granted: March 1, 2022
Filed: July 5, 2019
 
 
Abstract
 
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, that facilitate identifying entity-attribute relationships in text corpora.
Methods include determining whether an attribute in a candidate entity-attribute pair is an actual attribute of the entity in the entity-attribute candidate pair.
This includes generating embeddings for words in the set of sentences that include the entity and the attribute.
It also includes generating, using known entity-attribute pairs, an attribute distributional embedding for the entity based on other attributes associated with the entity from the known entity-attribute pairs, and generating an attribute distributional embedding for the attribute based on known attributes associated with known entities of the attribute in the known entity-attribute pairs.
Based on these embeddings, a feedforward network determines whether the attribute in the entity-attribute candidate pair is an actual attribute of the entity in the entity-attribute candidate pair.

Identifying Entity Attribute Relationships In Text

 
A candidate entity-attribute pair (where the attribute is a candidate attribute of the entity) is input to a classification model. The classification model uses a path embedding engine, a distributional representation engine, a distributional attribute engine, and a feedforward network. It determines whether the attribute in the candidate entity-attribute pair is an actual attribute of the entity in the candidate entity-attribute pair.
 
The path embedding engine generates a vector representing an embedding of the paths, or the words, that connect the co-occurrences of the entity and the attribute in a set of sentences (e.g., 30 or more sentences) of a dataset. The distributional representation engine generates vectors representing embeddings for the entity and attribute terms based on the context within which these terms appear in the set of sentences. The distributional attribute engine generates a vector representing an embedding for the entity and another vector representing an embedding for the attribute.
 
The attribute distributional engine’s embedding for the entity is based on other features (i.e., attributes other than the candidate attribute) known to be associated with the entity in the dataset. The attribute distributional engine’s embedding for the attribute is based on other features associated with known entities of the candidate attribute.
 
The classification model concatenates the vector representations from the path embedding engine, the distributional representation engine, and the distributional attribute engine into a single vector representation. The classification model then inputs the single vector representation into a feedforward network that determines, using the single vector representation, whether the attribute in the candidate entity-attribute pair is an actual attribute of the entity in the candidate entity-attribute pair.
Suppose the feedforward network determines that the attribute in the candidate entity-attribute pair is an actual attribute of the entity in the candidate entity-attribute pair. In that case, the candidate entity-attribute pair gets stored in the knowledge base along with other known/actual entity-attribute pairs.
 

Extracting Entity Attribute Relations

 
The environment includes a classification model that, for candidate entity-attribute pairs in a knowledge base, determines whether an attribute in a candidate entity-attribute pair is an actual attribute of the entity in the candidate pair. The classification model is a neural network model, and its components get described below. The classification model can also be implemented using other supervised and unsupervised machine learning models.
 
The knowledge base, which can include databases (or other appropriate data storage structures) stored in non-transitory data storage media (e.g., hard drive(s), flash memory, etc.), holds a set of candidate entity-attribute pairs. The candidate entity-attribute pairs get obtained using a set of content in text documents, such as webpages and news articles, obtained from a data source. The Data Source can include any source of content, such as a news website, a data aggregator platform, a social media platform, etc.
 
The data source obtains news articles from a data aggregator platform. The data source can use a supervised or unsupervised machine learning model (e.g., a natural language processing model) that generates a set of candidate entity-attribute pairs by extracting sentences from the articles and tokenizing and labeling the extracted sentences, e.g., as entities and attributes, using part-of-speech and dependency parse tree tags (a sketch of this kind of labeling follows below).
The data source can input the extracted sentences into a machine learning model that, for example, gets trained using a set of training sentences and their associated entity-attribute pairs. Such a machine learning model can then output candidate entity-attribute pairs for the input extracted sentences.
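
As a hedged sketch of that kind of labeling (one common way to do it, not necessarily Google's), a dependency parser such as spaCy can produce the part-of-speech and dependency parse tree tags the patent mentions:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Kevin Durant's salary was reported by the team.")

for token in doc:
    # token.pos_ is the part-of-speech tag; token.dep_ is the dependency label.
    print(token.text, token.pos_, token.dep_, token.head.text)
```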
 
In the knowledge base, the data source stores the candidate entity-attribute pairs and the sentences extracted by the data source that include the words of the candidate entity-attribute pairs. The candidate entity-attribute pairs are only stored in the knowledge base if the number of sentences in which the entity and attribute are present satisfies (e.g., meets or exceeds) a threshold number of sentences (e.g., 30 sentences).
 
A classification model determines whether the attribute in a candidate entity-attribute pair (stored in the knowledge base) is an actual attribute of the entity in the candidate entity-attribute pair. The classification model includes a path embedding engine, a distributional representation engine, a distributional attribute engine, and a feedforward network. As used herein, the term engine refers to a data processing apparatus that performs a set of tasks. The operations of these engines in determining whether the attribute in a candidate entity-attribute pair is an actual attribute of the entity get described below.
 

An Example Process For Identifying Entity Attribute Relations

Operations of the process are described below as being performed by the system’s components for illustration purposes only. Operations of the process can get accomplished by any appropriate device or system, e.g., any applicable data processing apparatus. Operations of the process can also get implemented as instructions stored on a non-transitory computer-readable medium. Execution of the instructions causes data processing apparatus to perform operations of the process.
 
The knowledge base obtains an entity-attribute candidate pair from the data source.
 
The knowledge base obtains a set of sentences from the data source that include the words of the entity and the attribute in the candidate entity-attribute pair.
 
Based on the set of sentences and the candidate entity-attribute pair, the classification model determines whether the candidate attribute is an actual attribute of the candidate entity. The set of sentences can be a large number of sentences, e.g., 30 or more sentences.

The Classification Model Performs The Following Operations

  • Generating embeddings for words in the set of sentences that include the entity and the attribute, which gets described in greater detail below
  • Creating, using known entity-attribute pairs, a distributional attribute embedding for the entity, which gets described in greater detail below
  • Building, using the known entity-attribute pairs, a distributional attribute embedding for the attribute, which gets described in greater detail below
  • Choosing, based on the embeddings for words in the set of sentences, the distributional attribute embedding for the entity, and the distributional attribute embedding for the attribute, whether the attribute in the entity-attribute candidate pair is an actual attribute of the entity in the entity-attribute candidate pair
 
The path embedding engine generates a first vector representation specifying an embedding of the words between the entity and the attribute in the sentences. The path embedding engine detects relationships between candidate entity-attribute terms by embedding the paths, or the words, that connect the co-occurrences of these terms in the set of sentences.
For the phrase “snake is a reptile,” the path embedding engine generates an embedding for the path “is a,” which can get used to detect, e.g., genus-species relationships, which can then get used to identify other entity-attribute pairs.

Generating Embeddings For The Words Between The Entity And The Attribute

 
The path embedding engine does the following to generate embeddings for the words between the entity and the attribute in the sentences. For each sentence in the set of sentences, the path embedding engine first extracts the dependency path (which specifies a group of words) between the entity and the attribute. The path embedding engine converts the sentence from a string to a list, where the first term is the entity and the last term is the attribute (or the first term is the attribute and the last term is the entity).
 
Each term (which is also referred to as an edge) in the dependency path gets represented using the following features: the lemma of the term, a part-of-speech tag, the dependency label, and the direction of the dependency path (left, right, or root). Each of these features gets embedded and concatenated to produce a vector representation for the term or edge, $\vec{v}_e$, which comprises a sequence of vectors $(\vec{v}_l, \vec{v}_{pos}, \vec{v}_{dep}, \vec{v}_{dir})$, as shown by the equation: $\vec{v}_e = [\vec{v}_l, \vec{v}_{pos}, \vec{v}_{dep}, \vec{v}_{dir}]$
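
A minimal sketch of building one such edge vector, assuming small illustrative vocabulary and embedding sizes (none of these numbers come from the patent):

```python
import torch
import torch.nn as nn

# One embedding table per edge feature; sizes are illustrative assumptions.
lemma_emb = nn.Embedding(num_embeddings=10_000, embedding_dim=50)  # lemma
pos_emb = nn.Embedding(num_embeddings=20, embedding_dim=8)         # part-of-speech tag
dep_emb = nn.Embedding(num_embeddings=50, embedding_dim=8)         # dependency label
dir_emb = nn.Embedding(num_embeddings=3, embedding_dim=4)          # left / right / root

def edge_vector(lemma_id, pos_id, dep_id, dir_id):
    # Embed each feature and concatenate them into one vector for the edge.
    tables = [lemma_emb, pos_emb, dep_emb, dir_emb]
    ids = [lemma_id, pos_id, dep_id, dir_id]
    return torch.cat([table(torch.tensor(i)) for table, i in zip(tables, ids)])

v_e = edge_vector(42, 3, 7, 0)
print(v_e.shape)  # torch.Size([70]) = 50 + 8 + 8 + 4
```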
 
The path embedding engine then inputs the sequence of vectors for the terms or edges in each path into a long short-term memory (LSTM) network, which produces a single vector representation for the sentence, $\vec{v}_s$, as shown by the equation: $\vec{v}_s = \mathrm{LSTM}(\vec{v}_e^{(1)}, \ldots, \vec{v}_e^{(k)})$
 
Finally, the path embedding engine inputs the single vector representations for all sentences in the set of sentences into an attention mechanism, which determines a weighted mean of the sentence representations, $\vec{v}_{sents(e,a)}$, as shown by the equation: $\vec{v}_{sents(e,a)} = \mathrm{ATTN}(\vec{v}_s^{(1)}, \ldots, \vec{v}_s^{(n)})$
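
Putting those two steps together, here is a hedged sketch (illustrative sizes, random stand-in data) of an LSTM that turns a sequence of edge vectors into one sentence vector, and an attention mechanism that produces a weighted mean over all the sentence vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EDGE_DIM, SENT_DIM = 70, 64  # illustrative, matching the edge-vector sketch above

lstm = nn.LSTM(input_size=EDGE_DIM, hidden_size=SENT_DIM, batch_first=True)
attn_scorer = nn.Linear(SENT_DIM, 1)  # scores each sentence vector

def sentence_vector(edge_vectors):  # edge_vectors: (num_edges, EDGE_DIM)
    _, (h_n, _) = lstm(edge_vectors.unsqueeze(0))
    return h_n.squeeze(0).squeeze(0)  # final hidden state serves as v_s

# Three stand-in dependency paths of different lengths.
paths = [torch.randn(k, EDGE_DIM) for k in (4, 6, 3)]
sent_vecs = torch.stack([sentence_vector(p) for p in paths])  # (3, SENT_DIM)

# Attention: softmax over per-sentence scores, then a weighted mean.
weights = F.softmax(attn_scorer(sent_vecs), dim=0)  # (3, 1)
v_sents = (weights * sent_vecs).sum(dim=0)          # (SENT_DIM,)
print(v_sents.shape)
```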
 
The distributional representation engine generates a second vector representation for the entity and a third vector representation for the attribute based on the sentences. The distributional representation engine detects relationships between candidate entity-attribute terms based on the context within which the attribute and the entity of the candidate entity-attribute pair occur in the set of sentences. For example, the distributional representation engine may determine that the entity “New York” gets used in the set of sentences in a way that suggests that this entity refers to a city or state in the United States.
 
As another example, the distributional representation engine may determine that the attribute “capital” gets used in the set of sentences in a way that suggests that this attribute refers to a significant city within a state or country. Thus, the distributional representation engine generates a vector representation specifying an embedding for the entity ($\vec{v}_e$) using the context (i.e., the set of sentences) within which the entity appears. The distributional representation engine generates a vector representation ($\vec{v}_a$) specifying an embedding for the attribute using the set of sentences in which the attribute appears.
 
The distributional attribute engine generates a fourth vector representation specifying a distributional attribute embedding for the entity using known entity-attribute pairs. The known entity-attribute pairs, which get stored in the knowledge base, are entity-attribute pairs for which it has been confirmed (e.g., through prior processing by the classification model or based on a human evaluation) that each attribute in the entity-attribute pair is an actual attribute of the entity in the entity-attribute pair.
 
The distributional attribute engine performs the following operations to determine a distributional attribute embedding that specifies an embedding for the entity using some (e.g., the most common) or all the other known attributes among the known entity-attribute pairs with which that entity gets associated.

Identifying Other Attributes For Entities

For the entity in the entity-attribute candidate pair, the distributional attribute engine identifies attributes, other than the one included in the entity-attribute candidate pair, associated with the entity in the known entity-attribute pairs.
 
For an entity “Michael Jordan” in the candidate entity-attribute pair (Michael Jordan, famous), the attribute distributional engine can use the known entity-attribute pairs for Michael Jordan, such as (Michael Jordan, wealthy) and (Michael Jordan, record), to identify attributes such as “wealthy” and “record.”
 
The attribute distributional engine then generates an embedding for the entity by computing a weighted sum of the identified known attributes (as described in the preceding paragraph), where the weights get learned through an attention mechanism, as shown in the equation: $\vec{v}_e = \mathrm{ATTN}(\epsilon(a_1), \ldots, \epsilon(a_m))$
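
A minimal sketch of that weighted sum, with random stand-in embeddings and arbitrary attention scores (a real system would learn the scores; the attribute names are just the examples above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings for the other known attributes of "Michael Jordan".
known_attribute_embeddings = {
    "wealthy": rng.normal(size=50),
    "record": rng.normal(size=50),
}

# Softmax over (stand-in) attention scores gives the weights.
scores = np.array([0.7, 1.3])
weights = np.exp(scores) / np.exp(scores).sum()

# v_e is the weighted sum of the known attribute embeddings.
v_e = sum(w * emb for w, emb in zip(weights, known_attribute_embeddings.values()))
print(v_e.shape)  # (50,)
```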
 
The distributional attribute engine generates a fifth vector representation specifying a distributional attribute embedding for the attribute using the known entity-attribute pairs. The distributional attribute engine performs the following operations to determine an embedding based on some (e.g., the most common) or all of the known attributes associated with known entities of the candidate attribute.
For the attribute in the entity-attribute candidate pair, the distributional attribute engine identifies the known entities among the known entity-attribute pairs that have the attribute.
 
For each identified known entity, the distributional attribute engine identifies other attributes (i.e., attributes other than the one included in the entity-attribute candidate pair) associated with the entity in the known entity-attribute pairs. The distributional attribute engine can identify a subset of attributes from among the identified attributes by:
 
(1) Ranking attributes based on the number of known entities associated with each attribute, such as assigning a higher rank to attributes associated with a higher number of entities than to attributes associated with fewer entities
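
A small sketch of that ranking idea, using made-up known pairs; it counts how many distinct entities each attribute is associated with and ranks attributes with more entities higher:

```python
from collections import defaultdict

# Made-up known entity-attribute pairs, for illustration only.
known_pairs = [
    ("Michael Jordan", "record"), ("Messi", "record"), ("Ronaldo", "record"),
    ("Michael Jordan", "wealthy"), ("Messi", "goals"),
]

entities_per_attribute = defaultdict(set)
for entity, attribute in known_pairs:
    entities_per_attribute[attribute].add(entity)

ranked = sorted(entities_per_attribute,
                key=lambda a: len(entities_per_attribute[a]), reverse=True)
print(ranked)  # ['record', 'wealthy', 'goals'] (record has the most entities)
```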
 

Searching Quotes of Entities Modified at Google

The Patent Behind Searching Quotes of Entities Has Been Modified by Continuation Patent Again

 

Searching Quotes of Entities

When Google updates some processes, they may file an updated patent to protect the intellectual property behind the process.  This may mean filing a patent where most of the description for the patent is identical or nearly a duplicate to earlier versions of the patent.  Titles sometimes change a little, but the list of authors mostly remains the same (I have seen one where a new author was added.)


Google’s patent “Systems and methods for searching quotes of entities using a database” has been updated a second time. Understanding what has changed involves reading through the patent’s claims and seeing how the description of how the patent works has changed.

When the USPTO decides whether or not to grant a patent, they have prosecuting agents go through the claims to see if they are new, non-obvious, and useful. Since a continuation patent is trying to update the protection and use the date of the original patent as the start of the exclusion period, the patent agent makes sure that those new claims are valid before granting a continuation patent.

I first wrote about this patent in an earlier post at Go Fish Digital: Google Searching Quotes of Entities.  If you want a good idea of how the process behind that patent worked when it came out originally, I would recommend reading through that post before you go too much further here.

I followed up with a post at SEObythesea: Quote Searching Updated at Google to Focus on Videos. It describes the changes to the process described in the claims after the first continuation patent.

Those claims have been updated again and provide hints at how Google treats entity information that may have been initially kept in the knowledge graph.

Comparing Claims From The Searching Quotes of Entities Patent Versions

August 8, 2017 – Systems and methods for searching quotes of entities using a database:

1. A computerized system for searching and identifying quotes, the system comprising: a memory device that stores a set of instructions; and at least one processor that executes the set of instructions to: receive a search query for a quote from a user; parse the query to identify one or more key words; match the one or more key words to knowledge graph items associated with candidate subject entities in a knowledge graph stored in one or more databases, wherein the knowledge graph includes a plurality of items associated with a plurality of subject entities and a plurality of relationships between the plurality of items; determine, based on the matching knowledge graph items, a relevance score for each of the candidate subject entities; identify, from the candidate subject entities, one or more subject entities for the query based on the relevance scores associated with the candidate subject entities; identify a set of quotes corresponding to the one or more subject entities; determine quote scores for the identified quotes based on at least one of the relationship of each quote to the one or more subject entities, the recency of each quote, or the popularity of each quote; select quotes from the identified quotes based on the quote scores; and transmit information to a display device to display the selected quotes to the user.

February 5, 2019 – Systems and methods for searching quotes of entities using a database:

1. A method comprising the following operations performed by one or more processors: receiving audio content from a client device of a user; performing audio analysis on the audio content to identify a quote in the audio content; determining the user as an author of the audio content based on recognizing the user as the speaker of the audio content; identifying, based on words or phrases extracted from the quote, one or more subject entities associated with the quote; storing, in a database, the quote, and an association of the quote to the subject entities and to the user being the author; subsequent to storing the quote and the association: receiving, from the user, a search query; parsing the search query to identify that the search query requests one or more quotes by the user about one or more of the subject entities; identifying, from the database and responsive to the search query, a set of quotes by the user corresponding to the one or more of the subject entities, the set of quotes including the quote; selecting the quote from the quotes of the set based at least in part on the recency of each quote; and transmitting, in response to the search query, information for presenting the selected quote to the user via the client device or an additional client device of the user.

Compare those first two claims to the first claim from the newest version of the patent, which was granted earlier this week.  It has a few changes from the first two versions.

February 15, 2022 – Systems and methods for searching quotes of entities using a database:

1. A computer system, the system comprising: a memory device that stores a set of instructions; and at least one processor that executes the set of instructions to: retrieve an electronic resource, wherein the electronic resource is a webpage or is a document; parse the electronic resource to identify one or more key words; match the one or more key words to a subject entity from a subject entity database; identify a plurality of quotes based on the subject entity of the subject entity database, wherein each quote of the plurality of quotes is identified from an additional electronic resource comprising a webpage; identify an additional subject entity that is associated with the subject entity of the subject entity database; select a subset of the identified plurality of quotes based on the subset of the identified quotes being associated with the additional subject entity; determine quote scores for the subset of identified quotes, wherein each of the quote scores is for a corresponding one of the quotes of the subset and is determined based on one or multiple of: a relationship of the corresponding quote to the subject entity, a recency of the corresponding quote, and a popularity of the corresponding quote; select, based on the quote scores, a quote from the subset of identified quotes; and transmit information to a client device accessing the electronic resource, wherein transmitting the information causes the client device to display the selected quote and a selectable hyperlink to the webpage from which the selected quote was identified.
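
To make the scoring language in that claim concrete, here is a hedged sketch of one way quote scores could combine a quote's relationship to the subject entity, its recency, and its popularity; the weights and the scoring function are my illustrative assumptions, not Google's:

```python
from datetime import datetime

def quote_score(quote, now=datetime(2022, 2, 15)):
    days_old = (now - quote["date"]).days
    recency = 1.0 / (1.0 + days_old / 365.0)        # newer quotes score higher
    return (0.5 * quote["entity_relationship"]      # tie to the subject entity
            + 0.3 * recency
            + 0.2 * quote["popularity"])

quotes = [
    {"text": "Quote A", "entity_relationship": 0.9, "popularity": 0.4,
     "date": datetime(2021, 12, 1)},
    {"text": "Quote B", "entity_relationship": 0.6, "popularity": 0.9,
     "date": datetime(2020, 1, 1)},
]

best = max(quotes, key=quote_score)
print(best["text"])
```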

The titles of the patents haven’t changed, and no authors were added.  The Drawings and most of the descriptions are the same.

  1. The original first claim refers to “matching keywords to knowledge graph items.”
  2. The second first claim does not include the knowledge graph and says that the process is about “performing audio analysis on the audio content to identify a quote in the audio content.”
  3. The newest first claim replaces the knowledge graph from the first version when it says it will “match the one or more keywords to a subject entity from a subject entity database.”
  4. Unlike the first two versions, the newest first claim describes making the quote information attributable.

So I Am Left With Questions About Searching Quotes of Entities

 

  1. Why is a subject entity database introduced, and why might that be different from the knowledge graph? It sounds like it could be a thesaurus of information that isn’t browseable and transparent to searchers the way that information from the knowledge graph might be. Is other information about entities kept separate from the knowledge graph, too, until it is decided how best to display that information?
  2. Where are the quotes stored?  The second first claim tells us that audio content is analyzed instead of looking in the knowledge graph. In the third first claim, searching quotes of entities is done by looking through quotes in the subject entity database. When I first read the second version of the patent, I took it to mean that information about the quotes was kept in an index of video information.  It is, however, likely that Google may have information about some quotes that it doesn’t necessarily have a video for.
  3. Where is quote information coming from? The third first claim tells us that it may provide to the searcher a “selectable hyperlink to the webpage from which the selected quote was identified.” The two earlier versions state that the quote may be presented to a searcher but does not mention in any way providing attribution to the source of the quote or information about it. Attribution now seems more important to Google, where the second first claim seemed to assume that information may be coming from YouTube.

 

The Newest Version of the Searching Quotes of Entities Patent

Systems and methods for searching quotes of entities using a database
Inventors: Eyal Segalis, Gal Chechik, Yossi Matias, Yaniv Leviathan, and Yoav Tzur
Assignee: GOOGLE LLC
US Patent: 11,250,052
Granted: February 15, 2022
Filed: December 26, 2018

Abstract

Systems and methods are provided for searching and identifying quotes in response to a query from a user.

Consistent with certain embodiments, systems and methods are provided for identifying one or more subject entities associated with the query and identifying, from a database or search results obtained in response to the query, a set of quotes corresponding to the one or more subject entities.

Further, systems and methods are provided for determining quote scores for the identified quotes based on at least one of the relationships of each quote to the one or more subject entities, the recency of each quote, and the popularity of each quote.

Additionally, systems and methods are provided for organizing the identified quotes in a rank order based on the quote scores and selecting quotes based on the rank order or the quote scores. In addition, systems and methods are provided for transmitting information to display the selected quotes on a display device.

 

 

 

Clustering Entities in Google SERPs Updated

The Clustering Entities Patent Is Updated


One of my latest blog posts was about Google clustering news results by topic in organic search results. Google has clustered information about entities in search results as well. If you now search for people who acted with Humphrey Bogart in Casablanca, you can see other actors in that movie in those search results. You can also see related questions that include those actors and the film (and the ontology of associated categories for the movie). This new post is about entity clustering and a change to how Google is delivering search results related to entity clustering.


Here is an example of search results that show connections between actors and the movie Casablanca:

[Screenshot: entity clustering in search results for Casablanca]

Google has a continuation patent granted January 4, 2022. I had written about an earlier version of that patent in 2019 in the post Entity Clustering in Google Search Results.

Claims From the First Patent

Since this new patent is a continuation patent, most of the patent is identical. The patent contains updated claims. The first claim from the 2019 version of the Clustering Search Results patent reads as follows:

1. A method comprising: determining items responsive to a query; generating first-level clusters of the items, each cluster representing an entity in a knowledge base and including items mapped to the entity; calculating a respective cluster score for each first-level cluster, wherein the respective cluster score for a first-level cluster is based on a respective silhouette score that measures coherence and separation of the first-level cluster and on a silhouette ratio representing a percentage of all first-level clusters having a respective silhouette score above a threshold; merging the first-level clusters based on entity ontology relationships and on respective cluster scores calculated for the merged clusters, wherein the respective cluster score of a merged cluster represents a better score than the respective cluster scores for first-level clusters included in the merged cluster; applying hierarchical clustering to the merged clusters, producing final clusters that maximize respective cluster scores for the hierarchical clustering; and providing the items responsive to the query for display according to the final clusters.
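
Since that claim leans on silhouette scores, here is a short sketch of computing them with scikit-learn; the data and cluster count are made up, and this only illustrates the silhouette idea (cluster coherence and separation), not Google's pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(42)
items = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(5, 1, (20, 8))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(items)

# Per-item silhouette values; a cluster's score can average its items'.
per_item = silhouette_samples(items, labels)
threshold = 0.5
cluster_means = [per_item[labels == c].mean() for c in set(labels)]
silhouette_ratio = np.mean([m > threshold for m in cluster_means])

print(f"mean silhouette: {silhouette_score(items, labels):.2f}, "
      f"ratio of clusters above threshold: {silhouette_ratio:.2f}")
```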

Claims From the Updated Patent

The post I wrote in 2019 describes the process behind the clustering entities patent in detail. Now, the new version of the patent from January 2022 has new language that tells us what the patent does. The first set of claims, from 2019, told us about a “silhouette score,” which is not in the new claims. The 2022 claims include some terms that aren’t in the 2019 version:

1. A method performed by a search engine comprising: determining a set of items responsive to a query; for each item of the set of items determined to be responsive to the query: identifying one or more entities associated with the item, and obtaining an embedding for the item; generating first-level clusters from the set of items, each cluster representing an entity of the one or more entities; producing final clusters by merging the first-level clusters based on entity ontological relationships and embedding similarities determined using the item embeddings, wherein the entity ontological relationships include hypernym, synonym, and co-hypernym; and providing items from the set of items responsive to the query for display according to the final clusters.

2. The method of claim 1, wherein first-level clusters that are smaller are merged first.

3. The method of claim 2, wherein merging the first-level clusters that are smaller includes, for a first first-level cluster: determining a second first-level cluster and a third first-level cluster related to the first first-level cluster based on the entity ontological relationships; determining that the third first-level cluster and the first first-level cluster are smaller than the second first-level cluster; and merging the first first-level cluster with the third first-level cluster.

4. The method of claim 1, wherein first-level clusters that are most similar are merged first.

5. The method of claim 4, wherein merging first clusters that are most similar first includes, for a first first-level cluster: determining a second first-level cluster and a third first-level cluster related to the first first-level cluster in the entity ontological relationships; determining that the first first-level cluster is more similar to the second first-level cluster than the third first-level cluster; and merging the first first-level cluster with the second first-level cluster.

The newer version tells us it includes “ontological relationships,” which the first set of claims doesn’t. So, we know from the SERPs that Bogart was in the movie “Casablanca,” as were many other actors who get featured in that search result.

Clustering search results
Inventors: Jilin Chen, Dai, Lichan Hong, Tianjiao Zhang, Huazhong Ning, and Ed Huai-Hsin Chi
Assignee: Google LLC
US Patent: 11,216,503
Granted: January 4, 2022
Filed: November 26, 2019

Abstract

Implementations provide an improved system for presenting search results based on entity associations of the search items. An example method includes generating first-level clusters of items responsive to a query, each cluster representing an entity in a knowledge base and including items mapped to the entity, merging the first-level clusters based on entity ontology relationships, applying hierarchical clustering to the merged clusters, producing final clusters, and initiating display of the items according to the final clusters. Another example method includes generating first-level clusters from items responsive to a query, each cluster representing an entity in a knowledge base and including items mapped to the entity, producing final clusters by merging the first-level clusters based on an entity ontology and an embedding space that is generated from an embedding model that uses the mapping, and initiating display of the items responsive to the query according to the final clusters.

If you travel back to my original writeup of this clustering entities patent from 2019, you will see that I mention “ontologies” many times when writing about entities. The 2022 version of the clustering entities patent adds that language directly to the claims. Those relationships appear in the SERPs without the searcher needing to ask about the relationship between the movie and its actors.

Clustering Entities and News

After this change, when we search for a specific entity and news, we also see clustered search results:

[Screenshot: clustered news results for an entity search]

So Google is no longer sorting SERPs based on how good a match documents are for query terms – Google is clustering topics and relationships between entities as part of its decision on what to include in search results.

Updated: Inferring Geographic Entity Locations in Queries

Where Do Entity Locations Appear in Queries?

Some search queries submitted by searchers refer to physical locations. The search engine can quickly provide information about entity locations, which may be in addition to the responsive search results. For example, in response to “Eiffel Tower Paris France,” the search engine could provide information about Paris in addition to the search results about the Eiffel Tower. But, the query “Eiffel Tower” does not specify a physical location.

Inferring Searcher’s Address By IP Address

One way to infer unspecified locations is to identify locations that are geographically near the searcher. In some instances, the geographic location of a searcher can get determined by the searcher’s IP address. A significant weakness in this approach is that it does not provide location information when a searcher seeks information about a distant location—for example, a searcher in Los Angeles seeking information about the Eiffel Tower.


Inferring Searcher’s Address by Lookup Table

Another way to infer unspecified or “missing” locations in a search query is to use a lookup table. When a lookup term in the table matches a term in a query, the corresponding location is presumed. A drawback to this method is that it does not provide a notion of the prominence of the search entities when there are multiple entities with the same or similar names. Another drawback is that the locations that may be inferred are limited to locations associated with preselected entities in the lookup table.

Deja Vu All Over Again

I was almost finished writing this post when I discovered that I had written about this patent almost a decade ago in the post How Google Might Use Query Logs to Find Entity Locations. The patent has changed: this one is a continuation patent with new claims, and it is worth comparing the old claims and the new ones to see how different they are and what has changed. After the patent and a summary of it, I compare the first claims from both versions.

Assigning Entity Locations with Websites

[Image: inferring entity locations]

Deficiencies and problems with assigning physical entity locations in a query get overcome by capitalizing on previously issued queries that refer to physical locations. Information from queries that explicitly or implicitly specify locations can help identify locations for other queries that do not specify locations. The same principle gets applied to websites to infer a physical location for an entity associated with the website.

The method includes computing a location-specific score that represents the likelihood that the respective location gets associated with a website.

The method consists of:

  • Computing a site confidence value representing a likelihood that the website gets associated with a physical location
  • Determining a location associated with the website using the location-specific scores and the site confidence value
  • Storing information indicating that the determined location gets associated with the website for subsequent use when processing respective search queries

A non-transitory computer storage medium stores programs that get executed by the processors of a computer system. The programs include instructions that, when executed by the processors of the computer system, perform the method of associating the locations with a website.

Associating the locations with a query from a client device gets performed by a server system. The server system includes the processors and memory storing the programs for execution by the processors.

The method comprises identifying a query, selecting a set of documents responsive to the identified query, and assigning weights to individual documents in the set of documents based, at least in part, on historical data of searcher clicks selecting search result links in search results produced for historical queries the same as the identified query.

The method further includes identifying respective websites hosting the documents in the set of documents. For each website, location-specific information gets retrieved for each location, including a location-specific score. That score corresponds to the likelihood that the respective location gets associated with an individual website of the identified websites.

The method also includes, for each location for which location-specific information gets retrieved, aggregating the location-specific scores, as weighted by the document weights, to compute a likelihood that the respective location gets associated with the query; and assigning a specific location to the query when predefined criteria get satisfied, the predefined criteria comprising a requirement that the aggregated likelihood for the particular location exceeds a first predefined value.
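
A minimal sketch of that aggregation and threshold test, with made-up documents, weights, and location scores (the threshold value is also an illustrative assumption):

```python
from collections import defaultdict

# Documents responsive to the query, weighted by click history.
documents = [
    {"weight": 0.6, "site": "eiffel-tickets.example"},
    {"weight": 0.3, "site": "paris-guide.example"},
    {"weight": 0.1, "site": "towers-worldwide.example"},
]
# Location-specific scores previously computed per website.
site_location_scores = {
    "eiffel-tickets.example":   {"Paris, FR": 0.9},
    "paris-guide.example":      {"Paris, FR": 0.8},
    "towers-worldwide.example": {"Paris, FR": 0.2, "Las Vegas, US": 0.5},
}

# Aggregate location scores, weighted by the document weights.
likelihood = defaultdict(float)
for doc in documents:
    for location, score in site_location_scores[doc["site"]].items():
        likelihood[location] += doc["weight"] * score

THRESHOLD = 0.5  # the "first predefined value"
best = max(likelihood, key=likelihood.get)
if likelihood[best] > THRESHOLD:
    print(f"assign {best} (aggregated likelihood {likelihood[best]:.2f})")
```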

A non-transitory computer storage medium stores the programs to get executed by processors of a computer system. The programs include instructions that, when executed by the computer system, perform the method of associating locations with a query.

Thus, methods, systems, and computer-readable storage media get provided that infer a physical location for a website or a search query using information from previously issued search queries.

Search engines can then provide information related to the appropriate physical location to the searcher, creating an enhanced, more efficient searcher interaction with the search engine.

Inferring geographic locations for entities appearing in search queries
Inventors: Sushrut Suresh Karanjkar, Viswanath Subramanian, and Shashidhar Anil Thakur
Assignee: Google LLC
US Patent: 11,176,181
Granted: November 16, 2021
Filed: October 7, 2019

Abstract

A server system associates locations with a query by identifying the query, selecting a set of documents responsive to the query, and assigning weights to individual documents in the set of documents based, at least in part, on historical data of searcher clicks selecting search result links in search results produced for historical queries substantially the same as the identified query. Websites hosting the selected documents get identified. For each website, location-specific information for the locations gets retrieved, including a location-specific score that corresponds to the likelihood that the respective location corresponds to an individual website.

For each respective location for which location-specific information gets retrieved, aggregating the location-specific scores, as weighted by the document weights, to compute an aggregated likelihood that the individual location gets associated with the query.

A specific location gets assigned to the query when predefined criteria get satisfied.

Comparing the Claims from the Different Versions of the Entity Locations Patents

From the 2012 patent: Inferring Geographic Locations for Entities Appearing in Search Queries We see a relatively short first claim:

1. A method for inferring locations associated with a website, performed by a server system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising: at the server system: identifying a website; for each respective location of a plurality of locations referenced in queries with respective result sets that comprise search result links to documents hosted at the website, computing a location-specific score representing the likelihood that the respective location is associated with the website; computing a site confidence value representing a likelihood that the website is associated with a physical location; determining a location associated with the website using the location-specific scores and the site confidence value; storing information indicating that the determined location is associated with the website for subsequent use when processing respective search queries.

From the 2021 patent: Inferring geographic locations for entities appearing in search queries, here is a much longer claim that mentions using historical data and user click information:

1. A computer-implemented method, comprising:

Receiving, by a computing system comprising one or more computing devices, a search query;

Identifying, by the computing system, a set of documents that are responsive to the search query;

Assigning, by the computing system, a respective weight to each respective document of the set of documents based, at least in part, on historical data, wherein the historical data is indicative of a number of user clicks on one or more search result links corresponding to one or more historical documents produced for historical queries that are substantially the same as the search query;

Identifying, by the computing system, a plurality of websites comprising a respective website for each respective document of the set of documents;

Identifying, by the computing system, one or more location scores associated with each respective document of the set of documents based on each respective website of the plurality of websites, wherein each respective location score of the one or more location scores is indicative of a likelihood that a respective geographic location is associated with the respective document, wherein the one or more location scores comprise one or more location-specific scores for each respective website of the plurality of websites, wherein each respective location-specific score of the one or more location-specific scores corresponds to a respective geographic location, wherein the respective location-specific score is predetermined based, at least in part, on a site click count indicative of a number of user clicks on one or more respective search results of the one or more search results produced for historical queries that are substantially the same as the search query, wherein the one or more respective search results are associated with one or more documents hosted at the respective web site;

Determining, by the computing system, a query score for each respective document of the set of documents based, at least in part, on the respective weight and the one or more location scores associated with the respective document;

Determining, by the computing system, a geographic location associated with the search query based, at least in part, on the query score for each respective document of the set of documents; and providing for display, by the computing system, a subset of the set of documents and information associated with the geographic location.

Entity Locations Patents Conclusions

The much longer, more modern patent claim shows that Google is looking at more data, including historical data and user click data. Google has evolved how it finds entity locations for queries. This is what I would expect to see ten years later, and it is what we are getting.

Detecting Brand Penetration Over Geographic Locations https://gofishdigital.com/blog/detecting-brand-penetration-over-geographic-locations/ https://gofishdigital.com/blog/detecting-brand-penetration-over-geographic-locations/#comments Wed, 08 Sep 2021 20:39:56 +0000 https://gofishdigital.com/detecting-brand-penetration-over-geographic-locations/ A Brand Penetration System To Generate Indices Using Brand Detections From Geo-Located Images And Corresponding Locations It looks like when Google Gets into Brands; It does it in a big way – trying to identify all the Brands it can This patent relates to determining measures of brand penetration over geographic regions. The patent is […]

Detecting Brand Penetration Over Geographic Locations is an original blog post first published on Go Fish Digital.

]]>
A Brand Penetration System To Generate Indices Using Brand Detections From Geo-Located Images And Corresponding Locations

It looks like when Google gets into brands, it does it in a big way – trying to identify all the brands it can

brand penetration determination system

This patent relates to determining measures of brand penetration over geographic regions. The patent is about a brand penetration determination system and methods to generate indices based on many brand detections from many geo-located images and corresponding locations within various partitioned sub-regions.


Image content analysis engines have been developed and deployed to detect many objects and entities. Data obtained from these engines can be processed for later retrieval and analysis, spanning many applications and carrying a heavy computational load. As such, more technology is needed to provide valuable data associated with analyzed images and related content while minimizing the costs of storing such data, including, for example, the amount of computer memory required to keep such data.

The Advantages of The Brand Penetration Patent

This brand penetration patent:

  1. Determines brand penetration across a geographic area, which includes splitting, by computers, a geographic area into two or more sub-regions
  2. Determines, from images captured at sites within each sub-region, a number of detections of a brand within each respective sub-region
  3. Generates a brand penetration index for each sub-region by the computers. The brand penetration index is based on the number of detections in the respective sub-region
  4. Stores the brand penetration index for each sub-region in memory in association with an indicator of the respective sub-region
  5. Includes a geographic sub-region determination system configured to partition a geographic area into two or more sub-regions
  6. Includes an image content analysis engine configured to determine, from images captured at sites within each sub-region, a number of detections of a brand within each respective sub-region

The Brand Penetration Index

The brand penetration index is based on the number of detections in the respective sub-region weighted by a population factor based on a population within the sub-region or a category factor based on a category of goods associated with the brand.

The geographic sub-region determination system can determine, in splitting the geographic area into two or more sub-regions, the number of sub-regions and the boundaries of each sub-region to ensure that the population within each sub-region is above a threshold.

The computer also includes tangible, computer-readable media configured to store the brand penetration index for each sub-region associated with an indicator of the respective sub-region.

The operations also include:

  • Splitting a geographic area into two or more sub-regions
  • Determining from images captured at sites within each sub-region, many detections of a brand within each respective sub-region
  • Generating a brand penetration index for each sub-region – based on the number of detections of the brand in the respective sub-region
  • Storing the brand penetration index for each sub-region in memory that is associated with an indicator of the respective sub-region
  • Deciding on an electronic content item associated with the brand, based at least in part on the brand penetration index for a given sub-region of the two or more sub-regions. That electronic content item is configured for delivery to and display on an electronic device associated with the given sub-region (a minimal sketch of these operations follows below)
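Pulling those operations together, here is a minimal sketch in Python. The helper names and data shapes are invented for illustration; the detection function in particular is a stand-in for the patent's image content analysis engine.

```python
# A minimal sketch, with invented helper names and data shapes, of the
# operations listed above: count brand detections per sub-region from
# captured imagery, then store each count with its sub-region indicator.
def brand_detection_counts(images_by_subregion, detect_brands):
    """images_by_subregion: {subregion_id: [image, ...]};
    detect_brands(image) -> set of brand names found in the image."""
    counts = {}
    for subregion, images in images_by_subregion.items():
        counts[subregion] = sum(1 for img in images for _ in detect_brands(img))
    return counts

def store_indices(counts, database):
    for subregion, count in counts.items():
        database[subregion] = count  # index stored with its sub-region indicator

db = {}
store_indices(brand_detection_counts(
    {"cell_a": ["img1", "img2"]}, lambda img: {"BrandX"}), db)
print(db)  # {'cell_a': 2}
```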

The Brand Penetration patent is at:

Brand penetration determination system using image semantic content
Inventors: Yan Mayster, Brian Edmond Brewington, and Rick Inoue
Assignee: Google LLC (Mountain View, CA)
US Patent: 11,107,099
Granted: August 31, 2021
Filed: July 12, 2019

Abstract

Example embodiments of the disclosed technology implement a brand penetration determination system using image semantic content. A geographic sub-region determination system is configured to partition a geographic area into two or more sub-regions.

An image content analysis engine is configured to determine, from images captured at sites within each sub-region, a number of detections of a brand within each respective sub-region.

A brand penetration index generation system is configured to generate a brand penetration index for each sub-region based on the number of detections of the brand in the respective sub-region, weighted by factors (e.g., population factor, category factor, etc.), which is stored in memory with an indicator of each respective sub-region.

In splitting the geographic area into two or more sub-regions, the number and boundaries of sub-regions are determined to ensure that the population within each sub-region is above a threshold.

Google May Use An Image Content Analysis Model to Detect a Brand

How might Google decide whether there is a brand associated with a geographic region?

  1. It can use an image content analysis model to determine brand detections from many geo-located images
  2. Brand penetration indices and related measures can then be generated based on brand detections and corresponding partitioned locations.
  3. A brand penetration determination system can be configured to determine a distribution of two or more discretized sub-regions within a geographic region.
  4. An associated measure of brand penetration can be determined. Sub-region numbers and boundaries can be set to ensure that a population within each sub-region is above a threshold.

Why Population is above Thresholds for Sub-Regions

By ensuring that the population of each sub-region is above that threshold, the computational burden imposed in storing statistically relevant brand indices for each sub-region is reduced. Besides, by ensuring that only brand indices for sub-regions having a statistically significant number of inhabitants get stored, the disclosed systems and methods can provide valuable data associated with an imagery corpus while simultaneously minimizing the costs of holding that data in memory.

This technology can allow a user to make an election about whether systems, programs, or features described herein may enable the collection of user information. That means specific information about a user's current location, social network, social actions or activities, profession, preferences, or other user-specific features. It also covers images from computers associated with a user, and controls over data indicating whether a user gets sent content or communications from a server.

Personally Identifiable Information

Besides, specific data may be treated in certain ways before it is stored or used so that personally identifiable information is removed. For example, a user's identity may be treated so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level) so that a particular location of a user cannot be determined.

Thus, the user may control what information gets collected, how that information is used, and what information is provided. In other examples, images and content determined from such photos under the disclosed techniques can be treated to ensure that personally identifiable information, such as images of people, street names and numbers of residences, and other personal information, gets removed.

Determining A Measure Of Brand Penetration Across A Geographic Region

A computer comprising processors can help implement aspects of the technology, including a brand penetration determination system. In general, the brand penetration determination system can be configured to determine a measure of brand penetration across a geographic region. The brand penetration determination system can include:

    • A geographic sub-region determination system
    • An image content analysis engine
    • A brand penetration index generation system

 

How A Geographic Sub-Region Determination System Works

A geographic sub-region determination system can split a geographic area into sub-regions.

Sub-regions can correspond, for example, to discretized cells within the geographic region that can have predetermined or dynamically determined sizes and boundaries based on desired measures of brand penetration.

The brand penetration determination system can be configured to divide the geographic area into two or more sub-regions, with each sub-region's boundaries set according to a particular cell size and a particular geographic partitioning.

For example, partitioning a geographic area into sub-regions could correspond to implementing a grid imposed over the geographic area.

Each cell in the grid can be characterized by a given shape, such as a square, rectangle, circle, pentagon, hexagon, or another polygon.

Dimensions can characterize each grid cell, such as a width, length, height, or diameter. These can have a predetermined or dynamically determined distance, measured in meters, kilometers, miles, or another suitable unit. A grid of cells corresponding to respective sub-regions can be uniform in size, while in other embodiments the cells can vary in size.

Configuring the Geographic Sub-Region Determination System

The geographic sub-region determination system can get configured to determine, in splitting the geographic area into two or more sub-regions, the number of sub-regions and the boundaries of each sub-region to ensure that the population within each sub-region is above a threshold.

When sub-region boundaries are uniform in size (e.g., as in the form of a uniform grid of cells) or when they become predetermined in size (e.g., corresponding to predetermined geographic partitions such as those corresponding to zip codes, neighborhood boundaries, town/district boundaries, state boundaries, country boundaries, etc.), the number of sub-regions can get reduced by excluding sub-regions whose population does not exceed the threshold.

The population here can correspond to many people within a given geographic area, such as census data or other predefined databases associated with a geographic area.

In this example, when ensuring that the population within each sub-region is above a threshold, the geographic sub-region determination system can determine the number and boundaries of each sub-region.

Each sub-region has a population defined by a threshold corresponding to a number (x) of people in each sub-region.
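As a toy example, that threshold check might look like the following sketch; the cell populations and the threshold value are invented for illustration, not values from the patent.

```python
# A minimal sketch, assuming gridded census-style counts, of keeping only
# sub-regions whose population clears a threshold, as described for
# predetermined partitions such as zip codes or uniform grid cells.
def filter_subregions(cell_populations, threshold):
    """cell_populations: {cell_id: population}. Returns cells kept for indexing."""
    return {cell: pop for cell, pop in cell_populations.items() if pop > threshold}

cells = {"cell_a": 12000, "cell_b": 300, "cell_c": 45000}
print(filter_subregions(cells, threshold=1000))  # cell_b is excluded
```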

Why Have A Geographic Sub-Region Determination System?

Referring to the geographic sub-region determination system, the population described can correspond to a determined number of goods associated with people in a given geographic area, such as homes, vehicles, businesses, electronic devices, or particular categories or subsets of such goods.

When ensuring that the population within each sub-region is above a threshold, the geographic sub-region determination system can determine the number and boundaries of each sub-region such that each sub-region has a population defined by a threshold corresponding to a number (y) of goods (e.g., homes, vehicles, businesses, electronic devices, etc.) in each sub-region.

The population as described here can correspond to a determined number of images obtained for sites within a particular geographic area and the number of detections of goods, entities, or the like within such images. In this example, when ensuring that the population within each sub-region is above a threshold, the geographic sub-region determination system can determine the number and boundaries of each sub-region such that each sub-region has a population defined by a threshold corresponding to a number (z) of images and detected objects within the photos in each sub-region.

By partitioning a geographic area into sub-regions whose size is dynamically determined based on population density (or the corresponding density of brand detections), the likelihood of including cells corresponding to sub-regions having little to no population or brand detections is reduced.

This gets accomplished at least in part by avoiding a scenario in which many sub-regions have associated brand indices but where each sub-region has such a small number of inhabitants that the data from those sub-regions are not statistically relevant.

Similarly, the size of cells within sub-regions with a more significant number of brand detections can become determined to help ensure an appropriate size to maintain distinctions within a distribution level of brand detections.

This can help ensure that meaningful brand penetration measures can get determined for a geographic area. As such, cell size can vary based on different regions. For example, cell size can get smaller in urban locations and progressively larger when transitioning from urban areas into suburban areas and rural areas.

An image content analysis engine can be configured to determine, from images captured at sites within each sub-region, a number of detections of a brand within each respective sub-region. The images captured at sites can include a substantially extensive collection of photos. The collection of images can have been captured at locations within each sub-region by a camera operating at street level.

For example, the camera operating at street level can become mounted on a vehicle and configured to collect images while the car is traversing street locations within a geographic region.

The collection of images can include, for example, many different photographic images and sequence(s) of images from a video. Such photos get geotagged with geographic identifiers descriptive of a geographic location associated with the camera when it captured each image.

The image content analysis engine can store, include, or access machine-learned image content analysis models. For example, the image content analysis models can include various machine-learned models such as neural networks (e.g., feed-forward, recurrent, convolutional, etc.), other multi-layer non-linear models, regression-based models, or the like.

The Machine-Learned Image Content Analysis Model(s) Can Detect Text and Logos Associated With A Brand

The machine-learned image content analysis model(s) can:

      • Detect text and logos associated with a brand within each image of a collection of images related to the geographic area
      • Apply text transcription identification and logo matching. For example, transcribed text and logos can be compared to text and logo options identified in a predetermined dataset (e.g., text and logos associated with a particular type of label, such as the names of vehicle makes)
      • Associate the brand with a particular category of goods (e.g., vehicles, business entity types, vendor payment types, apparel, shoes, etc.). More particularly, in some examples, the image content analysis engine can be configured to determine each brand detection from the images captured at the one site from an entity storefront appearing within the images

Brand Detection In Entity Storefront Examples

This section reminded me of storefront images from Street View cameras.

The patent tells us that brand detection in entity storefronts can include, but is not limited to, a brand name associated with the entity itself. It then provides examples such as:

      • Types of fast food stores
      • Convenience stores
      • Gas stations
      • Grocery stores
      • Pharmacies
      • Other types of business entities.

It can also look for vendor payment types associated with the entity (e.g., detecting names and logos indicating that the entity will accept payment from respective credit card companies or the like).

Other Sources of Brand Detection

Vehicles – the image content analysis engine can determine the brand from images of a car captured at the sites.
Billboards – the image content analysis engine can determine the brand from billboard(s) in images captured at the sites.

The subject systems and methods can be applied to many other examples of images and brands while remaining within the spirit and scope of the disclosed technology.

Referring to example aspects of an image content analysis engine, the machine-learned image content analysis model can be configured to receive each image in the collection of images as input. In response, the model can generate output data associated with detections of brands appearing within the images. For example, the model can be configured to generate a bounding box associated with each detected brand. Output data can also include labels associated with each detected brand. Each label provides a semantic tag associated with some aspect of the brand, such as the brand name and the goods associated with the brand.

 

For a vehicle within an image, labels associated with the vehicle could include a vehicle model label (e.g., Camry), a vehicle make label (e.g., Toyota), a vehicle class label (e.g., sedan), a vehicle color label (e.g., white), or other identifiers.

 

The output data can also include a confidence score descriptive of the probability that the detected brand is correctly detected within that bounding box. Such output data can be aggregated over the collection of images to determine a count of the number of detections of the brand within the geographic region or within specific sub-regions.
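Here is a sketch of what that aggregation step might look like; the detection record format and the 0.7 confidence cutoff are my assumptions for illustration, not values from the patent.

```python
# A sketch of aggregating per-image model output (bounding box, label,
# confidence) into per-sub-region brand counts, dropping low-confidence boxes.
from collections import Counter

def count_brand_detections(detections, min_confidence=0.7):
    """detections: iterable of dicts like
    {'brand': 'Toyota', 'subregion': 'cell_a', 'bbox': (0, 0, 50, 30), 'confidence': 0.92}."""
    counts = Counter()
    for det in detections:
        if det["confidence"] >= min_confidence:  # keep only confident detections
            counts[(det["subregion"], det["brand"])] += 1
    return counts

dets = [
    {"brand": "Toyota", "subregion": "cell_a", "bbox": (0, 0, 50, 30), "confidence": 0.92},
    {"brand": "Toyota", "subregion": "cell_a", "bbox": (5, 8, 40, 25), "confidence": 0.41},
]
print(count_brand_detections(dets))  # Counter({('cell_a', 'Toyota'): 1})
```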

Detections of Brands in Different Sub-Regions

Advertisements appear in many places, and products often have logos identifying them. This patent tells us that it is actively looking for those.

A brand penetration index generation system can be configured to generate a brand penetration index for each sub-region. The brand penetration index is based on the number of detections of the brand in the respective sub-region. A count descriptive of the number of detections of the brand can be weighted by factors including, but not limited to: a population factor based on a population within the sub-region; a category factor based on a category of goods associated with the brand; and a source factor based on the number of source locations for a brand within a sub-region, such as dealerships or stores.

Brand Capacity and Brand Saturation

Brand capacity is the total number of all brand detections in an area (or a maximum number of possible detections based on the population within an area).

Brand saturation is an amount or index of detections of similar brands in the area.

The brand penetration index can be determined as representing brand prominence in a particular category. This means, for example, the number of detections of a vehicle make/model within a category such as sedans, luxury cars, or all cars.
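As arithmetic, the weighting and the saturation idea might look something like the sketch below. The per-capita form of the population factor, the category weight, and all the numbers are my assumptions; the patent only says the count is weighted by such factors.

```python
# A sketch of a weighted brand penetration index and a simple stand-in for
# brand saturation (share of detections among similar brands in a category).
def brand_penetration_index(count, population, category_weight=1.0):
    population_factor = 1.0 / max(population, 1)  # assumed per-capita normalization
    return count * population_factor * category_weight

def brand_saturation(brand_count, similar_brand_counts):
    capacity = brand_count + sum(similar_brand_counts)  # all detections in the category
    return brand_count / capacity if capacity else 0.0

print(brand_penetration_index(count=250, population=50000))  # 0.005 detections per person
print(brand_saturation(250, [400, 350]))                     # 0.25 share of similar brands
```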

Refining Detections in the Brand Penetration Index

The brand penetration index generation system can become configured to refine detections by de-duplicating multiple detections associated with a distinct geographic location.

That refining process can help increase accuracy and usefulness within the disclosed techniques while making the disclosed systems and methods more immune to potential disparities and differences associated with a large imagery corpus for determining brand detections.

For example, refining can help reduce potential bias from some portions of a geographic area being more prominently represented than others.

Potential Disparities of Brand Detections

Potential disparities of brand detections can arise because of:

      • Differences in the total number of images available at different locations, such as images affected by the speed of the operator (vehicle or human) that took the photos
      • Inconsistencies in the times, circumstances, and weather patterns when the images were taken
      • The visibility viewshed of each brand object

The patent tells us that these problems can become consistently solved by:

      • Refining brand detection data using knowledge of the operator/vehicle routes
      • Knowing when the images were taken
      • Using geolocation and pose information from each image
      • Using the detection box within an image for each detection

Then, it becomes possible to de-duplicate the obtained detections. Each detection box can be associated with a well-defined real-world location, and the number of distinct locations associated with detections within a sub-region can be counted.
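A toy version of that de-duplication step follows. Snapping coordinates to a coarse grid by rounding is my own stand-in for resolving detection boxes to distinct real-world locations; the precision value is invented.

```python
# A sketch of de-duplication: many detection boxes can resolve to the same
# real-world spot, so distinct locations are counted after rounding
# coordinates to a coarse precision.
def count_distinct_brand_locations(detections, precision=4):
    """detections: iterable of (lat, lon) positions resolved for each detection box."""
    distinct = {(round(lat, precision), round(lon, precision)) for lat, lon in detections}
    return len(distinct)

points = [(35.77959, -78.63818), (35.77959, -78.63817), (35.78210, -78.64001)]
print(count_distinct_brand_locations(points))  # 2 -- nearby duplicates collapse
```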

Problems with Some Brand Occurrences

Another possible concern about some brand occurrences, especially for vehicles, is that they may not necessarily get associated with the people living in a given geographic area.

This is not expected to become a significant source of error as most travel is local and should dominate the corpus of obtained imagery detections.

But, in some scenarios, it may become desirable to include all detections, whether based on people/homes/vehicles/etc. that are local to the area or simply traveling through it.

Storing The Brand Penetration Index And Tracking Changes Over Time

According to another aspect of the present disclosure, the computer can be configured to store the brand penetration index generated for each sub-region in a regional brand penetration database. The regional brand penetration database can correspond, for example, to a memory including tangible, non-transitory, computer-readable media, or other suitable computer program product(s). The brand penetration index for each sub-region can be stored in the regional brand penetration database in association with an indicator of the respective sub-region. The brand penetration indices for a plurality of sub-regions can be ordered within the database, with the sub-regions or corresponding indices stored in a manner indicating the most dominated to least dominated sub-regions (or vice versa) according to measures of brand penetration.

The brand penetration index generation system can track how index values stored within the regional brand penetration database change over time. For example, the system can be configured to determine a shift factor indicative of dynamic shifts in each sub-region's brand penetration index over time.

The system can be configured to generate flags, notifications, and automated system adjustments when a determined shift factor exceeds predetermined threshold levels. These shift factors and associated notifications can be used to help identify successful brand penetration, or areas in which more targeted brand penetration is desired.
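A small sketch of that tracking logic follows. The relative-change definition of the shift factor and the 0.25 alert threshold are assumptions for illustration; the patent does not spell out the formula.

```python
# A sketch of tracking index shifts over time and flagging large moves.
def shift_factor(previous_index, current_index):
    if previous_index == 0:
        return float("inf") if current_index else 0.0
    return (current_index - previous_index) / previous_index

def flag_shifts(history, threshold=0.25):
    """history: {subregion: [index_t0, index_t1, ...]}; flags large moves."""
    return [sub for sub, series in history.items()
            if len(series) >= 2 and abs(shift_factor(series[-2], series[-1])) > threshold]

print(flag_shifts({"cell_a": [0.004, 0.006], "cell_b": [0.010, 0.011]}))  # ['cell_a']
```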

A Targeted Advertisement System Coupled With the Regional Brand Penetration Database

The computer can also include a targeted advertisement system coupled with the regional brand penetration database and an electronic content database.

This targeted advertisement system can become configured to determine an electronic content item from the electronic content database.

This electronic content item can get associated with the brand based at least in part on the brand penetration index for a given sub-region of the two or more sub-regions.

The electronic content item can be configured for delivery to and display on an electronic device associated with the given sub-region. For example, an electronic device associated with a sub-region could correspond to an electronic device operating with an IP address or other identifier associated with a physical address located within the sub-region.

In other examples, an electronic device associated with a sub-region could correspond to an electronic device owned by a user living in the sub-region or currently operating in the sub-region. This may become the case with mobile computers or the like.

Serving Brand Content To Users In A Physical Manner

The computer can also include reverse geocoding features that can help determine physical addresses for serving brand content to users in a physical manner instead of an electronic manner.

The brand penetration determination system and targeted advertisement system can more particularly include a reverse geocoding system configured to map cells corresponding to sub-regions within a geographic area to physical addresses within those cells/sub-regions.

The reverse geocoding system can leverage databases and systems that map various geographic coordinates (e.g., latitude and longitude values) associated with cells/sub-regions to physical addresses within or otherwise associated with those areas.

These physical addresses can be stored in memory (e.g., in the regional brand penetration database) along with the respective cell/sub-region indicators and corresponding brand penetration indices.

Measures of brand penetration determined for particular cells/sub-regions can then be used to determine content items, such as targeted advertisement mailings, for selected delivery to physical addresses within those cells/sub-regions.

By utilizing the disclosed brand penetration index in dynamically determining or adjusting electronic content delivered to users, advertisers can more accurately target products to the most suitable audience.

More particularly, electronic content can become strategically determined for delivery to users within geographic areas having a below-average penetration of a particular brand and above-average penetration of competing brands.

Further analytics implemented by a targeted advertisement system can determine penetration measures across variables, including various threshold levels of brand penetration.

Brand Penetration Tracking

The systems and methods described here may provide many technical effects and benefits. For instance, a computer can include a brand penetration determination system that generates meaningful object detection data associated with a large corpus of collected imagery. More particularly, the detection of brands associated with goods, services, and the like within images can become correlated to statistically relevant measures of brand indices representative of brand capacity, prominence, brand saturation, etc., in a computationally workable manner.

Besides, these measures of brand penetration can become advantageously tracked and aggregated over space. Such as various geographic regions. It can also be tracked over time. These can be various windows–times of day, days of the week, months of the year, etc. They can determine alternative data measures.

A further technical benefit of the disclosed technology concerns integrating a geographic sub-region determination system, which can be configured to determine sub-region numbers and boundaries within a geographic area to ensure that the population within each sub-region is above a threshold. By ensuring that the population of each sub-region is above that threshold, the computational burden imposed in storing statistically relevant brand indices for each sub-region is reduced.

In addition, by ensuring that only brand indices for sub-regions having a statistically meaningful number of inhabitants get stored, the disclosed systems and methods can provide useful data associated with an imagery corpus while simultaneously minimizing the costs of storing that data in memory. In this way, the disclosed technology can achieve specific improvements in computing technology.

Integration Within A Targeted Advertisement System

A further technical effect and benefit of the disclosed technology can get realized when the disclosed technology becomes integrated within the application of a targeted advertisement system. Advertisers are commonly faced with determining how to accurately target product(s) and services to the most suitable audience.

This problem can involve many variables, where various demographics, cultural differences, infrastructure, and other considerations come into play. By providing systems and methods for generating computationally efficient and meaningful measures of geographic partitioning and corresponding brand penetration, in combination with other demographic indicators, it becomes possible to dynamically develop advertising strategies for targeted delivery of electronic content to consumers.


Brand Penetration Conclusion

The patent provides more details about this tracking of brands in different geographic areas. It potentially means a lot of counting and tracking in the physical world. A lot of this can be done through image analysis using programs such as Street View. It's interesting to know that Google might have a good idea of where all of the brands are in the future, and know things such as brand capacity and brand saturation in different geographic areas. Will Google know brands this well at some point in the future?

In 2020, I wrote about how Google might track product lines more closely on the Web in the post Google Product Search and Learning about New Product Lines. There was a point in the past when Google did not seem to pay much attention to brands. With that earlier patent on product lines and this one on brand penetration, that has the potential to change in a big way, really fast.

Google is showing us that it is using technologies such as image recognition and Street View cameras to learn more about the world around us. This is also seen in a recent post:

An Image Content Analysis And A Geo-Semantic Index For Recommendations

Detecting Brand Penetration Over Geographic Locations is an original blog post first published on Go Fish Digital.

]]>
https://gofishdigital.com/blog/detecting-brand-penetration-over-geographic-locations/feed/ 1
Natural Language Queries in Spreadsheets at Google https://gofishdigital.com/blog/natural-language-queries/ https://gofishdigital.com/blog/natural-language-queries/#respond Wed, 12 May 2021 19:07:04 +0000 https://gofishdigital.com/natural-language-queries/ Answering Natural Language Queries From Spreadsheets Imagine a site owner shares information on the web using spreadsheets or databases. I wrote a related post at Natural Language Query Responses. We see Google with patents focused on answering natural language queries using data tables. Today’s blog post is about answering questions about spreadsheets or databases written […]

Natural Language Queries in Spreadsheets at Google is an original blog post first published on Go Fish Digital.

]]>
Answering Natural Language Queries From Spreadsheets

Imagine a site owner shares information on the web using spreadsheets or databases. I wrote a related post at Natural Language Query Responses. We see Google with patents focused on answering natural language queries using data tables. Today’s blog post is about answering questions about spreadsheets or databases written using natural language.

Data Tables Natural Language Processing

A searcher may want to access that information combined in many ways. This is something that Google has patented. I will go over this patent granted in May 2021 and what it says.


How The Search Engine Might Make It Easier For a Searcher to Use Data Tables in Spreadsheets

The patent starts with data tables in spreadsheets and then shows how the search engine might make it easier for a searcher to use that information.

A spreadsheet is a data document that includes one or more data tables storing data under different categories.

Sometimes the spreadsheet may perform calculation functions. For example, a searcher may want certain data from the spreadsheet. The searcher can then construct a database search query to look for the data they want.

Sometimes, the spreadsheet may not store the searcher-desired data set. The searcher may use available data from the spreadsheet to derive the desired data.

The searcher may review the spreadsheet identifying relevant data entries in the spreadsheet to compile a formula using the calculation function associated with the spreadsheet to calculate the result.

For example, when the spreadsheet records a test score for each student in a class, a searcher may want to know the average score of the class.

Then the searcher may need to compile a formula by summing the test scores and then dividing by the number of students to find the average score of the class.

The data table may then calculate the average score of the class based on the compiled formula.

Thus the searcher may need to compile a formula and input it into the data table for calculation. That may be inefficient when processing a large amount of data. It also requires a high level of knowledge of database operations from the searcher.

The Patent Covers Systems and Methods to Process Natural Language Queries on Data Tables

This granted patent covers systems and methods to process natural language queries on data tables, such as spreadsheets.

According to the patent, natural language queries may originate from a searcher.

From a natural language query, a query term may be obtained, along with a grid range from a data table relevant to the query term.

A table summary may include data entities based on the grid range.

A logic operation may then be applied to the data entities to derive the query term.

The logic operation may be translated into a formula executable on the data table. Finally, the formula can be applied to the data table to generate a result in response to the natural language query.
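As a rough illustration of that flow, here is a runnable toy version in Python. The header matching and the two logic operations are stand-ins I invented, not the patent's actual modules.

```python
# A minimal runnable sketch of the claimed flow: obtain a query term, locate
# a relevant grid range (here, a matching column), apply a logic operation,
# and return the result that a translated formula would produce.
def process_query(query_term, table):
    # "Grid range": the column whose header appears in the query term.
    header = next((h for h in table if h in query_term), None)
    if header is None:
        raise ValueError("no relevant grid range found")
    values = table[header]
    logic = {"average": lambda v: sum(v) / len(v),  # logic operations that
             "total":   lambda v: sum(v)}           # derive the query term
    op = next((op for word, op in logic.items() if word in query_term), None)
    if op is None:
        raise ValueError("no logic operation for this query term")
    return op(values)

table = {"score": [90, 80, 85], "student": ["A", "B", "C"]}
print(process_query("average score", table))  # 85.0
```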

Natural language queries can come through a searcher interface at a computer, entered manually or vocally by a searcher.

Those queries can also go to servers from computers through HTTP requests.

Those Natural Language Queries Can Originate In a First Language

Natural language queries can originate in a first language (e.g., non-English, etc.) and then be translated into a second language, e.g., English, for processing.

A grid range may be identified at the computer, when the data table is at the computer, or at a server after receiving the natural language query, when the data table is at the server.

The data table includes any data table stored at a computer, a remote server, or a cloud.

The number of data entities includes any of dimensions, dimension filters, and metrics.

The result is shown to the searcher via a visualization format, such as an answer statement, a chart, or a data plot.

The search engine may get searcher feedback after the result gets provided to the searcher. For example, the formula gets associated with the natural language query or the query term when the searcher feedback is positive.

When the searcher feedback is negative, an alternative interpretation of natural language queries may be provided. Alternative results may be provided based on the alternative interpretation.

The Patent for Processing a Natural Language Query in Data Tables

Systems and methods for processing a natural language query in data tables
Inventors: Nikunj Agrawal, Mukund Sundararajan, Shrikant Ravindra Shanbhag, Kedar Dhamdhere, Garima, Kevin Snow McCurley, Rohit Ananthakrishna, Daniel Adam Gundrum, Juyun June Song, and Rifat Ralfi Nahmias
Assignee: Google LLC
US Patent: 10,997,227
Granted: May 4, 2021
Filed: January 18, 2017

Abstract

Systems and methods disclosed herein for processing a natural language query on data tables.

According to some embodiments, a natural language query may come from a searcher via a searcher interface.

A query term may be obtained from the natural language query, and a grid range from a data table may be identified as relevant to the query term.

A table summary may include many data entities based on the grid range.

A logic operation may apply to the plurality of data entities to derive the query term.

The logic operation may translate into a formula executable on the data table. The formula applied to the data table will generate a result in response to the natural language query.

To provide an overall understanding of the disclosure, certain illustrative embodiments will include systems and methods for connecting with remote databases.

In particular, there is a connection between an application and a remote database.

The application modifies the data format imported from the remote database before displaying the modified data to the searcher.

But a person of ordinary skill in the art will understand that the systems and methods described herein may be adapted and modified as appropriate for the application, may be used in other suitable applications, and that such additions and modifications will not depart from the scope thereof.

Generally, the computerized systems described herein may comprise one or more engines. Those will include a processing device or devices, such as a computer, microprocessor, logic device, or other device or processor that works with hardware, firmware, and software to carry out one or more of the computerized methods described herein.

A Searcher Can Enter a Query For Data in Natural Language

Systems and methods for processing a natural language query allow a searcher to enter a query for data in natural language.

The natural language query may be translated into a structured database query. When the structured database query indicates the data is not readily available in the data table, existing data entries in the data table that may be relevant to generating the desired data may be identified. Then, a formula may be automatically compiled to derive the desired data based on the available data entries.

For example, when the data source includes a spreadsheet that records a test score for each student in a class, a searcher may input a natural language query. That might be, “what is the average score of the class?”

The natural language query may be interpreted and get parsed by extracting terms from the query, such as “what,” “is,” “the,” “average,” “score,” “of,” “the,” and “class.”

Among the extracted terms, the term “average score” may be a key term of the query based on previously stored key terms that are commonly used.

It may then be seen that no data entry is available in the spreadsheet corresponding to the data category “average score,” e.g., no column header corresponds to “average score.”

Logic may derive an “average score” from the existing data entries.

For example, an "average score" may be derived by summing up all the class test scores and dividing the sum by the total number of students.

A formula may then be automatically generated to calculate the “average score” and output the calculation result to the searcher in response to the natural language query.

The generated formula may work in association with a tag “average score” such that even when the spreadsheet updates with more data entries, e.g., with new test scores associated with additional students, the formula may still be applicable to automatically calculate an average score of the class, in response to the natural language query.
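To make that concrete, here is a small illustration of why a tagged formula keeps working as rows are appended: it can be regenerated against the current extent of the score column rather than fixed cell addresses. The formula shape and column layout are my assumptions, not the patent's.

```python
# A sketch of regenerating an "average score" formula as the sheet grows.
def average_score_formula(num_students, column="B", first_row=2):
    last_row = first_row + num_students - 1
    return (f"=SUM({column}{first_row}:{column}{last_row})"
            f"/COUNT({column}{first_row}:{column}{last_row})")

print(average_score_formula(30))  # =SUM(B2:B31)/COUNT(B2:B31)
print(average_score_formula(35))  # recomputed after new students are added
```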

How A Searcher May Get an Answer About Their Data

A searcher may get an answer about their data faster and more efficiently than by manually entering formulas or doing other forms of analysis by hand.

For searchers who may not know all the spreadsheet features, the platform may help the searcher generate structured queries or even formulas.

The system includes a server, two remote databases (generally referred to as a remote database), searcher devices (generally referred to as a searcher device), and/or other related entities that communicate with one another over a network. The searcher devices contain searcher interfaces (generally referred to as searcher interfaces) respectively.

Each searcher device includes a device such as a personal computer, a laptop computer, a tablet, a smartphone, a personal digital assistant, or any other suitable type of computer or communication device.

Searchers at the searcher device access and receive information from the server and remote databases over the network.

The searcher device may include components, such as an input device and an output device.

How The Searcher Device Works

A searcher may operate the searcher device to input a natural language query via the searcher interface, and the processor may process the natural language query.

The searcher device may process the natural language query and search within a local database.

Also, the searcher device may send the natural language query to a remote server storing data tables and using a processor to analyze the natural language query.

The server may provide updates and may access remote databases a-b for a data query.

Thus, when a natural language query is received at the searcher device, upon translation of the query into a database query, the database query may be performed locally at the searcher device, at the data tables stored at the server, or at the remote databases (e.g., cloud, etc.).

And, the searcher device may have a locally installed spreadsheet application for a searcher to review data and enter a natural language query.

Aspects of Processing Natural Language Queries Using Spreadsheets

Natural language queries go through a searcher interface.

Natural Language Query Data

Natural language queries may be questions from searchers such as "how's the growth of monthly total sales?" or "what is the average score of MATH 301?"

Natural language queries may be input by a searcher with input devices or articulated by a searcher using a microphone through the searcher device.

Natural language queries may come from an analytics application and go through to the server via an application programming interface (API) from another program.

For example, a business analytics software may include a list of business analytics questions such as “how’s the growth of monthly total sales” in a natural language. Then, the question may go to the server.

Natural language queries may originate in many different natural languages. They may be translated into a language compatible with the platform (e.g., the operating system, the natural language query processing tool, and/or the like), such as English.

Key terms may be extracted from natural language queries to generate a query string.

The parsing may happen at the searcher device.

The server may receive a parse request over HTTP from the searcher device.

The server may send a request to an analytics module.

An Example Answer to a Natural Language Question

For example, for a natural language question, “what is the monthly growth of sales,” extracted words in the question may get assessed to rule out words such as “what,” “is,” “the,” “of,” etc. as meaningful query terms.

Words such as "monthly growth" and "sales" may be query terms based on previously stored query term rules and/or heuristics from previously processed queries.
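A toy sketch of that parsing step follows; the stopword and known-term lists are invented stand-ins for the patent's stored rules and heuristics.

```python
# A sketch of the parsing step: filter function words, then match what
# remains against previously stored query terms.
STOPWORDS = {"what", "is", "the", "of", "how's"}
KNOWN_TERMS = {"monthly growth", "sales", "average score"}

def extract_key_terms(query):
    words = [w for w in query.lower().strip("?").split() if w not in STOPWORDS]
    text = " ".join(words)
    return sorted(term for term in KNOWN_TERMS if term in text)

print(extract_key_terms("What is the monthly growth of sales?"))
# ['monthly growth', 'sales']
```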

The query string may optionally go to the server. Alternatively, natural language queries may work within one or more spreadsheets that are locally stored on the searcher device.

One or more data tables or spreadsheets, or a grid range of a spreadsheet, may be relevant to the query string.

A table detection module may output tables detected from originally stored data tables or spreadsheets, e.g., based on heuristics or machine learning. For example, natural language key terms from the query string may identify relevant data tables/spreadsheets.

When the query string includes key terms such as "growth," "monthly," and "sales," data tables/spreadsheets that have a column or a row recording monthly sales may be identified.

Data Tables/Spreadsheets Could Also Work Based on Previously Used Data Tables for Similar Query Strings,

As another example, data tables/spreadsheets could also be identified based on previously used data tables for similar query strings, e.g., when a natural language query "what is the monthly distribution of sales" identified a certain data table, the same data table may be used for the query "how's the growth of monthly total sales" as well.

The selected range of cells from the data table may change in orientation if necessary.

The searcher may manually select the cells by selecting a single cell or a range of cells that may belong to a table.

The cells surrounding the selection could have possible table structures.

A table schema may work with the selected range of cells.

Sometimes when the whole table schema is too small, to avoid communication of a large number of small messages from the computer to the server and improve communication efficiency, several table schemas may be in a batch request to the server.

When the identified table is too large to include in an XMLHttpRequest (XHR) request, the searcher device may only send the grid range of the detected table (for chart recommendations), and the server may determine a table structure from the sent grid range.

The server may prepare a table summary by extracting the dimensions, columns, rows, metrics, dimension filters, and/or other characteristics of the detected data table and map the extracted table characteristics to cell ranges or addresses in a spreadsheet.

For example, for a data table recording monthly sales data of the year, the table summary may include the number and index of rows and columns, the corresponding value in a cell identified by the row and column number, the metric of the value, and/or the like.

The server may extract operations to the data table and translate the operations into one or more formulas executable on the data table.

The server may send the formula(s) back to the searcher device. The formula(s) may be in the detected data table to generate a result in response to a natural language query.

In some implementations, the generated result may get a different visualization, such as, but not limited to, a pie chart, a data plot, and/or the like.

Feedback on a Data Tables Result

When the searcher receives the result in response to the original question via a searcher interface, the searcher may provide feedback on the result.

For example, the searcher may provide a positive rating if the result is accurate.

Or, the searcher may submit a negative rating when the result is unsatisfactory, e.g., misinterpreting the question, insufficient data, etc.

When the searcher feedback is positive, the server may save the formula-building objects, such as the table summary and formula(s) associated with the query string, for machine learning purposes. The formula may then serve as a reference for similar questions.

When the searcher feedback is negative, the server may disassociate the formula-building objects with the question. When similar questions get received, such questions are not interpreted in the same way.

The server may optionally obtain further information from the searcher’s feedback on the result.

For example, suppose the searcher asks, "how's the monthly growth of sales," a result showing the monthly increase from last month to the current month is returned, and the searcher still submits negative feedback. The searcher interface may then prompt the searcher to provide further information.

The searcher may re-enter the question with a time period “how’s the monthly growth of sales from ______ to ______?” Or the searcher interface may prompt the searcher to confirm whether the identified data entities “monthly growth” and “sales” are accurate.

Another example is that the searcher interface may provide suggested queries to the searcher if the server fails to parse and identify what the natural language query is.

Other additional methods may work for the searcher to provide further detailed feedback to refine the question.

The server may provide an alternative interpretation of the query string based on information obtained and generate an alternative formula using the alternative table summary.

Then the server may proceed to provide the updated result to the searcher.

Data Flows between the Client-Side and the Server-Side to Process Data Tables Natural Language Queries

A searcher interface may present an answer panel, which may post a query request to a backend.

Natural Language Entity Tables

The query may include a query string (e.g., the question asked by a searcher, or key terms extracted from the original natural language question asked by the searcher, etc.), a list of data entities (e.g., table schema generated based on key terms from the query string, etc.), a grid range from an existing data table, or the like.

The backend server may run using a java environment. It may pass the query request to a series of modules such as a get-answer action module, an entity list extractor, an analytics module, a query interpreter, a table detector, or the like.

The get-answer action module may act as a communication interface receiving the client request, which may include query parameters such as a query string (e.g., the question asked by the searcher, etc.), a grid range of the data table detected in and around cell selection, or the like.

If the request has reached the server, the grid range may contain a constructed table. If a data table is not detected or the selected grid range does not contain any data, the answer panel interface may not be shown to a searcher initially.

The get-answer action module may send the grid range information to the entity list extractor to get a table view of the data entity list based on the grid range information, e.g., a sub-table with columns and rows defining relevant data entities.

How The Entity List Extractor May Construct a Table Schema

The entity list extractor may construct a table schema, e.g., a data entity list including data entities relevant to the query.

The entity list extractor may get a table summary (e.g., including column headers, data types, labels column, cell metrics, and/or other properties) from the table detector.

The entity list extractor may also build a typed table from the grid range and pass it on to the table detector for summarization.

The entity list extractor may provide a table view that represents the data entity list.

The entity list may be in a data structure as a per-table knowledge graph, represented by graph nodes such as but not limited to dimensions, dimension filters, metrics, and/or the like.

Dimensions may include the header of a column whose values act as row keys (or labels) into the table.

For example, “Country” will be a dimension in a table with country names as labels or row keys).

Dimension filters may include values in the dimension column (row keys/label column).

For example, “India” “U.S.A” is the dimension filters for the dimension “Country.” Metrics may include all number columns taken as metrics or column values.

Generally, a searcher may look for metrics for a particular dimension filter (or label). For example, in the string “Population of India,” “Population” is a metric, and the dimension filter is “India” for dimension “Country.”

The entity list extractor may provide an entity list table view to the get-answer action module.

The entity list table view may extract metrics, dimensions, and dimension filters from the table summary.

For example, all column headers that correspond to cells with numeric values are metrics (e.g., a column header “population” is a metric as in the above example), all string and date/time column headers are dimensions (e.g., a column header “country,” a text string, is a dimension).

The values in these dimension columns are dimension filters (e.g., values under the column header “country” such as “The U.S.A.,” “India,” etc., are dimension filters).

How Metrics, Dimensions, and Dimension Filters Can be Applied

Other determinations of the metrics, dimensions, and dimension filters can be applied.

In addition, the entity list table view may serve to reverse-lookup row and column indices given a dimension, metric, or dimension filter string, which may map parameters such as dimensions, metrics, and dimension filters back to grid row and column indices during formula construction.

To allow this, the entity list table view may provide a metrics-to-column number map, a dimensions-to-column number map, and a dimension-filters-to-row-and-column pair map.
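A sketch of those reverse-lookup maps follows, using the patent's own Country/Population example. The column-classification rule (numeric headers become metrics, string headers become dimensions) matches the description above; the function name and layout are mine.

```python
# A sketch of the entity list table view's maps: metric -> column number,
# dimension -> column number, and dimension filter -> (row, column) pair.
def build_entity_maps(headers, rows):
    metrics, dimensions, dim_filters = {}, {}, {}
    for col, header in enumerate(headers):
        column_values = [row[col] for row in rows]
        if all(isinstance(v, (int, float)) for v in column_values):
            metrics[header] = col                 # numeric column -> metric
        else:
            dimensions[header] = col              # string column -> dimension
            for r, value in enumerate(column_values):
                dim_filters[value] = (r, col)     # value -> (row, column)
    return metrics, dimensions, dim_filters

headers = ["Country", "Population"]
rows = [["India", 1_380_000_000], ["U.S.A.", 331_000_000]]
print(build_entity_maps(headers, rows))
# ({'Population': 1}, {'Country': 0}, {'India': (0, 0), 'U.S.A.': (1, 0)})
```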

The table detector may extract information from a data table and generate a table summary, determining what entities in the table can generate a formula to derive the query term.

Tables can be generally represented as a common object, which stores the data in a table, the column headers and types of data in the columns, and derived facts about the data.

The table detector may extract information from a data table in several steps.

First, light parsing of cells and inference of column headers and data types may be performed.

For cells with numeric values between 1900 and 2100, the cells may be years instead of pure numeric values.

The table detector may then filter out spurious rows and columns, including but not limited to empty rows/columns, numeric columns with ID numbers, columns for taking notes, and/or the like.

The table detector may then add column-based statistics.

For example, for all column types, the number of missing or distinct values may be tracked.

The number of negative/positive/floats/zero values and the sum, standard deviation, monotonicity, and uniformity of the array may be recorded.
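Here is a compact sketch of that light parsing and statistics pass. The 1900-2100 year heuristic comes from the patent's own example; the exact set of statistics collected below is a simplified assumption.

```python
# A sketch of the detector's light parsing: infer a cell type (numbers
# between 1900 and 2100 are treated as years) and collect column statistics.
def infer_cell_type(value):
    if isinstance(value, (int, float)):
        return "year" if 1900 <= value <= 2100 else "number"
    return "string"

def column_stats(values):
    numbers = [v for v in values if isinstance(v, (int, float))]
    return {
        "distinct": len(set(values)),
        "missing": sum(1 for v in values if v in (None, "")),
        "sum": sum(numbers),
        "positives": sum(1 for v in numbers if v > 0),
    }

print(infer_cell_type(1999), infer_cell_type(42))  # year number
print(column_stats([10, 20, None, 20]))
# {'distinct': 3, 'missing': 1, 'sum': 50, 'positives': 3}
```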

The table object created from the input table cell values from the data table may create an aggregate table.

Each column in the aggregate table may determine the number of unique values compared to the total values (e.g., the range of data values).

If the column is categorical (e.g., when the unique values in the column are a subset of the entire spectrum of data values), then the column may create an aggregated table.

For each categorical column, two aggregated objects may be associated with the column.

A new “count” aggregated object may record information relating to the “count” of a unique value.

For example, each row of the object may represent a unique value. In each row, the first cell stores the unique value, and the second cell records the number of times that the respective unique value appears in the original categorical column.

A new “sum” aggregated object may record the total sum of each unique value in the original table.

For example, each row of the object represents a unique value, and each column of the object represents a categorical numeric column in the original table.

The value in each cell of the object represents a sum of unique values of all cells in the respective categorical column that contain the respective unique value (based on the respective row of the object).

The “count” and “sum” objects may be examples of objects for aggregation.

Or, average aggregation objects may use the count and sum of each unique value and may carry information from the original data table.
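A small sketch of those "count" and "sum" aggregated objects follows; the row-of-dicts table shape is my simplification of the patent's aggregate table objects.

```python
# A sketch of the "count" and "sum" aggregated objects: one entry per unique
# value in a categorical column, holding the count of its occurrences and the
# per-numeric-column sums for the rows carrying that value.
from collections import defaultdict

def aggregate(rows, categorical, numeric_columns):
    counts = defaultdict(int)
    sums = defaultdict(lambda: defaultdict(float))
    for row in rows:
        key = row[categorical]
        counts[key] += 1                      # the "count" object
        for col in numeric_columns:
            sums[key][col] += row[col]        # the "sum" object
    return dict(counts), {k: dict(v) for k, v in sums.items()}

rows = [{"team": "A", "score": 90}, {"team": "A", "score": 80},
        {"team": "B", "score": 70}]
print(aggregate(rows, "team", ["score"]))
# ({'A': 2, 'B': 1}, {'A': {'score': 170.0}, 'B': {'score': 70.0}})
```

As the text notes, an average aggregation falls out of these two: the sum for a unique value divided by its count.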

A Parse Request Sent From a Get-Answer Action Module

The get-answer action module may also send a parse request, including data entity list information and query information, to the analytics module, generating a parse response.

The parse response may include a structured data table/spreadsheet query represented as the query in the protocol buffer.

The query interpreter may interpret returned query response to an executable formula string using the entity list table view passed on from the get-answer action module.

The query interpreter may include various comparable classes for formula builder, e.g., a particular formula builder may correspond to one type of formula.

Here a given set and count of fields in the query may correspond to only one formula, e.g., a query with exactly two metrics corresponds to a correlation formula.

For example, the query interpreter may invoke a variety of operations.

An Example Query Scoring Operation

An example operation is a query scoring operation, e.g., scoreQuery(the query in the protocol buffer), which returns a score built by counting the number of fields of the input query that the formula builder can consume, or returns a negative or zero score if the fields in the query are not enough to build a formula.

For example, if an input query in the protocol buffer having two dimension filters and a dimension goes to a formula builder that requires at least one dimension filter and at least one dimension, the scoreQuery( ) operator may return a score of two: one point for satisfying the at-least-one-dimension requirement and one point for satisfying the at-least-one-dimension-filter requirement.

A non-zero score of two indicates that the parameters included in the query in the protocol buffer are enough for formula building.

A given query may have more than one formula builder that returns the same score. If another formula builder requires just two dimension filters, the input query in the above example would also score two with that builder.
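A rough sketch of how such a scoreQuery( )-style check could work for a single formula builder, assuming the query's fields have already been parsed out of the protocol buffer. The field names and the dictionary representation are illustrative assumptions:

```python
def score_query(query, min_dimensions=1, min_dimension_filters=1):
    """Hedged sketch of a scoreQuery()-style check for one formula builder.
    `query` is a dict of fields assumed to be parsed from the protocol buffer.
    """
    score = 0
    if len(query.get("dimensions", [])) >= min_dimensions:
        score += 1  # one point: the dimension requirement is satisfied
    if len(query.get("dimension_filters", [])) >= min_dimension_filters:
        score += 1  # one point: the dimension-filter requirement is satisfied
    # Anything short of both requirements cannot feed this builder.
    return score if score == 2 else 0

# The example above: one dimension and two dimension filters scores two.
score = score_query({"dimensions": ["month"],
                     "dimension_filters": ["year=2020", "region=US"]})
```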

The query interpreter may then run a getFormula(query in the protocol buffer, EntityListTableView) operation, based on the input query and the entity list table view.

After determining that a query score is a positive number, the query interpreter may return a formula built by joining data in the input values query in the protocol buffer and EntityListTableView.

The Query Interpreter

The query interpreter may take in a list of available (injected) formula builders and may interpret the input query in the protocol buffer by first scoring each formula builder by the number of fields of the input query it may consume.

This may filter out a set of formula builders that cannot understand the input query in the protocol buffer.

If there is at least one formula builder with a positive score for the input query in the protocol buffer, the formula builder with the highest score may be used to map the query to a formula.

In this way, the formula builder that consumes the maximum number of fields from the input query in the protocol buffer can construct the possible formula parses.

The query interpreter may be a class with many smaller formula builders plugged into it.

In this way, the query interpreter structure can be expandable with more formula builders.

When a different type of query is returned, a new formula builder can be added without changing the existing formula builders.
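Here is a short sketch of that pluggable design: builders are injected, each scores the query, and the highest positive scorer builds the formula. The method names echo the scoreQuery and getFormula operations described above, but the class layout and interfaces are assumptions, not Google's code.

```python
class QueryInterpreter:
    """Hypothetical pluggable interpreter with injected formula builders."""

    def __init__(self, formula_builders):
        self.formula_builders = formula_builders  # injected builders

    def interpret(self, query, entity_list_table_view):
        # Score every builder by how many query fields it can consume.
        scored = [(builder.score_query(query), builder)
                  for builder in self.formula_builders]
        best_score, best = max(scored, key=lambda pair: pair[0],
                               default=(0, None))
        if best_score <= 0:
            return None  # no builder understands this query
        # The highest scorer consumes the most fields and builds the formula.
        return best.get_formula(query, entity_list_table_view)
```

Adding support for a new query type then means appending one more builder to the injected list, leaving the existing builders untouched.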

When the get-answer action module receives a formula from the query interpreter, a JSON response including the formula may be returned to the answer panel at the frontend (e.g., at the client side).

The answer panel may then provide the formula to a formula preview calculator, which in turn generates a result based on the formula, and the answer panel provides the result to the searcher.

A Searcher Interface Showing the Answer Panel for Natural Language Queries over Data Tables

An example mobile interface shows screens of the answer panel.

The answer panel may also have an interface on a desktop computer, such as a browser-based application.

A searcher can type a natural language question in the query box, e.g., “how’s the growth of monthly totals?”

The query box may show a query suggestion in response to the searcher-entered question to help searchers better understand how to structure their own questions using precise terms.

The question intake at the query box may also automatically complete the question or correct typographical mistakes in it. In addition, the data entities for the query can be auto-completed.

Key terms in the query may be displayed in the same colors as the relevant sections of the spreadsheet to show how they relate to those sections.

An answer may show, e.g., a statement containing a calculated result of the “monthly total.”

The answer may include a human-friendly interpretation of the answer in natural language, e.g., “for every week, monthly total increases by,” and the calculated result, “$1,500.”

The query “how’s the growth of monthly totals” may take various visualization formats.

For example, a generated chart may show different data plots over a period of time: the monthly totals, commission income, sales of product and service income, etc., as related to the query "growth of monthly total."

The answer panel may further provide analytics of the data plots.

The answer screen may include a rating button, a "like" or "dislike" button, or a "thumbs up" or "thumbs down" button for the searcher to provide feedback on the answer to the original question asked.

Natural Language Queries in Spreadsheets at Google is an original blog post first published on Go Fish Digital.

What is Semantic SEO? https://gofishdigital.com/blog/what-is-semantic-seo/ Thu, 18 Mar 2021 13:00:37 +0000

SEO has always been marketing within the framework of the Web. In SEO, pages rank highly in search results based on information retrieval scores using relevance and authority signals from backlinks from other pages and sites. Semantic SEO connects site owners with consumers, making communication easier about the goods and services offered. Semantic SEO also focuses more on indexing real-world objects such as entities, along with the facts and attributes associated with those entities. Because of this, consumers can find information more easily, learn more about those offers, and become customers more readily.


What is Semantic SEO?

Semantic SEO is the process of using related topics and entities to help search engines better understand content. Semantic SEO helps provide search engines with more context about a particular page and makes the content more comprehensive. A Semantic SEO search results page can include:

  • Knowledge panels
  • Search carousels filled with entities
  • Featured snippets that may answer questions about the entities in a query
  • Related questions (“People also ask” questions that may be like the featured snippet answers)
  • Related entities
  • More

There is a lot of variety in Semantic SEO SERPs, just as there is in universal search SERPs.

When Did Semantic SEO Start at Google?

I found a copy of what appears to be the second patent filed at Google. Like Lawrence Page's provisional PageRank patent, Sergey Brin filed a provisional patent in 1999 for another algorithm, which he coined "Dual Iterative Pattern Relation Expansion." I wrote about the patent in the post Google's First Semantic Search Invention was Patented in 1999, which includes an image from Sergey Brin's patent showing semantic extraction. This provisional patent was filed at Google less than a year after the PageRank algorithm. It would lead to Google starting the Annotation Framework project, which saw many patents filed at Google, including one on a Browseable Fact Repository. That repository was managed in-house at Google until it acquired MetaWeb and its volunteer-run knowledge base, Freebase.

Entities and Semantic SEO

Focusing on entities can make a lot of sense. For example, a website about the City of Baltimore might contain information about the people who lived there, and it can also look at monuments left behind in their memories. It can also tell you about the famous churches and schools in the city and well-known buildings, places, and businesses. I started doing entity optimization in 2005 and wrote about my efforts in the post Entity Optimization: How I Came to Love Entities. Closely related is a more recent post telling us that Google focuses more on entities in search results, Clustering Entities in Google SERPs Updated. If you write about Baltimore, you would want to dig into the city's history. You would pay attention to the people and places that visitors might be most interested in finding out more details about, and learn about the past that brought America its national anthem. Instead of optimizing a page for terms or phrases, write about the actual people, places, and things that tell visitors the information they might be most interested in, and let them know about facts associated with those entities. In the mid-2000s, Google had engineers working on a project known as the Annotation Framework. This project was from Andrew Hogue, who was also responsible for managing the acquisition of MetaWeb, with its volunteer-run knowledge base, Freebase. If Google could not build the technology behind entities, they could acquire it. You can learn more about those efforts by reading the resume of Andrew Hogue, which includes information about what he was doing at Google during his career. He also created a Google Tech Talk video about what Google was doing at the time that is worth watching: The Structured Search Engine.

More Facts in Search Results Using Semantic SEO

Another aspect of Semantic SEO at Google is the move away from ten blue links to search results filled with rich results, initially described in a Google blog post by Ramanathan Guha and also covered in the post Introducing Rich Snippets by Kavi Goel, Ramanathan V. Guha, and Othar Hansson. In 2012, Google expanded the information found in Freebase and gave us search results that provided more information about entities that appear in queries, or at least the entities that Google knows about and may have included in the knowledge graph. (See: How Google's Knowledge Graph works.) We can see more information about entities in knowledge panels, often following knowledge templates based on the type of entity they cover. For local business entities, we will often see sentiment-based reviews about those businesses, and we may see query revisions related to the entities displayed in the panels. Knowledge panels often show us query log information about other entities that "people also search for."

Providing More Information about Entities by Site Owners Using Schema

In 2011, Google joined with other search engines to provide machine-readable information about entities that appear on pages on the Schema.org website. This collaboration between search engines was similar to their earlier development of XML sitemaps. Schema is a rapidly growing area of Semantic SEO, and ongoing effort goes into updating Schema with new releases. It is possible to get involved in discussions about new and developing Schema by subscribing to the public-schemaorg@w3.org mail archives. SEOs can learn about Schema these days as part of Semantic SEO and study its developing aspects. For example, showing stars in search results for products can increase clicks and is worth learning more about. I also keep an eye on new emails from the Schema mailing group and recent revisions to Schema as they come out.

Knowledge Growing On the Web in Semantic SEO?

Sergey Brin filed an early patent for Google in 1999, a year after Lawrence Page filed the PageRank patent. I wrote about it in the post Google's First Semantic Search Invention was Patented in 1999. This patent was about the DIPRE algorithm, named after "Dual Iterative Pattern Relation Expansion." It could be used to find sites with information about specific books and attributes of those books, such as who published them, how many pages each had, and more. If a site had all of the books from the seed set in the patent, the algorithm would collect information about the other books that site contained. At the start of last year (2020), Google filed a continuation patent to collect information about entities to display in search results; this continuation is about all kinds of entities and not only books anymore. My post about this patent is Rich Results Patent from Google Moves on from Only Books, and it provides more details about collecting facts about entities than the 2009 Google blog post about rich results.

Google is Focusing on Learning Directly from Pages on the Web

Google patents also tell us how Google might collect information about entities directly from the Web. One of the most detailed describes using natural language processing, part-of-speech information, and entity recognition to build triples (subject/verb/object) about those entities. For more details on how Google may do that, see Entity Extractions for Knowledge Graphs at Google. Besides the entity extraction approach described in that patent, another extraction approach uses data wrappers, which Google has been applying to name, address, and phone information in local search for as long as Google Maps has been around. This approach is used on sites that collect data about many similar entities; examples are television episodes, songs from musical artists, and business addresses. It is another way that Google uses Semantic SEO to better learn about facts that it might show as answers to queries from searchers. I wrote about this in the post Extracting Entities with Automated Data Wrappers.

Expanding Meanings for Answers by Rewriting Queries

SEOs have been doing Semantic SEO for almost as long as there has been SEO. We don't just optimize pages for keywords; we optimize pages for meanings, because the search engines can understand that we create pages related to the meanings in a query that a searcher might perform. In 2003, Google started rewriting queries by substituting synonyms for words. Google developed more sophisticated ways of substituting synonyms with the Hummingbird approach, which I wrote about when it came out in the post The Google Hummingbird Update and the Likely Patent Behind Hummingbird. That patent came out a few weeks before Google announced Hummingbird on Google's 15th birthday.

Semantic SEO Uses New Technology, Such as Language Models

Understanding Semantic SEO means knowing the technologies and approaches Google uses to extract entity information from the Web and to build knowledge graphs with that technology. Google acquired Wavii in 2013, which used an open-domain information extraction process described in videos and papers; I wrote about it in With Wavii, Did Google Acquire the Future of Web Search? Since then, we have seen Google crawling the Web, trying to learn about everything it can. This effort has used word vectors and natural language processing language models such as BERT and MUM. Google later told us about using a word vectors approach to rewrite ambiguous queries and expand them with missing words. These rewrites could capture missing meanings and answer queries that Google had had difficulties with before. I linked citations behind the word vectors approach in the post Citations behind the Google Brain Word Vectors Approach. Google has also expanded natural language processing using pre-trained language models such as BERT, described in several papers from Google in the past few years. I wrote about how a word vector approach (like in RankBrain) was used for question-answering in the post Question Answering Using Text Spans With Word Vectors. Here is an example of a search result where Hummingbird replaces the word "place" with the word "restaurant." This result shows real-world entities and an understanding of the meaning behind the words in a query. It is about how SEO is becoming Semantic SEO.

SERPs Augmented with Knowledge Results

A few years ago, Google's Paul Haahr presented at SMX West in How Google Works: A Google Ranking Engineer's Story. A Google patent that came out after that presentation told us that Google would look in queries for entities, as Paul Haahr described. If Google finds an entity in a query, it may decide to augment the search results with knowledge-based results; again, a Semantic SEO approach to search. I wrote about this patent in Augmented Search Queries Using Knowledge Graph Information. This can cause a search result for a query to show a knowledge panel for the entity in the query. Related questions about similar entities may be asked and answered in the search results for that query, and associated entities may be shown in those results as well. This is a relatively common sight in Semantic SEO, where we are moving away from search results filled with ten blue links and seeing much more colorful and clickable SERPs.

See People Also Ask Queries and Ontologies in Image Search Labels

So again, Google shows us that Semantic SEO focuses on finding real-world objects in the queries that searchers perform. Knowledge results can include featured snippets, answering many people's questions about those entities. Those results can also include other questions, often referred to as "related questions" or "people also ask" questions. Google may crowdsource those related questions by looking in query logs for related questions in a question graph; I wrote about those in Google Related Questions now use a Question Graph. I pointed out to my co-workers the labels that Google uses for categories. Google tells us in a blog post, Improving Photo Search: A Step Across the Semantic Gap, that they associate machine ID numbers with entities in image search. Categories in image search show related entities and concepts in an ontology based on your search terms, which I detailed in Google Image Search Labels Becoming More Semantic? If you are doing keyword research for pages and want to better understand related entities and concepts, search for those terms on Google Image search. Labels can tell you about entities and terms related to people, places, and things.

Learn About Entities and Personalized Knowledge Graphs from Google

An open-access (free) book by a computer scientist who worked as a visiting scholar at Google comes highly recommended. The author is Krisztián Balog, and his book is Entity-Oriented Search; it covers the use of entities at Google quite well. He also wrote a paper on entities for a conference worth looking at, Personal Knowledge Graphs: A Research Agenda. The idea of personalized knowledge graphs was exciting, and I ran across a couple of Google patents that covered that area and wrote about them.

Semantic Topic Models in Semantic SEO

In 2006, I wrote about Anna Lynn Patterson's phrase-based indexing in Move over PageRank: Google is Using Phrase-Based Searching? I have expanded on this many times over the years; Google has around 20 related patents on different aspects of phrase-based indexing. I added a post called Thematic Modeling Using Related Words in Documents and Anchor Text, which shows how frequently reappearing co-occurring phrases tend to be very predictive of what the pages they appear on are about. A couple of years later, I wrote about a continuation patent for phrase-based indexing that turned it from a reranking approach into a direct ranking approach: Google Phrase-Based Indexing Updated. This can make phrase-based indexing very valuable.

Answering Queries using Knowledge Graphs

In Answering Questions Using Knowledge Graphs, I wrote about association scores giving different weights to tuples about entities, with their sources providing them with significance. I also wrote about how Google might receive a query and create a knowledge graph to provide answers; this is in the patent application Natural Language Processing With An N-Gram Machine. In the post Ranked Entities in Search Results at Google, I provide several examples of search carousels showing entities that answer queries like those in that patent application. A search carousel with ranked entities appears in SERPs for a query at Google such as "Best Science Fiction Books 2020." Those books come from query results filled with entities, shown in a carousel in ranked order. Knowledge graphs can also show personalized results for people and answer their questions, which I covered in the post User-Specific Knowledge Graphs to Support Queries and Predictions.

Answering Questions Using Automated Assistants

One of the newest technologies to come from Google is the Multitask Unified Model, or MUM. Google published a patent that describes the interactions between humans and its automated assistant using the MUM technology, which I wrote about in Google MUM Update. MUM is supposedly 100X more powerful than BERT and is at the heart of answering questions at Google.

Semantic Keywords in Semantic SEO and No LSI Keywords

Under Semantic SEO, search will collect real-world information and answer relevant questions. Semantic SEO brings us to a world of intelligent devices and the internet of things, which can mean smarter cars and kitchen devices and connections with many others worldwide. I recently wrote about semantic keywords and the possibility of a new tool from Google that suggests such keywords; more about keywords and that tool is in this post: Semantic Relevance of Keywords. In that post, I mention LSI keywords and how they likely are not used by search engines such as Google. I researched LSI to learn where it came from and whether or not Google was using latent semantic indexing. After finding the Bell Labs patent behind LSI, and no sign of "LSI keywords" from the people who invented LSI, I wrote Does Google Use Latent Semantic Indexing (LSI)? The answer is that it is unlikely that Google uses LSI at all and that it never used LSI keywords either. In 2021, I wrote about how users of mobile devices could have their queries rewritten following a user-specific knowledge graph built from application data on the device, in a way that may even be transparent to those searchers. It could better use that data and provide answers that take advantage of more personalized results. That post is Rewritten Queries and User-Specific Knowledge Graphs.

The Web Becoming Ready for Semantic SEO?

In 2001, Tim Berners-Lee, James Hendler, and Ora Lassila wrote The Semantic Web for Scientific American. The sharing of information and data collection described in that paper tells us about the future of Semantic SEO that companies such as Google work towards. A more semantic web doesn't stuff pages with synonyms or semantically relevant words. It is, as Tim Berners-Lee wrote:

The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

I am coming across more semantic indexing of locations in places such as Google Maps, which I wrote about in An Image Content Analysis And A Geo-Semantic Index For Recommendations and in the post Detecting Brand Penetration Over Geographic Locations; both of these show a much stronger level of image analysis that can tell searchers more about the places that surround them. I thought it was worth sharing some of the things I have seen from the patent office, and on Google's own pages, about search becoming more semantic. Most pages ranking in Google for terms such as "Semantic SEO" are filled with synonyms and show a lack of understanding of how semantic technology might work. They also mention technology from the 1980s and fail to cover knowledge graphs or Schema, and that failure to cover those topics is a problem. Last Updated: 1/10/2022

What is Semantic SEO? is an original blog post first published on Go Fish Digital.

SEO Turns to Data Graphs to Learn About the Web https://gofishdigital.com/blog/data-graph/ Mon, 22 Feb 2021 17:51:47 +0000

The Web as Data Graphs is a New Direction for SEO

Many of the articles that people write about SEO involve web pages and the links between them. This post, though, is about entities, the relationships between entities, and the facts written about them on web pages. It also looks at responses to queries from data graphs on the web about facts and attributes related to entities found on web pages. I recently came across a patent filing on the WIPO (World Intellectual Property Organization) site that I thought was worth writing about. The patent starts by telling us that it is about:

Large data graphs store data and rules that describe knowledge about the data in a form that provides for deductive reasoning.

The title of the patent tells us that it is primarily about submitting queries to a search engine in natural language (the way people talk, which computers try to understand).


The patent shows us an example related to data graphs: entities, such as people, places, things, concepts, etc., may be stored as nodes. The edges between those nodes may indicate the relationships between the nodes (facts that people can find out about those entities). In SEO, we are used to hearing about web pages as nodes and links between those pages as edges.

This approach to entities is a different way of looking at nodes and edges. We have most recently seen people talking about mentions of entities in place of links pointing at pages. It is one way that SEO is moving forward, thinking about real-world objects such as entities when talking about a large database such as the web. The second patent from Google (a provisional one) that I am aware of was about facts and a large database. I wrote about it in Google's First Semantic Search Invention was Patented in 1999.

I wrote about a more recent patent at Google on how the search engine may read the web, extract entity information from it, and use the Web as a large scattered database. That post is Entity Extractions for Knowledge Graphs at Google. We have also seen information online about pre-trained models such as BERT that can tag words in a document with parts of speech. They can also identify and recognize entities extracted from pages and learned about by the search engine.

This newest patent tells us that in such data graphs, nodes such as "Maryland" and "United States" could be linked by the edges "in country" and/or "has state."

We are also told that the basic unit of such data graphs is a tuple that includes two entities and a relationship between the entities.

Those tuples may represent real-world facts, such as “Maryland is a state in the United States.”

The tuple may also include other information, such as context information, statistical information, audit information, etc.
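A tuple like this is easy to picture in code. Below is a minimal Python sketch of the basic unit the patent describes, two entities joined by a relationship plus optional extra information; the class and field names are mine, not the patent's:

```python
from typing import NamedTuple, Optional

class GraphTuple(NamedTuple):
    """Two entities and the relationship between them, plus optional
    extra information such as context, statistics, or audit data."""
    entity: str          # e.g., "Maryland"
    relationship: str    # e.g., "is a state in"
    related_entity: str  # e.g., "United States"
    context: Optional[dict] = None  # provenance, statistics, audit info

fact = GraphTuple("Maryland", "is a state in", "United States",
                  context={"source": "example.com", "confidence": 0.97})
```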

Adding entities and relationships to a data graph has typically been a manual process, making the construction of large data graphs difficult and slow.

And the difficulty in creating large data graphs can result in many “missing” entities and “missing” relationships between entities that exist as facts but have not yet been added to the graph.

Such missing entities and relationships reduce the usefulness of querying data graphs.

Some implementations extract syntactic and semantic knowledge from text, such as from the Web, and combine this with semantic knowledge from data graphs.

Building Confidence About Relationships Between Entities and Facts

The knowledge extracted from the text and the data graph is used as input to train a machine-learning algorithm to predict tuples for the data graph.

The trained machine learning algorithm may produce multiple weighted features for a given relationship, each feature representing an inference about how two entities might be related.

The absolute value of the weight of a feature may represent the relative importance in making decisions. Google has pointed out in other patents that they are measuring confidence between such relationships and are calling those weights “association scores.”

The trained machine learning algorithm can then create additional tuples from a data graph from analysis of documents in a large corpus and the existing information in the data graph.

This method provides many additional tuples for the data graph, greatly expanding those data graphs.

In some implementations, each predicted tuple may be associated with a confidence score, and only tuples that meet a threshold are automatically added to the data graph.
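That thresholding step might look like this sketch, where the threshold value is purely illustrative (the patent does not give one):

```python
CONFIDENCE_THRESHOLD = 0.9  # illustrative value; the patent names none

def route_predicted_tuples(predictions):
    """Auto-add high-confidence predicted tuples to the data graph and
    queue the rest for manual verification.
    `predictions` is a list of (graph_tuple, confidence_score) pairs."""
    auto_added, needs_review = [], []
    for graph_tuple, score in predictions:
        if score >= CONFIDENCE_THRESHOLD:
            auto_added.append(graph_tuple)    # meets the threshold
        else:
            needs_review.append(graph_tuple)  # held for manual verification
    return auto_added, needs_review
```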

Facts represented by the remaining tuples may be manually verified before being added to data graphs.

Some implementations allow natural language queries to be answered from data graphs.

The machine learning module can map features to queries, and those features are used to provide possible query results.

The training may involve using positive examples from search records or query results obtained from a document-based search engine.

The trained machine learning module may produce multiple weighted features, where each feature represents one possible query answer, represented by a path in the data graph.

The absolute value of the weight of the feature represents the relative importance in making decisions.

Once the machine learning module has been properly trained with multiple weighted features, it can respond to natural language queries using information from the data graph.

Generate a Data Graph

A computer-implemented method includes receiving a machine learning module trained to produce a model with multiple weighted features for a query. Each weighted feature represents a path in a data graph.

The method also includes receiving a search query that includes a first search term, mapping the search query to the query, mapping the first search term to a first entity in the data graph, and identifying a second entity in the data graph using the first entity and at least one of the multiple weighted features.

The method may also include providing information relating to the second entity in response to the search query.

The query may be a natural language query.

As another example, the method may include training the machine learning model to produce the model, which is the focus of this patent.

Obtaining Search Results From Natural Language Queries From a Data Graph

Training the machine learning module may include generating noisy query answers and generating positive and negative training examples from the noisy query answers.

Generating the noisy query answers may include obtaining search results from a search engine for a document corpus, each result having a confidence score. Generating the training examples can then include selecting a predetermined number of the highest-scored documents as positive training examples and a predetermined number of documents with scores below a threshold as negative training examples.

Obtaining search results can include reading search results from search records for past queries.

Generating positive and negative training examples can include performing entity matching on the query answers and selecting entities that occur most often as positive training examples.
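As a rough illustration of selecting training examples from noisy, scored answers, consider this sketch; the cutoff values and the dictionary layout are invented for the example:

```python
def build_training_examples(scored_results, top_k=5, negative_cutoff=0.2):
    """Select positives and negatives from noisy, scored query answers.
    `scored_results` is a list of {"doc": ..., "score": float} dicts;
    `top_k` and `negative_cutoff` are illustrative, not from the patent."""
    ranked = sorted(scored_results, key=lambda r: r["score"], reverse=True)
    positives = ranked[:top_k]  # highest-scored documents become positives
    negatives = [r for r in ranked if r["score"] < negative_cutoff]
    return positives, negatives
```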

The method may also include determining a confidence score (like the association scores referred to above) for the second entity based on the weight of at least one weighted feature.

Identifying the second entity in the graph may also include selecting the second entity based on the confidence score, and determining the confidence score for the second entity may include determining that two or more features connect to the second entity and using a combination of the weights of the two or more features as the confidence score for the second entity.

A computer-implemented method includes training a machine learning module to create multiple weighted features for a query and receiving a request for the query.

The method also includes determining a first entity from the request for the query, the first entity existing in a data graph having entities and edges, and providing the first entity and the query to the machine learning module.

This method may also include receiving a subset of the multiple weighted features from the machine learning module; and generating a response to the request that includes information obtained using the subset of the multiple weighted features.

These can include one or more of the following features. For example, training the machine learning module can include:

  • Selecting positive examples and negative examples from the data graph for the query
  • Providing the positive examples, the negative examples, and the data graph to the machine learning module for training
  • Receiving the multiple weighted features from the machine learning module, each feature representing a walk in the data graph
  • Storing at least some of the multiple weighted features in a model associated with the query

Some features of this process include limiting the path length for features to a predetermined length, where the path length is the number of edges traversed in the path for a particular feature, and/or generating the positive and negative examples from the search records of a document-based search engine.

The multiple weighted features may exclude features occurring less than a predetermined number of times in the data graph.

Generating the response to the query can include determining a second entity in the data graph with the highest weight and including information from the second entity in the response.

The weight of the second entity can be the sum of the weights of each feature associated with the second entity. The query can also represent a cluster of queries.
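A toy sketch of that weight-summing step, with invented entities and weights:

```python
def rank_answer_entities(candidate_features):
    """Score each candidate second entity by summing the weights of all
    features (paths) that reach it, then pick the best.
    `candidate_features` maps entity -> list of feature weights."""
    scores = {entity: sum(weights)
              for entity, weights in candidate_features.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Invented example: two features reach "basketball player", one reaches "coach".
best, score = rank_answer_entities({
    "basketball player": [0.6, 0.3],  # summed weight 0.9
    "coach": [0.4],
})
```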

Also, a computer system can include a memory storing a directed edge-labeled data graph constructed using tuples, where each tuple represents two entities linked by a relationship, at least one processor, and memory storing instructions that, when executed by at least one processor, can cause the computer system to perform operations.

Those operations can include:

  • Receiving query
  • Generating query answers for the query
  • Generating positive and negative training examples from the query answers
  • Providing the positive examples, the negative examples, and the data graph to a machine learning module for training

The operations may also include receiving a plurality of features from the machine learning module for the query and storing the plurality of features as a model associated with the query in the machine learning module.

These implementations can include one or more further features, such as the features being weighted and the query being a natural language query.

The set of features may also exclude features that occur less than a predetermined number of times in the data graph and features with a probability of reaching a correct target that falls below a predetermined threshold.

As part of generating query answers, the instructions, when executed by the at least one processor, can:

  • Cause the computer system to identify a query template for the query
  • Examine search records for queries matching the query template
  • Obtain search results from the search records for queries matching the query template

As part of generating positive and negative training examples, the instructions:

  • Cause the computer system to extract a source entity from a query in the search records that matches the query template
  • Extract entities from the search results of the query that matches the query template
  • Determine the number of times a target entity appears in the search results of the query that matches the query template
  • Use the source entity and the target entity as a positive training example if the number of times meets a threshold

The features may be weighted.

Each of the features can have its own associated weight.

A feature can be a path through the data graph with an associated confidence score. The path may represent a sequence of edges in the data graph.

The patent tells us about the following advantages of using the process in the Querying a Data Graph patent:

    1. Implementations may automatically extend a data graph by reading relational information from a large text corpus, such as documents available over the Internet or other corpora with more than a million documents, and combine this information with existing information from the data graph
    2. Such implementations can create millions of new tuples for a data graph with high accuracy
    3. Some implementations may also map natural language queries to paths in the data graph to produce query results from the data graph
    4. One difficulty with natural language queries is finding a match between the relationships or edges in the data graph to the query
    5. Some implementations train the machine learning module to perform the mapping, making natural language querying of the graph possible without a manually entered synonym table that can be difficult to populate, maintain, and verify exhaustively

This patent can be found here:

Querying a Data Graph Using Natural Language Queries
Inventors Amarnag Subramanya, Fernando Pereira, Ni Lao, John Blitzer, Rahul Gupta
Applicants GOOGLE LLC
US20210026846
Patent Filing Date October 13, 2020
Patent Number 20210026846
Published: January 28, 2021

Abstract

Implementations include systems and methods for querying a data graph. An example method includes receiving a machine learning module trained to produce a model with multiple features for a query, each feature representing a path in a data graph.

The method also includes receiving a search query that includes a first search term, mapping the search query to the query, and mapping the first search term to a first entity in the data graph.

The method may also include identifying a second entity in the data graph using the first entity and at least one of the multiple weighted features and providing information relating to the second entity in response to the search query.

Some implementations may also include training the machine learning module by, for example, generating positive and negative training examples from an answer to a query.

Understanding a Data Graph Better

The patent describes a syntactic-semantic inference system with an example implementation.

This system could be used to train a machine learning module to recognize multiple weighted features or walks in the data graph, to generate new tuples for the data graph based on information already in the graph and/or based on parsed text documents, as I examine in the Entity Extraction patent I linked to above or another patent on Knowledge graph reconciliation that I have also written about.

The system can work to generate search results from the data graph from a natural language query.

This patent describes a system that would use documents available over the Internet.

But, we are told that other configurations and applications may be used.

These can include documents originating from another document corpus, such as internal documents not available over the Internet or another private corpus, from a library, books, corpus of scientific data, or other large corpora.

The syntactic-semantic inference system may be a computing device or devices that take the form of several different devices, for example, a standard server, a group of such servers, or a rack server system.

The syntactic-semantic inference system may include a data graph. The data graph can be a directed edge-labeled graph. Such a data graph stores nodes and edges.

The nodes in the data graph represent an entity, such as a person, place, item, idea, topic, abstract concept, concrete element, another suitable thing, or any combination of these.

Entities in the data graph may be related to each other by edges representing relationships between entities.

For example, the data graph may have an entity that corresponds to the actor Kevin Bacon. In addition, the data graph may have an "acted in" relationship between the Kevin Bacon entity and entities representing movies that Kevin Bacon has acted in.

A data graph with many entities and even a limited number of relationships may have billions of connections.

In some implementations, data graphs may be stored in an external storage device accessible from the system.

In some implementations, the data graph may be distributed across multiple storage devices and/or multiple computing devices, for example, multiple servers.

The patent provides more details about confidence scoring of facts, part-of-speech tagging of words in a corpus, and entity extraction.

It specifically looks at Miles Davis, John Coltrane, and New York and uses coreference resolution to understand pronouns in documents better.

A text graph generated according to the patent may also be linked to the data graph.

The patent tells us that linking may occur through entity resolution or determining which entity from the data graph matches a noun phrase in a document.

We are returned to the idea of using mentions in SEO with statements like this from the patent:

Matches may receive a mention link between the entity and the noun phrase, as shown by links 210 and 210′ of FIG. 2.

This is different from the links we see in HTML but is worth keeping an eye on. The patent tells us about the relationships between nodes and edges like this in a data graph:

Edge represents an edge from the data graph entity to the noun phrase in the document. Edge′ represents the reverse edge, going from the noun phrase to the entity.

Thus, as demonstrated in FIG. 2, the edges that link the data graph to the text graph may lead from the entity to the noun-phrase in a forward direction and from the noun-phrase to the entity reverse direction.

Of course, forward Edge may have a corresponding reverse edge, and reverse Edge′ may have a corresponding forward edge, although these edges are not shown in the figure.

The patent describes the use of confidence scores and feature weights for trusting entities, telling us about training using this system:

In some implementations, the training engine may be configured to use a text graph generated by the syntactic-semantic parsing engine from crawled documents linked to the data graph to generate training data for the machine learning module.

The training engine may generate the training data from random, path-constrained walks in the linked graph.

The random walks may be constrained by a path length, meaning that the walk may traverse at most a maximum number of edges.

Using the training data, the training engine may train a machine learning module to generate multiple weighted features for a particular relationship, or in other words, to infer paths for a particular relationship.
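The random, path-constrained walks that generate this training data could be sketched as follows, assuming a simple adjacency-list graph representation (the patent does not specify one):

```python
import random

def path_constrained_walk(graph, start, max_edges):
    """One random walk over the linked graph, stopped after at most
    `max_edges` edges. `graph` maps node -> list of (edge_label, neighbor);
    this layout is an assumption for illustration."""
    path, node = [], start
    for _ in range(max_edges):
        if not graph.get(node):
            break  # dead end: no outgoing edges
        edge_label, node = random.choice(graph[node])
        path.append(edge_label)
    return path  # the sequence of edge labels is a candidate feature
```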

A feature generated by the machine learning module is a walk in the data graph alone or in the combination of the data graph and text graph.

For instance, if entity A is related to entity B by edge t1, and B is related to entity C by edge t2, A is related to C by the feature {t1, t2}.

The feature weight may represent confidence that the path represents a fact.

The patent shows us a positive training example that teaches the machine learning algorithm to infer the profession of a person entity based on the professions of other persons mentioned in conjunction with the query person.

See the featured image of this blog post, which includes people and mentions of the professions of those people (available below now, too). The patent tells us that such a feature may appear as {Mention, conj, Mention−1, Profession}, where Mention represents the mention edge that links the data graph to the text graph, conj is an edge in the text graph, Mention−1 represents the mention edge that links the text graph back to the data graph, and Profession is an edge in the data graph that links an entity for a person to an entity representing a profession.

[Featured image: Querying a Data Graph]

We are then told in the patent:

If a person entity in the data graph is linked to a professional entity in the data graph by this path or feature, the knowledge discovery engine can infer that the data graph should include a professional edge between the two entities.

The feature may have a weight that helps the knowledge discovery engine decide whether or not the edge should exist in the data graph.
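As an illustration of following that {Mention, conj, Mention−1, Profession} path, here is a hedged sketch over a toy graph structure; the edge labels and dict layout are assumptions for readability:

```python
def infer_professions(graph, person):
    """Follow the {Mention, conj, Mention^-1, Profession} path feature
    described above. `graph` is a plain dict: node -> {edge_label: [nodes]};
    edge labels and the layout are illustrative, not the patent's."""
    def step(node, label):
        return graph.get(node, {}).get(label, [])

    inferred = set()
    for phrase in step(person, "mention"):             # data graph -> text graph
        for other in step(phrase, "conj"):             # within the text graph
            for entity in step(other, "mention_inv"):  # text graph -> data graph
                inferred.update(step(entity, "profession"))
    return inferred  # candidate Profession edges, to be weighted by the model
```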

We also learn of examples with the machine learning module being trained to map the queries for “spouse,” “wife,” “husband,” “significant other,” and “married to” to various paths in the data graph, based on the training data.

Those queries may be clustered so that the machine learning module may be trained for clusters of queries.

And the queries may refer to a cluster of queries with similar meanings.

The patent provides many examples of how a data graph about several entities can be learned using the examples above. Such training can then be used to answer queries from the data graph. In addition, the patent tells us that it can use information from sources other than the internet, such as a document-based index, and may combine the results from the data graph with the results from the document-based index.

This patent also has a large section on how Google may expand a data graph. The process sounds much like the one I described when I wrote about entity extraction, which I linked to above. We are told that a data graph could involve learning from millions of documents.

The patent also has a section on associating inferred tuples with confidence scores using the Machine Learning module. It also tells us about checking the confidence score for the inferred tuples against a threshold.

Purpose of Querying a Data Graph Using Natural Language Queries

This patent tells us how a data graph could be created to identify entities and the tuples associated with them. Google could build a data graph, understand confidence scores between those entities and the facts related to them, and understand similar entities with similar attributes, then use those data graphs to answer queries about all of those entities. This approach benefits from reading the Web and collecting information about entities and facts about them as it comes across them. I have summarized many aspects of the patent and recommend reading it to learn about its details in more depth. Finally, I wanted to describe how this system learns from the web it crawls and builds upon that knowledge to answer queries that people ask.

I suspect that we will come across many more patents that describe related approaches that a search engine might use to better understand the world through what it reads.

SEO Turns to Data Graphs to Learn About the Web is an original blog post first published on Go Fish Digital.

Google Knowledge Panel Interface Patent https://gofishdigital.com/blog/knowledge-panel-interface/ Tue, 17 Nov 2020 19:29:48 +0000

Managing Knowledge Panels at Google

A patent from Google shows a planned approach for people from a business to manage a knowledge panel that may appear when someone searches for their business.

The patent tells us that “entity information can be provided on a webpage,” and visitors to search results can view details about that entity. The patent points out the example of a business entity, and tells us that entity information may be provided that includes:

  • An address of the business
  • User-provided reviews of the business
  • Pictures of the business

It provides some details of how such entity information might benefit someone associated with that entity, such as an owner, and that they can “access, edit, and/or update entity information that is stored by an entity listing portal interface on a search results page.”


When that associated person is logged in during a search session, they can provide a search query for the owner’s entity. The search engine will access entity information from an entity listing portal interface.

The search engine can

  1. Determine that the search query is related to a particular entity
  2. Determine that the user providing the search query is the owner of the entity or another user who is closely associated with the entity

That closely associated person does not have to log into a separate system to access entity information for the entity, which saves time and computational resources and avoids a separate, second authentication step.

We see tools to manage entity profile information in Google My Business for business entities, but having access to them directly in search results can be more convenient.

[Image: Google My Business interface]

The patent tells us that these are the actions taken in the process it describes:

  • Receiving a query from a searcher
  • Identifying an entity associated with that query
  • Determining that the searcher who submitted the query is associated with the entity
  • In response to that determination, providing search results that include search results responsive to the query and an interface which allows edits to data associated with the entity

An advantage of the entity interface described in this patent is that a searcher who has been given the ability to update entity information in a knowledge panel, such as updating address and contact information or business hours, or adding posts and more, would not need to log into another interface such as Google My Business.

This patent can be found at:

Generation of enhanced search results
Inventors: Ram Brijesh Jagadeesan, Camille McMorrow, and Ranjith Jayaram
Assignee: Google LLC
US Patent: 10,831,845
Granted: November 10, 2020
Filed: April 30, 2018

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving, at a search engine, a search query submitted by a user to the search engine; identifying an entity that is associated with the search query; determining that the user that submitted the search query to the search engine is associated with the entity in an entity listing portal; and in response to determining that the user that submitted the search query to the search engine is associated with the entity in an entity listing portal, providing a search results webpage that includes i) search results that the search engine generated responsive to the search query and ii) an interface through which edits to data associated with the entity within the entity listing portal can be provided.

Take Aways about this Knowledge Panel Interface

Presently, on my phone, if I search for my name, I am shown an interface that allows me to make suggestions to update my Knowledge Panel. Of course, Google wants to review any possible changes before finalizing them, but this interface does exist. The screen below shows the knowledge panel interface. This seems to be the interface for an entity that isn't necessarily related to a local business verified at Google My Business. Google maintains more control this way over an entity that hasn't been through the Google My Business verification process by accepting suggestions for changes rather than direct changes.

[Image: Knowledge panel interface for Bill Slawski]

If I search for SEObythesea, I get a slightly different knowledge panel interface that allows me to edit my business profile, change my hours, or add photos. There is a "Promote" section that allows me to review performance, advertise, add photos, or add an update, offer, or event. There is a "Customers" section that lets me look at reviews (and respond to them) and look at questions and provide answers to them as well. As noted in that interface, only managers of the profile can see it.

[Image: Knowledge panel interface for SEObythesea]

So there are knowledge panel interfaces for entities that you may be associated with.

These don’t impact how a site may rank in search results. Still, they show how Google is trying to manage their site to enable people who manage information about entities to use a knowledge panel interface to update the information shown in those panels – and it can be important that Google contains up-to-date information for searchers. In addition, they allow more control to people who have been identified as managing a business entity.

I like patents that describe some of the nuts-and-bolts operations behind the scenes at Google. This one shows us the strategies behind how Google keeps knowledge panels updated.

Google Knowledge Panel Interface Patent is an original blog post first published on Go Fish Digital.

Selecting Candidate Answer Passages https://gofishdigital.com/blog/selecting-candidate-answer-passages/ Fri, 23 Oct 2020 01:57:16 +0000

Ranking Passages at Google

The Google I/O developer's conference was called off this year, and the company held a Search On 2020 presentation instead. Several announcements were made about new features at Google, many of which were described in a post on the Google blog, How AI is powering a more helpful Google. In that post, we were told how Google may soon be ranking passages:

Passages
Precise searches can be the hardest to get right since sometimes the single sentence that answers your question might be buried deep in a web page. We've recently made a breakthrough in ranking and are now able to not just index web pages, but individual passages from the pages. By better understanding the relevancy of specific passages, not just the overall page, we can find that needle-in-a-haystack information you're looking for. This technology will improve 7 percent of search queries across all languages as we roll it out globally. With new passage understanding capabilities, Google can understand that the specific passage (R) is a lot more relevant to a specific query than a broader page on that topic (L).

Google has told us that passages and featured snippets are different, but the example of a passage that Google shows in their blog post looks exactly like a featured snippet. We are uncertain what the differences between the two may be, but we have received some information, as Barry Schwartz contacted Google about those differences and wrote about them in Google: Passage Indexing vs. Featured Snippets.


I have been writing about several patents from Google that have the phrase "answer passages" in their titles. I have referred to these answer passages as featured snippets, and I believe that they are. However, it isn't very clear how much the passages that Google is writing about differ from featured snippets. The chances are that we will be provided with more details within the next couple of months. In addition, we hear from Google that how they process passages can help them better understand the content they find on pages.

One of my posts covered an updated version of one of those patents and focused on the changes between the first version of the patent and the newer version. Still, I didn’t break down the process described in that patent in detail. It holds much information on how answer passages are selected from pages and whether they use structured or unstructured data.

After hearing of Google’s plans to focus more on ranking passages, it made sense to spend more time on this Google patent about answer passages, titled Scoring Candidate Answer Passages.

Before I do that, I would like to provide links to some of the other posts I’ve written about Google patents involving answer passages.

Google Patents about Answer Passages

When I first came across these patents, I attempted to put them into context. They appeared to be aimed at answering question queries with textual passages that offered more than just the facts responsive to those questions. They all describe such answers as ones that could be shown in an answer box when Google returns a set of search results for a query, often placed at the start of the organic search results.

These are the patents and posts that I wrote about them.

Candidate Answer Passages. This is a look at what answer passages are, how they are scored, and whether they are taken from structured or unstructured data. My post about this patent was Does Google Use Schema to Write Answer Passages for Featured Snippets?

Scoring Candidate Answer Passages. This post is a recent update to a patent about how answer passages are scored. The updated claims in the patent emphasize that both query dependent and query independent signals are used to score answer passages. This updated version of this patent was granted on September 22, 2020. The post I wrote about this update was Featured Snippet Answer Scores Ranking Signals. I focused on the query dependent and query independent aspects of how the process behind this patent worked.

Context Scoring Adjustments for Answer Passages. This patent tells us how answer scores for answer passages might be adjusted based on the context of where they are located on a page – it goes beyond what is contained in the text of an answer to what the page title and headings on the page tell us about the context of an answer. My post based on this patent was Adjusting Featured Snippet Answers by Context.

Weighted Answer Terms for Scoring Answer Passages was the last of these patents I covered. It was granted in July of 2018 and tells us how Google might match question queries to questions on pages and weight the answer terms used to score answer passages. My post about it was Weighted Answer Terms for Scoring Answer Passages.

When I wrote about the second of the patents I listed, I wrote about the updated version of the patent. So, instead of breaking down the whole patent, I focused on how the update told us that the process relies on more than query-dependent ranking signals, and that query-independent ranking signals are also important.

But the patent provides more details on what answer passages are. After reviewing it again, I felt it was worth filling in those details, especially in light of Google telling us that passages may have more importance in how it understands the content on pages.

The process Google might use to answer question-seeking queries may enable Google to understand the meaning of the content that it is indexing on pages. Let’s break down “Scoring Candidate Answer Passages” and see how it describes the creation of answer passages and the criteria for unstructured and structured answer passages (possibly the most detailed list of such criteria I have seen).

Scoring Candidate Answer Passages

At the heart of these answer passages patents are sentences like these:

Users of search systems often search for an answer to a specific question rather than a listing of resources. For example, users may want to know what the weather is in a particular location, a current quote for a stock, the capital of a state, etc. When queries in the form of a question are received, some search engines may perform specialized search operations in response to the question format of the query. For example, some search engines may provide information responsive to such queries in the form of an “answer,” such as information provided in the form of a “one box” to a question.

Some question queries are better served by explanatory answers, which are also referred to as “long answers” or “answer passages.” For example, for the question query [why is the sky blue], an answer explaining Rayleigh scatter is helpful. Such answer passages can be selected from resources that include text, such as paragraphs relevant to the question and the answer. Sections of the text are scored, and the section with the best score is selected as an answer.

The patent tells us that it follows a process that involves finding candidate answer passages from many resources on the web. Those answer passages contain terms from the query that identify them as potential answer responses. Each receives a query term match score, which is a measure of the similarity of the query terms to the candidate answer passage. This Scoring Candidate Answer Passages patent provides information about how answer passages are generated.

The patent also discusses a query dependent score for each candidate answer passage, based on the query term match score and an answer term match score (a measure of the similarity of answer terms to the candidate answer passage), as well as a query independent score.

We are told that one advantage of following the process from this patent is that candidate answer passages can be generated from both structured content and unstructured content – answers can be prose-type explanations and can also be a combination of prose-type and factual information, which may be highly relevant to the user’s informational need.

Also, query dependent and query independent signals are used (and I wrote about this aspect of the patent in my last post about it). The query-dependent signals may be weighted based on the set of most relevant resources, which may tend to produce answer passages that are more relevant than passages scored against a larger corpus of resources. Because the answer scores are based on a combination of query dependent and query independent signals, they can reduce processing requirements and more easily facilitate a scoring analysis at query time.
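To make that combination more concrete, here is a minimal Python sketch of how a query term match score and an answer term match score might be blended into a query dependent score and then combined with a query independent score. The function names, the similarity measure, and the weights are my own illustration – the patent does not spell out how the signals are combined.

```python
def term_match_score(terms, passage_tokens):
    """Fraction of the given terms found in the passage: a crude
    stand-in for the patent's 'measure of similarity' between a
    set of terms and a candidate answer passage."""
    if not terms:
        return 0.0
    passage = set(passage_tokens)
    return sum(1 for t in terms if t in passage) / len(terms)


def answer_score(query_terms, answer_terms, passage_tokens,
                 query_independent_score,
                 w_query=0.5, w_answer=0.3, w_independent=0.2):
    """Blend query dependent and query independent signals.

    The weights are illustrative assumptions; the patent only says
    the answer score is based on both kinds of signals."""
    query_dependent = (w_query * term_match_score(query_terms, passage_tokens)
                       + w_answer * term_match_score(answer_terms, passage_tokens))
    return query_dependent + w_independent * query_independent_score


# Scoring one candidate passage for [how far away is the moon]
passage = "the moon is an average of 238,855 miles from earth".split()
print(round(answer_score(
    query_terms=["far", "away", "moon"],
    answer_terms=["miles", "km", "distance"],
    passage_tokens=passage,
    query_independent_score=0.8,  # e.g., a page-quality style signal
), 3))
```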

This patent can be found at:

Scoring candidate answer passages
Inventors: Steven D. Baker, Srinivasan Venkatachary, Robert Andrew Brennan, Per Bjornsson, Yi Liu, Hadar Shemtov, Massimiliano Ciaramita, and Ioannis Tsochantaridis
Assignee: Google LLC
US Patent: 10,783,156
Granted: September 22, 2020
Filed: February 22, 2018

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for scoring candidate answer passages. In one aspect, a method includes receiving a query determined to be a question query that seeks an answer response and data identifying resources determined to be responsive to the query; for a subset of the resources: receiving candidate answer passages; determining, for each candidate answer passage, a query term match score that is a measure of similarity of the query terms to the candidate answer passage; determining, for each candidate answer passage, an answer term match score that is a measure of similarity of answer terms to the candidate answer passage; determining, for each candidate answer passage, a query dependent score based on the query term match score and the answer term match score; and generating an answer score that is based on the query dependent score.

Generating Answer Passages

This answer passage process starts with a query that is a question query that seeks an answer response and data identifying resources that are responsive to the query.

The answer passage generator receives a query processed by the search engine with data showing responsive results. Those results are ranked based on search scores generated by the search engine.

The answer passage process identifies passages in resources.

Passages can be:

  • A complete sentence
  • A portion of a sentence
  • A header
  • Content of structured data, such as a list entry or a cell value

For example, passages could be headers, sentences, and list entries.

A number of appropriate processes may be used to identify passages (see the sketch after this list), such as:

  • Sentence detection
  • Mark-up language tag detection
  • Etc.
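As a rough illustration of those detection steps, the sketch below splits rendered HTML into passage units, treating headers, list entries, and cell values as whole units while breaking running text into sentences. It is my own simplification, built only on the Python standard library; production sentence detection and mark-up handling would be far more robust.

```python
import re
from html.parser import HTMLParser


class PassageExtractor(HTMLParser):
    """Collect candidate passage units: headers, list entries, cell
    values, and the sentences inside ordinary text blocks."""

    UNIT_TAGS = {"h1", "h2", "h3", "li", "td"}

    def __init__(self):
        super().__init__()
        self.passages = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._stack and self._stack[-1] in self.UNIT_TAGS:
            # Headers, list entries, and cell values are whole units.
            self.passages.append(text)
        else:
            # Naive sentence detection for running text.
            self.passages.extend(
                s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
            )


extractor = PassageExtractor()
extractor.feed("<h2>About the Moon</h2><p>The moon orbits Earth. "
               "Its distance varies.</p><ul><li>Orbit: elliptical</li></ul>")
print(extractor.passages)
```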

Passage Selection Criteria

The answer passage process applies a set of passage selection criteria for passages.

Each passage selection criterion specifies a condition that a passage must meet to be included in a candidate answer passage.

Passage selection criteria may apply to structured content and then also to unstructured content.

Unstructured content is content displayed in the form of text passages, such as an article, and is not arranged according to a particular visual structure that emphasizes relations among data attributes.

Structured content is content displayed to emphasize relations among data attributes, such as lists, tables, and the like.

While a resource may be structured using mark-up language (like HTML), the usage of the terms “structured content” and “unstructured content” refers to the visual formatting of content for rendering, and to whether the arrangement of the rendered content follows a set of related attributes (e.g., attributes defined by row and column types in a table, with the content listed in the various cells of the table).
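One hedged way to approximate that distinction programmatically is to look at the mark-up that produces the visual arrangement. This heuristic is my own, not the patent’s definition:

```python
# Tags whose rendered output emphasizes relations among data attributes.
STRUCTURED_TAGS = {"table", "tr", "td", "th", "ul", "ol", "li", "dl", "dt", "dd"}


def content_kind(enclosing_tags):
    """Classify a passage unit by the tags it is rendered inside:
    tables and lists count as structured, everything else as
    unstructured prose."""
    return "structured" if STRUCTURED_TAGS & set(enclosing_tags) else "unstructured"


print(content_kind(["body", "table", "tr", "td"]))  # structured
print(content_kind(["body", "p"]))                  # unstructured
```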

An answer passage generator generates, from passages that satisfy the set of passage selection criteria, a set of candidate answer passages.

Each of the candidate answer passages is eligible to be provided as an answer passage in search results that identify the resources that have been determined to be responsive to the query but are shown separate and distinct from the other search results, as in an “answer box.”

After answer passages are generated, an answer passage scorer is used to score each passage.

An answer passage may be generated from unstructured content. That means the answer might be textual content from a page, like in this drawing from the patent, showing a question with a written answer:

Unstructured content answer

That answer passage is created from unstructured content, based on a query [How far away is the moon].

The query question processor identifies the query as a question query and also identifies the answer, “289,900 Miles (364,400 km).”

The answer box includes the answer.

The answer box also includes an answer passage that has been generated and selected by the answer passage generator and the answer passage scorer.

The answer passage is one of several answer passages processed by the answer passage generator and the answer passage scorer.

Search results are also provided on the search results page. The search results are separate and distinct from the answer passage that shows up in the answer box.

The patent shows a drawing of an example page from where the answer passage has been taken.

Candidate answer passage resource

This web page resource is one of the top-ranked resources responsive to the query [How far away is the moon] and was processed by the answer passage generator, which generates multiple candidate answer passages from the content of ranking resources. This particular resource includes multiple headings.

Headings have respective corresponding text sections that are subordinate to them. I wrote about how headings can give answer passages context, and how that context can adjust the scores of answer passages, in the post Adjusting Featured Snippet Answers by Context. And the title for the page (About the Moon) is considered a root heading, which provides context for all of the text sections on the page.

On this example page are many example passages, and the patent points a few of those out as candidate answer passages that might be used to answer a query about the distance from the Earth to the Moon:

(1) It takes about 27 days (27 days, 7 hours, 43 minutes, and 11.6 seconds) for the Moon to orbit the Earth at its orbital distance.
(2) Why is the distance changing? The moon’s distance from Earth varies because the moon travels in a slightly elliptical orbit. Thus, the moon’s distance from the Earth varies from 225,700 miles to 252,000 miles.
(3) The moon’s distance from Earth varies because the moon travels in a slightly elliptical orbit. Thus, the moon’s distance from the Earth varies from 225,700 miles to 252,000 miles.

The patent also tells us about answer passages that are made from structured content. The patent provides an example drawing of one that uses a table to answer a question:

structured content answer passages

The resource it is taken from includes the unstructured content and the table.

The patent points out that the table includes both columns and rows, expressing data that shows the relationship between airlines and baggage fees in terms of prices.

Structured and Unstructured Content Criteria for Answer Passages

content criteria answer passages

We have seen that both structured data and unstructured data can be used in answer passages. The patent describes the criteria that are considered when creating those passages. The drawing above shows passages using unstructured data and structured data being chosen as candidate answer passages.

An answer passage generator implements the process.

It begins with the selection of a passage for a candidate answer passage that is being generated.

For example, on the page above about the distance from the Earth to the Moon, a header or a sentence may be selected.

The type of passage selected – whether structured content or unstructured content – may determine what types of criteria are applied to it.

Passages that use Unstructured Content Criteria

The following are some of the things that may be looked for when deciding on unstructured content criteria.

A sentence score indicates whether a passage is a complete sentence.

If it is not a complete sentence, it may be omitted as a candidate answer passage, or more content may be added to the passage until a complete sentence is detected.

A minimum number of words is also required when deciding on unstructured content criteria.

If a passage does not have a minimum number of words, it may not be chosen as a candidate answer passage, or more content may be added to the passage until the minimum number of words is reached in the passage.

Visibility of the content is also required when deciding on unstructured content criteria.

This visibility of content criterion may also be used for a passage using structured content.

If the content is text that, when rendered, is invisible to a user, then it is not chosen as a candidate answer passage.

That content can be processed to see if visibility tags would cause it to be visible and be considered a candidate answer passage.

Boilerplate detection may also cause a passage not to be considered.

Google may decide not to show boilerplate content as an answer passage in response to a query question.

This boilerplate criterion may also be used for structured content.

If the content is text determined to be boilerplate, it would not be included in a candidate answer passage.

Several appropriate boilerplate detection processes can be used.

Alignment detection is another unstructured content criterion that gets reviewed.

If the content is aligned so that it is not next to or near other content already in the candidate answer passage (like two sentences that are separated by other sentences or a block of whitespace), it would not be included in the candidate answer passage.

Subordinate text detection is another type of unstructured content criterion used.

Only text subordinate to a particular heading may be included in a candidate answer passage (the heading wouldn’t apply to it otherwise.)

Limiting a heading to being only the first sentence in a passage is another unstructured content criterion used.

Image caption detection would also be an unstructured content criterion used.

A passage unit that is an image caption should not be combined with other passage units in a candidate answer passage.

Restricting how structured content and subsequent unstructured content can be combined in a candidate answer passage is another criterion used.

If a candidate answer passage includes a row from a table, then unstructured content from elsewhere on the page cannot be added to follow that row in the candidate answer passage.

Additional types of criteria for unstructured content can include:

  • A maximum size of a candidate answer passage
  • Exclusion of anchor text in a candidate answer passage
  • Etc.

If the process determines that the passage passes the unstructured content criteria, then the passage generation process includes the passage as unstructured content in a candidate answer passage being generated.
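Pulling several of those unstructured checks together, here is a minimal sketch of a filter that accepts or rejects a passage. The thresholds, the naive sentence test, and the flags for visibility, boilerplate, and captions are all illustrative assumptions rather than values from the patent:

```python
MIN_WORDS = 10  # illustrative minimum, not a number from the patent
MAX_WORDS = 60  # illustrative maximum candidate answer passage size


def is_complete_sentence(text):
    # Naive stand-in for the patent's sentence detection process.
    return text[:1].isupper() and text.rstrip().endswith((".", "!", "?"))


def passes_unstructured_criteria(passage, is_visible=True,
                                 is_boilerplate=False, is_caption=False):
    """Apply a few of the unstructured content criteria discussed above:
    visibility, boilerplate detection, sentence completeness, and
    minimum/maximum passage size."""
    if not is_visible or is_boilerplate or is_caption:
        return False
    if not is_complete_sentence(passage):
        return False
    return MIN_WORDS <= len(passage.split()) <= MAX_WORDS


candidate = ("The moon's distance from Earth varies because the moon "
             "travels in a slightly elliptical orbit.")
print(passes_unstructured_criteria(candidate))  # True
```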

Passages that use Structured Content Criteria

If a passage does not meet unstructured passage criteria, then the answer passage process determines whether the passage unit passes structured content criteria.

Some structured content criteria may be applied when only structured content is included in the answer passage. Some structured content criteria may be applied only when there is structured and unstructured content in the answer passage.

Incremental list generation is one type of structured content criterion.

Passages are iteratively selected from the structured content. Only one passage unit from each relational attribute is selected before any second passage unit from a relational attribute is selected.

This iterative selection may continue until a termination condition is met.

For example, when generating the candidate answer passage from a list, the answer passage generator may only select one passage from each list element (e.g., one sentence.)

This ensures that a complete list is more likely to be generated as a candidate answer passage.

Additional sentences are not included because a termination condition (e.g., a maximum size) was met, thus precluding the inclusion of the second sentence of the first list element.

Generally, in short lists, the second sentence of a multi-sentence list element is less informative than the first sentence. Thus the emphasis is on generating the list in order of sentence precedence for each list element.
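That iterative, one-sentence-per-element selection might look something like the following sketch. The word budget standing in for the termination condition is my own choice:

```python
def incremental_list_passage(list_elements, max_words=50):
    """Round-robin selection: take the first sentence of every list
    element before any element contributes its second sentence, so a
    complete list is more likely to fit in the candidate passage."""
    selected = [[] for _ in list_elements]
    total_words = 0
    sentence_index = 0
    while True:
        added = False
        for i, element in enumerate(list_elements):
            if sentence_index < len(element):
                sentence = element[sentence_index]
                words = len(sentence.split())
                if total_words + words > max_words:  # termination condition
                    return [" ".join(s) for s in selected]
                selected[i].append(sentence)
                total_words += words
                added = True
        if not added:  # every element is fully consumed
            return [" ".join(s) for s in selected]
        sentence_index += 1


# Each list element is given as its sentences, first sentence first.
elements = [
    ["Step one: preheat the oven.", "Use the middle rack."],
    ["Step two: mix the batter.", "Do not overmix."],
    ["Step three: bake for 30 minutes."],
]
print(incremental_list_passage(elements))
```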

Inclusion of all steps in a step list is another type of structured content criterion.

If the answer passage generator detects structured data that defines a set of steps (e.g., by detecting preferential ordering terms), all steps are included in the candidate answer passage.

Examples of such preferential ordering terms are terms that imply ordered steps, such as “steps,” “first,” “second,” etc.

If a preferential ordering term is detected, all steps from the structured content need to be included in the candidate answer passage.

If including all steps exceeds a maximum passage size, then that candidate answer passage is discarded.

Otherwise, the maximum passage size can be ignored for that candidate answer passage.
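An all-or-nothing rule like that might be sketched as follows; the list of ordering terms and the size cap are illustrative assumptions:

```python
import re

ORDERING_TERMS = {"step", "steps", "first", "second", "third"}  # illustrative


def step_list_passage(list_items, max_words=80):
    """If the list looks like ordered steps, include every step or
    discard the candidate entirely; never return a partial step list."""
    tokens = re.findall(r"[a-z]+", " ".join(list_items).lower())
    if not ORDERING_TERMS & set(tokens):
        return None  # not a step list; other criteria apply instead
    if sum(len(item.split()) for item in list_items) > max_words:
        return None  # discard: including all steps exceeds the size cap
    return list_items


steps = ["First, preheat the oven.", "Second, mix the batter.",
         "Third, bake for 30 minutes."]
print(step_list_passage(steps))
```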

Superlative ordering is another type of structured content criterion.

When the candidate answer passage generator detects a superlative query – one that asks about superlatives defined by an attribute – it selects from the structured content, for inclusion in the candidate answer passage, a subset of passages in descending ordinal rank according to that attribute.

For example, for the query [longest bridges in the world], a resource with a table listing the 100 longest bridges may be identified.

The candidate answer passage generator may select the rows for the three longest bridges.

Likewise, if the query were [countries with smallest populations], a resource with a table listing the 10 smallest countries may be identified.

The candidate answer passage generator may select the rows for the countries with the three smallest populations.
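Selecting a few rows in ordinal rank by the queried attribute might look like this sketch; the three-row cutoff follows the patent’s example, while the table representation is my assumption:

```python
def superlative_rows(table_rows, attribute, largest_first=True, top_n=3):
    """Pick the top rows of a table by the attribute that a
    superlative query asks about."""
    ranked = sorted(table_rows, key=lambda row: row[attribute],
                    reverse=largest_first)
    return ranked[:top_n]


bridges = [
    {"name": "Bridge A", "length_m": 1991},
    {"name": "Bridge B", "length_m": 3911},
    {"name": "Bridge C", "length_m": 1624},
    {"name": "Bridge D", "length_m": 2737},
]
# [longest bridges in the world] -> the three longest, longest first
print(superlative_rows(bridges, "length_m"))
# [countries with smallest populations] would use largest_first=False
```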

Informational question query detection is another type of structured content criterion.

When the candidate answer passage generator detects an informational question query, in which a query requests a set of information covering many attributes, the candidate answer passage generator may select the entire set of structured content if the entire set can be provided as an answer passage.

For example, for the query [nutritional information for Brand X breakfast cereal], a resource with a table listing the nutritional information of that cereal may be identified.

The candidate answer passage generator may select the entire table to include in the candidate answer passage.
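A hedged sketch of that whole-table inclusion rule, with a row limit of my own choosing standing in for whatever size constraint the real system applies:

```python
def informational_table_passage(table_rows, max_rows=20):
    """For an informational query spanning many attributes, include the
    entire table only if the whole set fits as an answer passage."""
    if len(table_rows) <= max_rows:
        return table_rows  # the complete table becomes the passage
    return None  # too large to provide as a single answer passage


nutrition = [("Calories", "120"), ("Total Fat", "1 g"), ("Calcium", "10% DV")]
print(informational_table_passage(nutrition))
```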

Entity attribute query detection is another type of structured content criterion.

When the candidate answer passage generator detects that a question query is requesting an attribute of a particular entity or a defined set of entities, a passage that includes an attribute value of the attribute of the particular entity or the defined set of entities is selected.

As an example, for the question query [calcium nutrition information for Brand X breakfast cereal], the candidate answer passage generator may select only the attribute values of the table that describe the calcium information for the breakfast cereal.

Key value pair detection is another type of structured content criterion.

When the structured content includes enumerated key-value pairs, then each passage must include a complete key-value pair.
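The key-value criterion could be checked with something as simple as this sketch, where a passage assembled from structured content is kept only if every pair it includes is complete:

```python
def passes_key_value_criterion(passage_pairs):
    """Every enumerated key-value pair in the passage must be complete:
    no key without a value, and no value without a key."""
    return all(key and value for key, value in passage_pairs)


print(passes_key_value_criterion([("Calories", "120"), ("Calcium", "10% DV")]))  # True
print(passes_key_value_criterion([("Calories", "120"), ("Calcium", "")]))        # False
```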

Additional types of criteria for structured content can include:

  • A maximum size of a candidate answer passage
  • Exclusion of anchor text in a candidate answer passage
  • Etc.

If this process determines the passage passes the structured content criteria, then the process includes the passage as structured content in candidate answer passages being generated.

If the process determines the passage does not pass the structured content criteria, the process may then determine if more content is to be processed for the candidate answer passage.

If there isn’t more content to be processed, then the process sends the candidate answer passages to the answer passage scorer.
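Tying the steps together, the overall generation loop the patent describes might be paraphrased as the sketch below, with the criteria functions from the earlier sketches standing in for the checks; the routing order is my reading of the patent’s flow:

```python
def generate_candidate_passages(passage_units, unstructured_ok, structured_ok):
    """Walk the passage units of a resource, routing each unit through
    the unstructured checks and then the structured checks; surviving
    candidates are handed to the answer passage scorer."""
    candidates = []
    for unit in passage_units:
        if unstructured_ok(unit):
            candidates.append(("unstructured", unit))
        elif structured_ok(unit):
            candidates.append(("structured", unit))
        # Otherwise, move on: more content may still be processed.
    return candidates  # next stop: the answer passage scorer


units = ["The moon is 238,855 miles away on average.", "row:Airline|Fee"]
print(generate_candidate_passages(
    units,
    unstructured_ok=lambda u: not u.startswith("row:"),
    structured_ok=lambda u: u.startswith("row:"),
))
```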

Query Dependent and Query Independent Ranking Signals

I have described how candidate answer passages may be selected. Once they are, they need to be scored. In the post I wrote about the update to this patent, Featured Snippet Answer Scores Ranking Signals, I described in detail the query dependent and query independent ranking signals from the patent. Here, I wanted to cover the other details from the patent that I hadn’t addressed in that post.

Other patents I have written about involving answer passages describe the scoring process and how those scores may be adjusted based on the context of headings on the pages where those passages appear.

Selecting Candidate Answer Passages is an original blog post first published on Go Fish Digital.

]]>
https://gofishdigital.com/blog/selecting-candidate-answer-passages/feed/ 0