Matching cancer ontology with recurrent neural network

For anyone who has tried to build an oncology database or an automated pipeline, one of the tasks is to define a standardized cancer type terminology. Luckily, some organizations have made the effort to do this, such as OncoTree, established by MSK. Still, it is painful to manually match each incoming cancer type to the closest term in your oncology pool. Perhaps we can apply neural machine translation to automate this term alignment. Specifically, this article focuses on an attention-based recurrent neural network with long short-term memory.

Because this is supervised learning, we first need to generate a dataset of (input term, translated term) pairs, using a tool like Word2Vec.
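To make the shape of such a dataset concrete, here is a minimal sketch. It does not use Word2Vec; instead it assumes a small hand-written synonym table (the `SYNONYMS` dictionary and the `build_pairs` helper are hypothetical names, and the terms are illustrative only) and expands it into training pairs.

```python
# Hypothetical synonym table: standardized term -> free-text variants
# that might show up in incoming data. The terms are illustrative only.
SYNONYMS = {
    "lung adenocarcinoma": ["adenocarcinoma of the lung", "lung adeno"],
    "breast carcinoma": ["carcinoma of breast", "breast cancer"],
}


def build_pairs(synonyms):
    """Return a list of (input term, target term) pairs for supervised training."""
    pairs = []
    for target, variants in synonyms.items():
        for variant in variants:
            # Normalize the incoming variant; the target stays as defined.
            pairs.append((variant.lower().strip(), target))
        # Map the standardized term to itself so exact matches are covered too.
        pairs.append((target, target))
    return pairs


pairs = build_pairs(SYNONYMS)
for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

In a real pipeline the variants would come from curated mappings or from nearest neighbours in an embedding space rather than from a hand-written table.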

Recurrent neural network (RNN)


A recurrent neural network is suitable for this kind of task. Let's first see how an RNN differs from a typical neural network.

The figure above shows that a typical feed-forward network only passes signals forward through its layers (backpropagation sends error signals backward, but only during training), while an RNN also passes signals between the activation nodes of the same hidden layer across time steps. This feature gives the RNN a "memory": information gained from previous inputs is passed on to the current activation node.

We can see from the formula and figure above that the hidden state at time step t is a function of the current input multiplied by its weight matrix, plus the hidden state of the previous time step multiplied by its own hidden-state-to-hidden-state matrix. This "HHM" feature allows the RNN to determine its output based on previous inputs. RNNs are therefore good at applications with time dependencies, such as context modeling, natural language processing and machine translation. Here we simply change the task from translating between different languages to translating between synonyms.
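The recurrence described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the article's exact formula: the tanh activation, the names `W_xh` and `W_hh`, and the toy weights are all chosen here for demonstration, and the bias term is omitted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_xh, W_hh):
    """Compute h_t = tanh(W_xh * x_t + W_hh * h_prev) for one time step."""
    from_input = matvec(W_xh, x_t)      # contribution of the current input
    from_memory = matvec(W_hh, h_prev)  # contribution of the previous hidden state
    return [math.tanh(a + b) for a, b in zip(from_input, from_memory)]


# Toy weights (assumed values): 2-dimensional input, 2-dimensional hidden state.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.9, 0.0], [0.0, 0.9]]

h = [0.0, 0.0]
for x_t in ([1.0, 0.0], [0.0, 1.0], [0.0, 0.0]):
    h = rnn_step(x_t, h, W_xh, W_hh)
    print(h)
```

The last input is all zeros, yet the hidden state stays non-zero: what remains is carried over from the earlier inputs through `W_hh`, which is the "memory" discussed above.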

Vanishing gradients

The problem with this "memory" feature is that, when the input sequence is long, the many repeated multiplications make the contribution of early inputs fade as the information is passed along, and the corresponding gradients shrink toward zero during training.
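A small numerical sketch can show the effect. It assumes a scalar RNN with no input, h_t = tanh(w_hh * h_{t-1}), and tracks how strongly the final hidden state still depends on the initial one; the function name and the weight 0.9 are illustrative choices.

```python
import math


def gradient_through_time(w_hh, steps, h0=0.5):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w_hh * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_hh * h)
        # Chain rule: d tanh(u)/du = 1 - tanh(u)^2, and u = w_hh * h_{t-1}.
        grad *= w_hh * (1.0 - h * h)
    return grad


for steps in (5, 20, 50):
    print(steps, gradient_through_time(0.9, steps))
```

Every step multiplies the gradient by a factor below 1, so the printed values shrink quickly as the number of steps grows.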

Here we need to make a choice:

  • Input as words ('lung' or 'carcinoma' as a single input): most input sequences can then be limited to four or five steps, which keeps the vanishing to a minimum. However, we need a large training set that covers all the terms, because the model cannot generate new words.

  • Input as letters ('l' or 'u' as a single input): we don't have to worry about the "new word" problem. However, the sequences become so long that the vanishing is no longer tolerable.
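The trade-off between the two options can be sketched as follows. The helper names, the example term and the tiny vocabulary are hypothetical; the `<unk>` placeholder is one common convention for unseen tokens, not something prescribed by this article.

```python
def word_tokens(term):
    """Split a term into lowercase words."""
    return term.lower().split()


def char_tokens(term):
    """Split a term into lowercase characters (spaces included)."""
    return list(term.lower())


def encode(tokens, vocab, unk="<unk>"):
    """Replace tokens missing from the vocabulary with an unknown marker."""
    return [tok if tok in vocab else unk for tok in tokens]


term = "Lung squamous cell carcinoma"
words = word_tokens(term)
chars = char_tokens(term)
print(len(words), len(chars))

# A word vocabulary that has only seen two words loses the rest of the term.
word_vocab = {"lung", "carcinoma"}
print(encode(words, word_vocab))

# A character vocabulary of letters plus space covers any new word.
char_vocab = set("abcdefghijklmnopqrstuvwxyz ")
print(encode(chars, char_vocab) == chars)
```

For this term the word-level sequence has 4 steps while the character-level sequence has 28, which is why the character option avoids unknown words but stretches the distance over which the network has to remember.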

Long short-term memory (LSTM)

Long short-term memory is a widely used way to mitigate vanishing gradients. For each input, it creates a cell with gates that control the information flow. The current input and the recurrent information from previous inputs (check the blue ball in the top figure) enter the cell as well as each gate. Based on these inputs, the gates then decide whether to write new information into the cell, read the stored information out to the next step, or forget it (just like computer memory).
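The gating can be sketched with a scalar LSTM step in plain Python. This follows the standard LSTM formulation rather than anything specific to this article; the gate names, the `params` layout and the weight values are assumptions picked so that the forget gate stays close to 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step; params maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x_t + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to read out
    g = math.tanh(pre("candidate"))   # candidate value to write
    c_t = f * c_prev + i * g          # additive update of the cell state
    h_t = o * math.tanh(c_t)
    return h_t, c_t


# Assumed toy parameters: a large forget bias keeps the memory almost intact.
params = {
    "forget": (0.0, 0.0, 5.0),
    "input": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = lstm_step(1.0, 0.0, 0.0, params)  # write something into the cell
c_after_write = c
for _ in range(20):                      # then feed 20 empty inputs
    h, c = lstm_step(0.0, h, c, params)
print(c_after_write, c)
```

Because the cell state is updated by addition and scaled only by the forget gate, most of the stored value survives the 20 empty steps, whereas a plain RNN would multiply it through a squashing activation at every step.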