create a small balanced training set with K shots per class
assign randomly initialized weights to the box prediction networks for novel classes
fine-tune only the box classification and regression networks
use a smaller learning rate (reduced by a factor of 20)
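
The first and last steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `(item_id, class_label)` annotation format, the names `sample_k_shot` and `fine_tune_lr`, and the example base rate of 0.02 are all hypothetical, and "reduced by 20" is read here as division by 20.

```python
import random
from collections import defaultdict


def sample_k_shot(annotations, k, seed=0):
    """Return a balanced subset with exactly k items per class.

    `annotations` is a list of (item_id, class_label) pairs. This input
    format is an assumption made for the sketch, not taken from the notes.
    """
    rng = random.Random(seed)

    # Group item ids by their class label.
    by_class = defaultdict(list)
    for item_id, label in annotations:
        by_class[label].append(item_id)

    # Draw k items without replacement from every class.
    subset = {}
    for label, items in by_class.items():
        if len(items) < k:
            raise ValueError(
                f"class {label!r} has only {len(items)} items, need {k}"
            )
        subset[label] = rng.sample(items, k)
    return subset


def fine_tune_lr(base_lr, factor=20):
    """Learning rate for fine-tuning, assuming 'reduced by 20' means base_lr / 20."""
    return base_lr / factor


if __name__ == "__main__":
    # Toy data: 10 "cat" items and 20 "dog" items (hypothetical).
    anns = [(i, "cat") for i in range(10)] + [(i, "dog") for i in range(10, 30)]
    print(sample_k_shot(anns, k=3))
    print(fine_tune_lr(0.02))
```

Sampling the same K from every class is what keeps the set balanced, even when the classes have very different numbers of examples.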
Cosine similarity
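
For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms, so it depends only on the angle between them and not on their magnitudes. The sketch below computes it in plain Python. The function name, the `eps` guard against zero-length vectors, and the reading of the two arguments as a box feature and a per-class weight vector are assumptions for illustration; the notes do not say how the score is used.

```python
import math


def cosine_similarity(x, w, eps=1e-12):
    """Cosine of the angle between vectors x and w (sequences of floats).

    `eps` is a small floor on the denominator, added here only to avoid
    division by zero when either vector has zero length.
    """
    if len(x) != len(w):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, w))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / max(norm_x * norm_w, eps)


if __name__ == "__main__":
    feature = [1.0, 2.0]          # hypothetical box feature
    class_weight = [2.0, 4.0]     # hypothetical class weight vector
    print(cosine_similarity(feature, class_weight))
```

Because both vectors are normalized, scaling either one leaves the score unchanged, and the result always lies in [-1, 1].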