Automatic identification of strokes in videos of tennis matches caught my attention recently. I was wondering if I could video my weekend matches and have some method of automatically indexing my strokes. I wanted to see if there was a pattern in the way I played – why did I lose or win? What sequence of shots performed best for me?

1 Introduction

Computer vision and visual recognition have seen significant investment and research over the last few decades. Deep neural networks are able to generalise to the point where individual strokes can be identified at a very granular level. As I don't have much experience with these techniques, the paper below was a nice segue into the topic.

https://cs230.stanford.edu/files_winter_2018/projects/6945761.pdf

Action Recognition in Tennis using Deep Neural Networks

The ‘THree dimEnsional TennIs Shots (THETIS) dataset: a sport based human action dataset comprised of the 12 basic tennis shots captured by Kinect’ was used as the training and test data in this paper. Essentially, it contains over 1,600 tagged videos of tennis strokes (albeit not filmed on a tennis court) across 12 classes, e.g. forehand, forehand slice, backhand, two-handed backhand, serve, etc.

The paper uses a pre-trained convolutional neural network (CNN) to extract visual features from an input – a series of video frames – and produce an output – the class of tennis stroke, e.g. backhand, forehand, etc. A model such as InceptionV3 has been pre-trained on millions of images, a dataset far too large to train on our local machines. This is what is referred to as transfer learning – we transfer the learned weights to our own problem.

2 Model and Method

For each video, I have chosen only a sample of the frames in the middle of the clip, to capture the key information of the tennis stroke and to keep computation time manageable (a sketch of this frame-extraction step is shown below). After saving the frames of each video, the RGB pixels are normalised using the Inception V3 preprocess_input function, to allow for faster neural network computations. Next, the useful Keras ImageDataGenerator allows a large dataset to be loaded sequentially in batches, which is especially useful when running Jupyter on a single CPU. We define two ImageDataGenerators: one for the training set, with a 10% validation split, and one for the test set. Roughly, a 70-10-20 train/validation/test split is used. As you can see, in absolute terms this results in 20,592 training images and 5,117 test images across 5 classes: backhand, backhand2hand, forehand, service, smash.
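The write-up doesn't show the frame-extraction step itself, so the following is only a minimal sketch of how the middle frames of each .avi could be saved with OpenCV. The paths, the frames-per-video count and the helper name extract_middle_frames are illustrative assumptions, not the notebook's actual code.

import os
import cv2  # OpenCV, assumed available for reading the .avi videos

def extract_middle_frames(video_path, out_dir, n_frames=14):
    """Save n_frames frames from the middle of a video as JPEGs (illustrative helper)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start = max((total - n_frames) // 2, 0)
    os.makedirs(out_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(video_path))[0]
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, start + i)
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"{name}_{i}.jpg"), frame)
    cap.release()

# e.g. extract_middle_frames('p9_smash_s3.avi', 'InputFrames/smash/')

The frames are written into one subdirectory per class so that flow_from_directory can infer the labels from the folder structure.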

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.inception_v3 import preprocess_input

# Frame dimensions fed into the network.
ROWS = 480
COLS = 600

# Training generator (with a 10% validation split) and a separate test generator,
# both normalising pixels with the InceptionV3 preprocess_input function.
train_idg = ImageDataGenerator(preprocessing_function=preprocess_input, validation_split=0.1)
test_idg = ImageDataGenerator(preprocessing_function=preprocess_input)

train_gen = train_idg.flow_from_directory(
    '/Users/adamwatson/Desktop/TENNIS/InputFrames/',
    target_size=(ROWS, COLS),
)
test_gen = test_idg.flow_from_directory(
    '/Users/adamwatson/Desktop/TENNIS/TestFrames/',
    target_size=(ROWS, COLS),
)
Found 20592 images belonging to 5 classes.
Found 5117 images belonging to 5 classes.

Now that our input data is prepared across the 5 classes, let's define our base Inception model and freeze its pre-trained ImageNet layers. Building the model sequentially, we first add a dropout layer with rate 0.1, which randomly drops activations at each training step and forces learning to spread across the network. Next, a GlobalAveragePooling2D layer reduces the feature maps to a more manageable size, which is then fed into a Dense softmax layer to output a probability for each class. What are we trying to minimise? A categorical cross-entropy loss function.

from tensorflow.keras.models import Sequential
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras import optimizers
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D

# Input shape of each frame: (480, 600, 3)
input_shape = (ROWS, COLS, 3)
nclass = len(train_gen.class_indices)

# Base InceptionV3 model with ImageNet weights and no classification head.
base_model = InceptionV3(weights='imagenet',
                         include_top=False,
                         input_shape=input_shape)
# Freeze the Inception layers (don't train these layers).
base_model.trainable = False

# Stack dropout, global average pooling and a softmax classifier on top.
add_model = Sequential()
add_model.add(base_model)
add_model.add(Dropout(rate=0.1))
add_model.add(GlobalAveragePooling2D())
add_model.add(Dense(nclass, activation='softmax'))

model = add_model
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(learning_rate=0.05,
                                       momentum=0.9),
              metrics=['accuracy'])
model.summary()
Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
inception_v3 (Functional)    (None, 13, 17, 2048)      21802784  
_________________________________________________________________
global_average_pooling2d_7 ( (None, 2048)              0         
_________________________________________________________________
dense_7 (Dense)              (None, 5)                 10245     
=================================================================
Total params: 21,813,029
Trainable params: 10,245
Non-trainable params: 21,802,784
_________________________________________________________________

The following combination of parameters was used in training the model; a sketch of the corresponding training call follows the list:

  • 10 epochs
  • 0.05 learning rate
  • 0.1 dropout rate
  • 128 batch size
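The post doesn't show the training call itself, so here is a minimal sketch of how it might look with the generators defined above. The validation generator val_gen is a hypothetical addition that realises the 10% validation split via subset='validation'; the batch size of 128 from the list would be passed as the batch_size argument of flow_from_directory (the earlier snippet uses the Keras default).

# Hypothetical validation generator using the 10% split configured on train_idg.
val_gen = train_idg.flow_from_directory(
    '/Users/adamwatson/Desktop/TENNIS/InputFrames/',
    target_size=(ROWS, COLS),
    subset='validation',
)

# Train the classification head for 10 epochs.
history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=10,
)

# Frame-level accuracy on the held-out test frames.
loss, acc = model.evaluate(test_gen)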
[Figure: model accuracy with increasing epochs]

The model achieved an accuracy of approximately 66% across individual test frames. Whilst this might seem low at first, we then classify each test video by taking the modal class across all of the test frames belonging to that video. As an example, the frame classifications for the smash video below were split across its 14 frames.

Video             backhand   backhand2hands   forehand   service   smash
p9_smash_s3.avi   5          0                2          1         6

Classification Counts

The class smash has the highest count, so the model predicts the correct class for the corresponding video (a sketch of this per-video aggregation is shown below). How does this perform across our full test dataset?
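The aggregation itself is straightforward. The following is a minimal sketch, assuming the trained model and test_gen from above, that the test generator was built with shuffle=False so predictions line up with test_gen.filenames, and that each frame's filename starts with its video name (as in p9_smash_s3_0.jpg). The helper names and filename convention are illustrative assumptions, not the notebook's actual code.

import os
from collections import Counter, defaultdict
import numpy as np

# Frame-level predictions for the whole test set (order matches test_gen.filenames
# only when the generator is created with shuffle=False).
test_gen.reset()
probs = model.predict(test_gen)
frame_preds = np.argmax(probs, axis=1)
class_names = list(test_gen.class_indices.keys())

# Group frame predictions by their source video and take the modal class.
votes = defaultdict(list)
for fname, pred in zip(test_gen.filenames, frame_preds):
    # Assumed filename convention: <video_name>_<frame_index>.jpg
    video = os.path.basename(fname).rsplit('_', 1)[0]
    votes[video].append(pred)

video_preds = {video: class_names[Counter(preds).most_common(1)[0][0]]
               for video, preds in votes.items()}
# e.g. video_preds['p9_smash_s3'] -> 'smash' for the example above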

3 Conclusion

The model achieved an accuracy of around 79% on the video test set, with the confusion matrix and per-class scores shown below.

[Figure: model confusion matrix]

                    backhand   backhand2hands   forehand   service   smash
Class Performance   70%        83%              76.3%      80.4%     54%

Classification Scores

The model appears to perform better on classes with more videos, which is expected behaviour. Smash strokes tended to be incorrectly marked as a service, due to the similar mechanics of the two strokes. Backhand had the lowest score of the three strokes with the largest datasets (backhand, forehand, service), with the model struggling to fully differentiate between single-handed and two-handed shots. Service had the highest score of the three, though it was unsurprisingly sometimes confused with smash strokes.

There are a number of improvements that could be made. First, after reviewing a large sample of the videos, it is clear that the quality of the stroke technique is poor, or at least unrepresentative of a professional's tennis swing. In some cases, the players weren't holding the racket correctly. Furthermore, the videos are all filmed on a basketball court, with people frequently playing basketball in the background, which muddies the input data. Second, I would look to use the skeletal THETIS dataset, which captures the movement of each player's skeleton. In tennis, the position of the arms relative to the rest of the body is very important in identifying a stroke, so I believe this would be a good addition to the input data and would improve the trained model. Finally, I would look to incorporate the lines of the tennis court into the model. The relative position of the player is important information when detecting tennis strokes: a right-handed player mostly hits a forehand in the right-hand section of the court, for example. The Jupyter notebook for this project can be found below:

https://github.com/watsonam/video-recognition
