The Kalman filter (KF) is a widely used algorithm for tracking dynamical systems that can be faithfully captured by state space (SS) models. The need to fully describe an SS model limits its applicability in complex settings, e.g., when tracking based on visual or graphical data. This challenge can be addressed by mapping the measurements into latent features that obey some postulated closed-form SS model, and applying the KF in the latent space. However, the validity of this approximate SS model may constitute a limiting factor. In this work we tackle the challenges associated with tracking from high-dimensional measurements by jointly learning the KF along with the latent space mapping. Our proposed approach combines a learned encoder with tracking in the latent space using the recently proposed data-driven KalmanNet, with both modules jointly tuned from data. Our empirical results demonstrate that the proposed approach achieves improved performance over both model-based and data-driven techniques, by learning a surrogate latent representation that best facilitates tracking.
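The baseline pipeline described above (encode each high-dimensional measurement into a latent feature, then run a KF on a postulated latent SS model) can be illustrated with a minimal sketch. It is not the proposed method: a fixed linear projection stands in for the learned encoder, and a classical linear KF stands in for the learned, data-driven filter. All names (`kf_step`, `track_in_latent_space`, `A`, `encoder`) and the constant-velocity toy model are assumptions made for illustration only.

```python
import numpy as np

def kf_step(x, P, z, F, H, Q, R):
    """One predict/update step of a classical linear Kalman filter."""
    # Predict the next state and its covariance from the postulated dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the latent observation z.
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

def track_in_latent_space(measurements, encoder, F, H, Q, R, x0, P0):
    """Encode each measurement, then filter in the latent space."""
    x, P = x0, P0
    estimates = []
    for y in measurements:
        z = encoder(y)                        # high-dimensional y -> latent feature z
        x, P = kf_step(x, P, z, F, H, Q, R)
        estimates.append(x)
    return np.array(estimates)

# Toy setup (assumed): constant-velocity state [position, velocity].
rng = np.random.default_rng(0)
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])                    # latent feature observes position
Q = 1e-4 * np.eye(2)

# Hypothetical 64-dimensional observation: y = A * position + noise.
sigma = 0.1
A = rng.standard_normal((64, 1))
A_pinv = np.linalg.pinv(A)

def encoder(y):
    # Stand-in for a learned encoder: least-squares projection onto position.
    return A_pinv @ y

# Noise variance of the latent feature induced by the projection.
R = sigma**2 * np.linalg.inv(A.T @ A)

# Simulate a noise-free trajectory and its noisy high-dimensional measurements.
x_true = np.array([0.0, 0.1])
truth, ys = [], []
for _ in range(50):
    x_true = F @ x_true
    truth.append(x_true.copy())
    ys.append(A @ x_true[:1] + sigma * rng.standard_normal(64))

estimates = track_in_latent_space(ys, encoder, F, H, Q, R,
                                  x0=np.zeros(2), P0=np.eye(2))
print("final position error:", abs(estimates[-1, 0] - truth[-1][0]))
```

In this sketch the encoder and the latent SS model are fixed by hand, which is exactly the point where a mismatch between the postulated model and the true latent dynamics can degrade tracking. The approach in this work instead tunes the encoder and the filter jointly from data.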