Automatic speech-to-text transcription of broadcast audio is challenging because of the variability in the data. Preprocessing the broadcast audio is therefore necessary to achieve the desired transcription accuracy. Accordingly, three preprocessing tasks are performed in a pipeline: speech vs. music classification, clean speech vs. speech with background music classification, and enhancement of the segments that contain speech with background music. The output of the preprocessing stages is then passed to a speech recognizer, which yields better transcription accuracy than passing the broadcast audio directly to the recognizer without preprocessing. The recognizer's models are trained on clean speech, which is why these preprocessing steps are required: ideally, they deliver clean speech as input to the recognizer.
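
The control flow of the pipeline described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `transcribe_broadcast` and the four callables (`is_speech`, `has_background_music`, `enhance`, `recognize`) are hypothetical placeholders for the two classifiers, the enhancement stage, and the recognizer. It also assumes that the audio has already been split into segments and that segments classified as music are simply discarded, neither of which is stated explicitly in the text.

```python
from typing import Callable, Iterable, List, TypeVar

# A segment can be any representation of an audio chunk (samples, features, ...).
Segment = TypeVar("Segment")


def transcribe_broadcast(
    segments: Iterable[Segment],
    is_speech: Callable[[Segment], bool],             # hypothetical speech vs. music classifier
    has_background_music: Callable[[Segment], bool],  # hypothetical clean vs. mixed classifier
    enhance: Callable[[Segment], Segment],            # hypothetical speech enhancement stage
    recognize: Callable[[Segment], str],              # hypothetical recognizer trained on clean speech
) -> List[str]:
    """Route each segment through the preprocessing stages before recognition."""
    transcripts = []
    for segment in segments:
        # Stage 1: speech vs. music. Assumption: music segments are dropped.
        if not is_speech(segment):
            continue
        # Stage 2: clean speech vs. speech with background music.
        if has_background_music(segment):
            # Stage 3: enhance mixed segments so they resemble clean speech.
            segment = enhance(segment)
        # Final stage: the recognizer only ever sees (ideally) clean speech.
        transcripts.append(recognize(segment))
    return transcripts
```

Passing the stages in as callables keeps the routing logic independent of any particular classifier, enhancement method, or recognizer, so each stage can be swapped or evaluated separately.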