{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Anoche me reí mucho con el hashtag [#LaGente](https://twitter.com/search?q=%23lagente&src=typd), que se viralizó mientras Alejandro Fantino entrevistaba, una vez más, al inefable candidato presidencial Sergio Massa. \n", "\n", "
"Me comprometo no ante vos, ante #LaGente". Lo harta que me tiene este tipo no tiene nombre.
— Fenicia (@zeinicienta) marzo 12, 2015
\n",
"\n",
"\n",
"Es insoportable la cantidad de frases hechas y lugares comunes que tira @SergioMassa. Demagogo berreta. #LaGente
— Javier Smaldone (@mis2centavos) marzo 12, 2015
\n",
"\n",
"\n",
"Quiero contar la cantidad de veces que +a dice #LaGente
— memorex (@memorex) marzo 12, 2015
\n",
"\n",
"\n",
"Me acordé entonces de un [post de Zulko](http://zulko.github.io/blog/2014/06/21/some-more-videogreping-with-python/), cuyo blog es un compilado de gemas ñoñamente divertidas. Allí muestra cómo recortar automáticamente los pedacitos de un video que mencionen una palabra o frase, basándose en las marcas de tiempo del archivo de subtítulos, utilizando su maravillosa biblioteca [Moviepy](http://zulko.github.io/moviepy/) y un poco de Python. Más o menos lo que hace [videogrep](https://github.com/antiboredom/videogrep), pero más prolijo. \n",
"\n",
"La herramienta [youtube-dl](http://rg3.github.io/youtube-dl/) (que también es genial y hecha en Python), permite no sólo bajar videos de youtube y los subtitulos existentes, sino que también puede bajar el \"subtitulo automático\". En general son bastante malos pero es suficientemente efectivo para encontrar pequeñas frases. \n",
"\n",
"## Todo sea por \"la gente\": manos a la obra\n",
"\n",
"Lo primero que necesitamos es una lista de videos donde Sergio Massa hable. Hice una [búsqueda](https://www.youtube.com/results?search_query=entrevista+sergio+massa), decidí ignorar algunos (parodias, por ejemplo) y generé una lista. Hay varias maneras de obtener este listado de las primeras paginas de resultados, yo utilicé el rústico y efectivo webscrapping:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['https://www.youtube.com/watch?v=8pP8G3fSAcY',\n",
" 'https://www.youtube.com/watch?v=g6QSwxUo1aw',\n",
" 'https://www.youtube.com/watch?v=_9FN6CI8fD4',\n",
" 'https://www.youtube.com/watch?v=5wqwNDpkZOo',\n",
" 'https://www.youtube.com/watch?v=V865E4mBiHU',\n",
" 'https://www.youtube.com/watch?v=TPrGNJnMS9U',\n",
" 'https://www.youtube.com/watch?v=SVTl11hG9Gs',\n",
" 'https://www.youtube.com/watch?v=Df_dwb5XHQM',\n",
" 'https://www.youtube.com/watch?v=sptBkyfq1VU',\n",
" 'https://www.youtube.com/watch?v=tzjz1xrNu3k',\n",
" 'https://www.youtube.com/watch?v=k-CGbuOo8do',\n",
" 'https://www.youtube.com/watch?v=_L-B_wHsEec',\n",
" 'https://www.youtube.com/watch?v=iFOABIQdo9Q',\n",
" 'https://www.youtube.com/watch?v=WOlRIKGrBWY',\n",
" 'https://www.youtube.com/watch?v=a-mCgN6W9ek',\n",
" 'https://www.youtube.com/watch?v=x5vhchv3zAY',\n",
" 'https://www.youtube.com/watch?v=bi5eK7i59w0',\n",
" 'https://www.youtube.com/watch?v=VNHV3D_6o4E',\n",
" 'https://www.youtube.com/watch?v=MWVZ6JDU9V8',\n",
" 'https://www.youtube.com/watch?v=v-JmdgVZqVc',\n",
" 'https://www.youtube.com/watch?v=FBFHpdxsyYU',\n",
" 'https://www.youtube.com/watch?v=WXmTc83l1sQ',\n",
" 'https://www.youtube.com/watch?v=GfNgds5vS60',\n",
" 'https://www.youtube.com/watch?v=UHRa34A6rDg',\n",
" 'https://www.youtube.com/watch?v=xVU-EjnuksU',\n",
" 'https://www.youtube.com/watch?v=-IXymTZZM6o',\n",
" 'https://www.youtube.com/watch?v=tzvwDTPyTHQ',\n",
" 'https://www.youtube.com/watch?v=a19z6EVWpQ4',\n",
" 'https://www.youtube.com/watch?v=rAOvF8X_nzM',\n",
" 'https://www.youtube.com/watch?v=wtvl4esdMGU',\n",
" 'https://www.youtube.com/watch?v=1YPHDDH1Az0',\n",
" 'https://www.youtube.com/watch?v=w7TnghsrJUo',\n",
" 'https://www.youtube.com/watch?v=qBT-6HpSrwc',\n",
" 'https://www.youtube.com/watch?v=JM-xblTxLGc',\n",
" 'https://www.youtube.com/watch?v=kMymsVsmETY',\n",
" 'https://www.youtube.com/watch?v=K1-dfiVfbOI',\n",
" 'https://www.youtube.com/watch?v=VnoiHVlR-So',\n",
" 'https://www.youtube.com/watch?v=hMTzJyLiXE4',\n",
" 'https://www.youtube.com/watch?v=VGQPNQ1Bhkg',\n",
" 'https://www.youtube.com/watch?v=0oR4z7SsY14',\n",
" 'https://www.youtube.com/watch?v=Cl4r8h_Hlak',\n",
" 'https://www.youtube.com/watch?v=gJFmek-YgYo',\n",
" 'https://www.youtube.com/watch?v=9VQ7Ov5W_tM',\n",
" 'https://www.youtube.com/watch?v=rKwKImVrYu4',\n",
" 'https://www.youtube.com/watch?v=LJwj9SHC9EU',\n",
" 'https://www.youtube.com/watch?v=-08OEpFThiw',\n",
" 'https://www.youtube.com/watch?v=BPJBl5y2P2g',\n",
" 'https://www.youtube.com/watch?v=MvkXlg9ZbL4',\n",
" 'https://www.youtube.com/watch?v=7KgIa4fX_Ng',\n",
" 'https://www.youtube.com/watch?v=upNLrHtzeBI',\n",
" 'https://www.youtube.com/watch?v=Y-norf1BKAs',\n",
" 'https://www.youtube.com/watch?v=QMvAl_fxQSA',\n",
" 'https://www.youtube.com/watch?v=3os_uXUOvcM',\n",
" 'https://www.youtube.com/watch?v=ZE_aChIEELo',\n",
" 'https://www.youtube.com/watch?v=iKI-8ceuR-A',\n",
" 'https://www.youtube.com/watch?v=CASdYLquQII',\n",
" 'https://www.youtube.com/watch?v=5cvyi1CcpYs',\n",
" 'https://www.youtube.com/watch?v=NVEw-YIAy5A',\n",
" 'https://www.youtube.com/watch?v=yMXn04-GQTY',\n",
" 'https://www.youtube.com/watch?v=RCCzZGcGg5k',\n",
" 'https://www.youtube.com/watch?v=FqMFKGsXLOE',\n",
" 'https://www.youtube.com/watch?v=MVOvQb8KBm0',\n",
" 'https://www.youtube.com/watch?v=ENvWfMwnJ_0',\n",
" 'https://www.youtube.com/watch?v=bs7xGm293Vs',\n",
" 'https://www.youtube.com/watch?v=7OvrK-U-axI',\n",
" 'https://www.youtube.com/watch?v=VHeWqPqs4vo',\n",
" 'https://www.youtube.com/watch?v=nVOEi9FESn8',\n",
" 'https://www.youtube.com/watch?v=eikTWAvFwTE',\n",
" 'https://www.youtube.com/watch?v=BU2amn3QdWk',\n",
" 'https://www.youtube.com/watch?v=GiB1pOuEvqg',\n",
" 'https://www.youtube.com/watch?v=GAPN17lTJ9c',\n",
" 'https://www.youtube.com/watch?v=4Ja1uZbMM8E',\n",
" 'https://www.youtube.com/watch?v=F1dAfCR4rc0',\n",
" 'https://www.youtube.com/watch?v=334O9xh-CQY',\n",
" 'https://www.youtube.com/watch?v=KgNmw3sJ0g8',\n",
" 'https://www.youtube.com/watch?v=-SQSue4-PLk',\n",
" 'https://www.youtube.com/watch?v=HPE4PHlYySo']"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pyquery import PyQuery\n",
"links = []\n",
"skip = ('M0yuFHbhYLY','TLmMh9Qvmic', 'rY4Hwvn6GlA')\n",
"\n",
"for page in range(1, 5):\n",
" pq = PyQuery('https://www.youtube.com/results?search_query=entrevista+sergio+massa&page=%s' % page)\n",
" pq.make_links_absolute()\n",
" links.extend([pq(a).attr('href') for a in pq('a.yt-uix-tile-link') if pq(a).attr('href').split('v=')[1] not in skip])\n",
"links"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Luego, el paso lento: bajar los videos. Al parecer, Youtube no genera un subtitulo automático para videos demasiado largo, así que limité hasta 30 minutos. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for link in links:\n",
" !youtube-dl --write-auto-sub --sub-lang es --max-filesize 30.00m {link}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Con el material crudo disponible (aunque puede ser que no se hayan encontrado subtitulos para todos los videos), podemos copiar descaradamente partes del código de Zulko (levemente adaptado)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import re \n",
"import os\n",
"import glob\n",
"import random\n",
"from moviepy.editor import VideoFileClip, concatenate, TextClip, CompositeVideoClip\n",
"\n",
"\n",
"def convert_time(timestring):\n",
" \"\"\" Converts a string into seconds \"\"\"\n",
" nums = [float(t) for t in re.findall(r'\\d+', timestring)]\n",
" return 3600 * nums[0] + 60*nums[1] + nums[2] + nums[3]/1000\n",
"\n",
"\n",
"def get_time_texts(file):\n",
" with open(file) as f:\n",
" lines = f.readlines()\n",
"\n",
" times_texts = []\n",
" current_times , current_text = None, \"\"\n",
" for line in lines:\n",
" times = re.findall(\"[0-9]*:[0-9]*:[0-9]*,[0-9]*\", line)\n",
" if times != []:\n",
" current_times = [convert_time(t) for t in times]\n",
" elif line == '\\n':\n",
" times_texts.append((current_times, current_text))\n",
" current_times, current_text = None, \"\"\n",
" elif current_times is not None:\n",
" current_text = current_text + line.replace(\"\\n\",\" \")\n",
" return times_texts\n",
"\n",
"def find_word(word, times_texts, padding=.4):\n",
" \"\"\" Finds all 'exact' (t_start, t_end) for a word \"\"\"\n",
" matches = [re.search(word, text)\n",
" for (t,text) in times_texts]\n",
" return [(t1 + m.start()*(t2-t1)/len(text) - padding,\n",
" t1 + m.end()*(t2-t1)/len(text) + padding)\n",
" for m,((t1,t2),text) in zip(matches, times_texts)\n",
" if (m is not None)]\n",
"\n",
"\n",
"def get_subclips(video_path, cuts): \n",
" video = VideoFileClip(video_path)\n",
" return [video.subclip(start, end) for (start,end) in cuts]\n",
"\n",
"\n",
"def get_all_subclips_for(word, pattern='*.mp4', sub_ext='.es.srt', shuffle=True):\n",
" subclips = []\n",
" for mp4 in glob.glob(pattern):\n",
" sub = os.path.splitext(mp4)[0] + sub_ext\n",
" try:\n",
" times = find_word(word, get_time_texts(sub))\n",
" except IOError:\n",
" # ignore video if it hasn't subtitle\n",
" continue\n",
" cuts = get_subclips(mp4, times)\n",
" subclips.extend(cuts)\n",
" if shuffle:\n",
" random.shuffle(subclips)\n",
" return subclips\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"La función `get_all_subclip` recibe la frase a buscar y devuelve un listado de segmentos donde, muy probablemente, se pronuncia.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"77"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gente = get_all_subclips_for('la gente')\n",
"len(gente)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"El problema es que aunque es muy probable que sea Sergio Massa el que diga \"la gente\" en sus entrevistas, a veces es el entrevistador, a veces youtube entendió mal al desgrabar y a veces el código recortador la pifia. Por este motivo hay que descartar los segmentos que no sirven. \n",
"\n",
"Se me ocurrió hacerlo visualmente: los pegué todos, superponiendo el índice al que corresponde cada segmento, para luego anotar los que no sirven y filtrarlos en otra pasada. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def make_preview(subclips):\n",
" subclips_ = []\n",
" for (i, clip) in enumerate(subclips):\n",
" txt_clip = TextClip(str(i),fontsize=70, color='white')\n",
" txt_clip = txt_clip.set_pos('center').set_duration(clip.duration)\n",
" clip = CompositeVideoClip([clip, txt_clip])\n",
" subclips_.append(clip)\n",
"\n",
" final = concatenate(subclips_, method='compose')\n",
" final.write_videofile('preview.webm', codec='libvpx', fps=24) \n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[MoviePy] >>>> Building video preview.webm\n",
"[MoviePy] Writing audio in previewTEMP_MPY_wvf_snd.ogg\n",
"[MoviePy] Done.\n",
"[MoviePy] Writing video preview.webm\n",
"[MoviePy] Done.\n",
"[MoviePy] >>>> Video ready: preview.webm \n",
"\n"
]
}
],
"source": [
"make_preview(gente)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"El [resultado](http://youtu.be/LU76nVlBqdE) me permitió hacer el tamizado"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ignore = [2, 3, 8, 12, 17, 19, 25, 28, 32, 36, 38, 40, 41, 44, 49, 55, 56, 61, 62, 66, 73, 74]\n",
"subclips_cleaned = [i for j, i in enumerate(gente) if j not in ignore]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aunque no tengo idea de edición de videos, y porque de verdad creo que es un demamogo impresentable que no debería presidir ni una junta vecinal, quería darle un toque final, con una pequeña frase"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from moviepy.video.tools.segmenting import findObjects\n",
"\n",
"def arrive(screenpos,i,nletters):\n",
" v = np.array([-1,0])\n",
" d = lambda t : max(0, 3-3*t)\n",
" return lambda t: screenpos-400*v*d(t-0.2*i)\n",
"\n",
"screensize = (640,360)\n",
"txtClip = TextClip('Yn tragr ab rf obyhqn'.decode('rot13'), color='white', font=\"Amiri-Bold\", kerning=5, fontsize=50)\n",
"cvc = CompositeVideoClip( [txtClip.set_pos('center')],\n",
" size=screensize)\n",
"\n",
"letters = findObjects(cvc) # a list of ImageClips\n",
"\n",
"def moveLetters(letters, funcpos):\n",
" return [ letter.set_pos(funcpos(letter.screenpos,i,len(letters)))\n",
" for i,letter in enumerate(letters)]\n",
"\n",
"ending = CompositeVideoClip(moveLetters(letters, arrive), size=screensize).subclip(0, 10)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[MoviePy] >>>> Building video massa_lagente_final.webm\n",
"[MoviePy] Writing audio in massa_lagente_finalTEMP_MPY_wvf_snd.ogg\n",
"[MoviePy] Done.\n",
"[MoviePy] Writing video massa_lagente_final.webm\n",
"[MoviePy] Done.\n",
"[MoviePy] >>>> Video ready: massa_lagente_final.webm \n",
"\n"
]
}
],
"source": [
"# le damos una mezcladita más\n",
"random.shuffle(subclips_cleaned)\n",
"subclips_cleaned.append(ending)\n",
"make_final(subclips_cleaned, 'massa_lagente_final.webm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Y este es el resultado:\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}