Data#
As data moves along the pipelines it’s easy to forget what fields are in the different structures stored, and with that we can only guess which structure we might need to implement a new feature or how we can optimise the existing structres. This becomes specially important when the structres are stored in binary type files that cannot be easily viewed. As such, this file contains information about each of the structures that are present in the poject.
Note
The data directory can be found in the conf/base.catalog.yml
file.
Newspapers ID (name: newspapers_id
)#
This is a JSON file that contains the Tweeter ID of the different users as well as their Twitter handle, which is the identifier used across the project for the newspaper.
Structure
{
"twitter handle": "twitter id"
}
Newspapers Raw tweets (name: newspapers_raw_tweets
)#
A folder containing JSON files straight form Twitter’s API. Each file corresponds to a week’s tweet from each newspaper.
Naming convention:
{year}w{week number}_data_{twitter handle}.json
Structure
{
"data": [
{
"edit_history_tweet_ids": [
"ID list if tweet has been edited"
],
"public_metrics": {
"retweet_count": "int: # retweets",
"reply_count": "int: # replies",
"like_count": "int: # likes",
"quote_count": "int: # quotes",
"bookmark_count": "int: # bookmarks",
"impression_count": "int: # of impressions"
},
"conversation_id": "int: ID of the conversation",
"created_at": "datetime ISO: Time created",
"text": "string: Tweet's text",
"id": "int: Tweet's ID",
"possibly_sensitive": "bool: Identifies a tweets as possibly sensitive or not"
},
]
}
Raw Data (name: raw_data
)#
Raw data compiled into Dataframes
, saved into Feather
format,
one per week.
Naming Convention: data_raw-({year}, {week}).feather
Structure
# |
Column |
Type |
---|---|---|
0 |
index |
int64 |
1 |
edit_history_tweet_ids |
object |
2 |
created_at |
datetime64[ns, UTC] |
3 |
id |
object |
4 |
conversation_id |
object |
5 |
possibly_sensitive |
bool |
6 |
text |
object |
7 |
retweet_count |
int64 |
8 |
reply_count |
int64 |
9 |
like_count |
int64 |
10 |
quote_count |
int64 |
11 |
bookmark_count |
int64 |
12 |
impression_count |
int64 |
13 |
referenced_tweets |
object |
14 |
newspaper |
object |
15 |
year |
UInt32 |
16 |
week |
UInt32 |
17 |
year_week |
object |
Clean Data (name: clean_data
)#
Dataframes
after going through the cleaning_and_preprocessing
pipeline, saved as Feather
files.
Naming Convention: data_clean-({year}, {week}).feather
Structure
# |
Column |
Dtype |
---|---|---|
0 |
index |
int64 |
1 |
edit_history_tweet_ids |
object |
2 |
created_at |
datetime64[ns, UTC] |
3 |
id |
object |
4 |
conversation_id |
object |
5 |
possibly_sensitive |
bool |
6 |
text |
object |
7 |
retweet_count |
int64 |
8 |
reply_count |
int64 |
9 |
like_count |
int64 |
10 |
quote_count |
int64 |
11 |
bookmark_count |
int64 |
12 |
impression_count |
int64 |
13 |
referenced_tweets |
object |
14 |
newspaper |
object |
15 |
year |
UInt32 |
16 |
week |
UInt32 |
17 |
year_week |
object |
18 |
mentions |
object |
19 |
hasthags |
object |
20 |
text_clean |
object |
Corpus (name: corpus
)#
Dataframes
after the first step in the feature_engineering
pipeline. Contains only the original text and the cleaned corpus.
Naming Convention: corpus-({year}, {week}).feather
Structure
# |
Column |
Type |
---|---|---|
0 |
index |
int64 |
1 |
id |
object |
2 |
created_at |
datetime64[ns, UTC] |
3 |
newspaper |
object |
4 |
text |
object |
5 |
corpus |
object |
Data DTM (name: data_dtm
)#
Dataframes
wit columns containing the data necesary to perform NLP
analysis as well as build a Document-Term-Matrix
. The format
selected was pickle
because the Dataframes
contain objects that
cannot be serialized into a feather
format.
Naming Convention: data_dtm-({year}, {week}).pkl
Structure
# |
Column |
Type |
---|---|---|
0 |
index |
int64 |
1 |
id |
object |
2 |
created_at |
datetime64[ns, UTC] |
3 |
newspaper |
object |
4 |
text |
object |
5 |
corpus |
object |
6 |
doc |
object |
7 |
token |
object |
8 |
lemma |
object |
DTM (name:dtm
)#
Dataframes
where the index contains the ID
of each tweet and the
columns correspond to the words in each tweet. The format selected is
Feather
, as the cell values are counts of the repetitions to the
word.
Naming Convention: dtm-({year}, {week}).feather
Structure
id |
… words … |
---|---|
… |
… |
DTM Newspaper (name: dtm_newspaper
)#
Dataset with Dataframes
are stored in a folder, one per week. Each
Dataframe
has columns corresponding to the handles of each
newspaper, and the index corresponds to the words used in that week. The
values are the weekly count of each words.
Naming Convention: dtm_newspaper-({year}, {week}).feather
Structure
word |
|
---|---|
… |
… |
Sentiment and Emotion analyzer#
`PySentimiento
<https://github.com/pysentimiento/pysentimiento>`__
models stored as pickle objects.
Name: sentiment_analyzer
Name: emotion_analyzer
Corpus Sentiment-Emotion (name: corpus_sentimen-emotion
)#
Collection of Dataframes
after sentiment and emotion analysis. The
Dataframes
contain the probabilities of the corpus of being
POSITIVE
, NEGATIVE
or NEUTRAL
as well as the different
emotions.
Naming Convention:
corpus_sentiment_emotion-({year}, {week}).feather
Corpus Topic (name: corpus_topic
)#
Collection of Dataframes
after Topic Modeling has been performed.
Naming Convention: corpus_topic-({year}, {week}).feather